EngineSmith's Blog

Engineering Craftsman

Archive for the ‘Operations’ Category

SSD Rules

Posted by EngineSmith on June 22, 2011

Running your own servers with freaking bad-ass disk I/O can cut your infrastructure complexity, and is sometimes a life saver for a startup. We have been using Intel X25-E SSD drives as our only MySQL storage for a year now, and it is simply amazing and cost-effective. If you want to make your life a living hell dealing with turtle-speed I/O, tons of EC2 instances and constant failures, try EC2.

Here is a video talking about the big win of SSD. My favorite quote: “50TB SSD in one machine for $80K, and you can fucking skip sharding”.

Posted in Engineering, Hardware, Operations, Startup | 2 Comments »

“Be sexy, go down”

Posted by EngineSmith on April 18, 2011

This video from the MySQL 2011 conference is hilarious, and sadly so true in this industry: if you don’t have downtime, people think you are not sexy (at least not trying to solve hard problems). If you go down all the time, and publicly talk about the “hard” problems you are facing and all the sexy new products you are trying, no matter how ridiculous it is, you win. Scale fail is the new PR.

A successful operations team should be quiet, to the extent that you don’t even realize they exist. Unfortunately, that’s also when people start taking them for granted.

Posted in Operations | Leave a Comment »

V is for Victory: Vertica and VoltDB

Posted by EngineSmith on March 25, 2011

Both are Michael Stonebraker’s startups, and we recently adopted both. The experience so far? Amazing.

We set up a 6-node Vertica cluster in one week, including the data-loading process as well as Tableau report generation. The simplicity and efficiency are just mind-blowing; compared to our previous failed Hadoop-based analytics project, this one is a breeze. Of course, it has a price tag, but frankly, TOTALLY worth it! Much better than spending several engineers on it for months and still getting a half-baked, super complicated and almost violent Hadoop analytics platform (Hadoop is not for the weak-minded, small-budgeted, resource-limited startup).

  • Last week, by mistake, we loaded 10 billion rows of duplicated data into our Vertica cluster. It was still running, though a bit slow. 🙂
  • The rich analytic functions are super powerful. A path analysis (analyzing the page/click flow among all users) takes just 4 seconds over 60M rows.

If you are a startup seriously considering analytics, try Vertica before you waste all your money/resources on Hadoop/ETL/data-warehouse solutions. They are not bad products, just too complicated. With Vertica’s powerful feature set and linear scalability, you can simplify your data flow significantly. You will realize the star schema is overrated: you can write super complicated but still blazing-fast queries over a de-normalized schema. SQL is just lovely (while map-reduce is painful and awkward; one lesson we learned is: who can verify that the map-reduce code IS doing the right thing anyway?).

By the way, if anybody tries to deal with big data using a solution that can only run on one physical machine: stop the joke.

VoltDB is a proper ACID SQL database on steroids. Who said you have to sacrifice consistency for scalability? We have had enough bad experiences with bad NoSQL products (see my previous posts); VoltDB is simply a god-send. Of course you have to lose something (a schema change requires restarting the whole cluster – they are working to improve this – and you have to write stored procedures instead of ad-hoc SQL), but I think those trade-offs are totally reasonable: honestly, who will die if you shut down your site for 10 minutes a week for maintenance?

VoltDB is an in-memory-only database (using k-safety and snapshots for redundancy) with proven linear scalability. We considered Redis for persistence for a bit (it is also in-memory, with replication and snapshots); however, Redis keeps swinging between supporting clustering (transparent sharding) and disk persistence (to me, a total disaster to go in that direction). Since that is not settled, and manual sharding is a big no-no, we settled on VoltDB.

We decided to pay for VoltDB support even though the community edition is perfectly enough for our purposes (we do operations through our own scripts instead of the web GUI anyway). We also really wanted to contribute back, to keep up the wonderful work.

Posted in Operations, Software | 2 Comments »

MySQL JDBC Multiple Query Trap

Posted by EngineSmith on January 20, 2011

This was a pretty big lesson recently. We are using the MySQL JDBC driver against a 5.1 Percona build. For a simple use case, we decided to use a transaction (though we know transactions don’t scale well). A stored procedure could have been used, but we chose a trick in the JDBC driver instead: allowMultiQueries. http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html

Basically, the following statements are put together as one iBatis query: begin; insert into A …; update B where ….; end;
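A minimal sketch of what that wrapping looked like (the table names, columns and URL here are made up for illustration; the actual Connector/J property is allowMultiQueries):

```java
// Hypothetical sketch: the transaction markers and both statements travel
// to the server as ONE string, which allowMultiQueries=true permits.
public class MultiQueryDemo {
    static final String URL =
        "jdbc:mysql://db-host:3306/app?allowMultiQueries=true";

    static String transferSql(long fromId, long toId, int amount) {
        return "begin;"
             + " insert into A (user_id, amount) values (" + toId + ", " + amount + ");"
             + " update B set balance = balance - " + amount
             + " where user_id = " + fromId + ";"
             + " end;";
    }

    public static void main(String[] args) {
        // In the real code this string was handed to iBatis / a JDBC
        // Statement; here we only build and print it.
        System.out.println(transferSql(1, 2, 100));
    }
}
```

The trap is that the connection comes from a pool: if the server ever interprets the transaction boundary differently than expected, later queries on the same pooled connection silently join the still-open transaction.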

This worked perfectly fine for several months, until one day, after a new release, our MySQL suddenly started having trouble: it began throwing deadlock errors. Tons of random queries were being grouped together into big transactions, causing deadlocks. We scratched our heads for a long time, since absolutely nothing related to this transaction logic had changed in that release, nor did we touch MySQL or change the JDBC driver. It seems the transaction boundary was randomly extended beyond the two queries above, and since we use connection pooling (BoneCP), subsequent queries on the same connection were combined into big transactions. This is really terrible!

Eventually we took out the allowMultiQueries trick and did everything in two plain SQL steps (yeah, no transactions). To this day, we still don’t know exactly what triggered the problem, since it had worked fine for several months.

By the way, a side note about transactions. They sound like a perfect solution on paper; in reality, especially in the web world (where you have sharding and many database nodes), they don’t work. One way to look at it: modern hardware and networks are quite reliable nowadays, so compared to the cost of guaranteeing transactions, it is probably better to spend the time/money on ways to fix things afterwards (tooling, customer support, reconciliation) if a transaction is interrupted. That will also make your system much simpler and easier to scale.

Forgot to mention: in 2004, PayPal had an outage for almost a week because some smart guys introduced two-phase commit into their production system. A great idea for guaranteeing ACID, and a textbook example use case (banking transactions). Sadly, in reality, it doesn’t work.

Posted in Engineering, Operations, Software | 3 Comments »

Redis rocks!

Posted by EngineSmith on October 28, 2010

Several weeks ago I wrote Redis – Cache done right. Since then, we deployed a cluster of 6 Redis nodes in Production. The result: simply awesome!

It has been a smooth ride: billions of cache hits every day, very low CPU load. The application logic is super simple; no fancy locking is required, since lists/sets are native data structures with atomic operations. We implemented our own distributed hashing algorithm (consistent hashing) to distribute keys across the cluster (borrowed from memcached’s Java client). Our MySQL database load has dropped a lot, and we now also cache lots of Facebook call results (since their API sucks: high failure rate, and 2-10 second average response times).

By the way, we haven’t used snapshots, the append-only log, etc. yet; for now it is just a read-only cache. Later on we will use that fancy stuff as well.

The only issue we ran into: we forgot to set maxmemory, so Redis happily hit the physical memory limit, then swap, and crashed the whole server. 🙂 Since CPU load is low, we now run 3+ instances on each physical machine to form 3 individual clusters.
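For anyone repeating this setup, the guard is a couple of lines in redis.conf. A sketch with made-up values (size it below physical RAM; the eviction-policy directive may not exist in older releases, so check your version):

```
# Cap memory so Redis never reaches swap; tune to your box.
maxmemory 24gb
# Evict least-recently-used keys instead of failing writes.
maxmemory-policy allkeys-lru
```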

To sum it up, love this guy’s twitter.

Posted in Engineering, Operations, Software | 4 Comments »

Sherweb: worst hosting experience

Posted by EngineSmith on September 30, 2010

IT is hard; getting email/calendar right is super painful, especially for a startup. Lessons learned.

Sherweb, which claims to be the largest Exchange hosting company, dropped its pants completely. After a whole weekend of outage and zero explanation, our corporate communication was close to dead. With 68 people waiting, we jumped to Google Apps. Not surprisingly, there are not many complaints about their lame service to be found (I guess they shut up most voices), and their network status page always says everything is fine.

  • You can never reach a human on their support phone line
  • You never get any kind of advance notice, or any explanation
  • You have to tell their customer support that things are not working for you. Oh, and you need to provide them a test account yourself so they can verify it
  • Every Friday night they have a 2-hour maintenance window
  • A 10+ hour outage is, to them, just whatever

Luckily we moved away from them. Don’t ever trust them!

Posted in Misc, Operations | 15 Comments »

How to argue against using cloud?

Posted by EngineSmith on September 12, 2010

I am constantly asked this question: why don’t we use the cloud? (Surprisingly, many of the askers are business/sales people; Amazon etc. really did a good job in PR, because I would never ask my VP of Sales, “did you try this sales methodology?”).

I used to argue from many different angles: complexity, no guarantee of service, slowness, security, etc. Then I got tired and found this simple, most effective way: just consider one MySQL instance (ours has 8 cores, 32GB of memory and purely Intel X25-E SSD drives, for about $8K). On Amazon EC2, this is at best comparable to:

“High-Memory Double Extra Large Instance 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform”

How much does it cost to run this on EC2 for 2 years? $1/hour * 24 * 365 * 2 = $17,520 (see EC2 pricing). Also, EC2 doesn’t save you personnel cost; you still need ops engineers (physically dealing with your own server takes maybe 2 extra hours compared to EC2, I guess 🙂).

To run my own server in a colo for two years, add $200/mo * 24 = $4,800 in operating costs (rack, power, cooling). That is still only $12,800 total. End of discussion.
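The arithmetic in the last two paragraphs, spelled out as a quick sanity check (the $1/hour, $8K server and $200/month figures are the post’s own assumptions, not current prices):

```java
public class CloudMath {
    // Two years of on-demand EC2 at a flat hourly rate.
    static int ec2TwoYearCost(double dollarsPerHour) {
        return (int) (dollarsPerHour * 24 * 365 * 2);
    }

    // Server purchase plus 24 months of rack/power/cooling.
    static int coloTwoYearCost(int serverCost, int monthlyOps) {
        return serverCost + monthlyOps * 24;
    }

    public static void main(String[] args) {
        System.out.println(ec2TwoYearCost(1.0));        // 17520
        System.out.println(coloTwoYearCost(8000, 200)); // 12800
    }
}
```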

Needless to say, I also get 4 more cores and at least 3x better I/O performance, which means fewer nodes are required, as well as simpler application logic and operations.

Oh, did I mention that EC2 is a wild zoo? Recently people have been telling me that they can’t keep a TCP connection open for more than 10 minutes there. Imagine what that does to your system.

Posted in Operations | 1 Comment »

Cassandra fiasco – NoSql kills

Posted by EngineSmith on September 12, 2010

We implemented one of our products purely on top of Cassandra back in March 2010, with version 0.5.1. It went live in May and tanked in the first week under crazy traffic. Three painful, sleepless weeks later, Cassandra finally dropped its pants and started corrupting our data. We shut down the site and spent three full days converting the backend to MySQL clusters. Since then, our product has grown steadily; now (September 2010) we have 8M+ users and 1M+ DAU (daily active users).

I don’t want to restate all the great benefits Cassandra promises; we were so excited by them. Unfortunately, if it seems too good to be true, it usually is:

  • We started with a 6-node Cassandra cluster: big mistake. Some people later suggested starting with 50+ nodes. 🙂 You will see why below. Yes, it may be web-scale, but not for a small startup, I guess.
  • We set quorum=2, replication=3, which means 5 nodes is the minimum. Unfortunately, during the initial weeks we lost one of our six nodes, so our ability to withstand node failures was pretty low.
  • Cassandra has a “binary-log”-like mechanism: it appends to a file (or files) until a size limit is reached, then it needs to compact that big file, during which the node is somewhat “unresponsive”. All nodes use the Gossip protocol to communicate, and if one node is unresponsive, all its neighbours think it is dead. (Even without node failures, we very often saw discrepancies in the ring topology from each node’s view.) When this compaction happened under high traffic, we saw cascading failures across the ring all the time.
  • Later, people suggested using separate disk partitions for the “binary-log” and the data. Well, that requires at least 4 disks on a server (2 mirrored for each, and you probably want another pair for the OS; I don’t think we are crazy enough to run on single disks), and I wouldn’t call that “commodity hardware” any more.
  • At the time (May 2010, version 0.5.1) there were very few I/O parameters you could tune in Cassandra, which makes you wonder: hmm, why has MySQL spent the last 10+ years building so many I/O tuning mechanisms?
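The quorum arithmetic behind the second bullet can be sketched in a few lines (this is the standard R + W > N reasoning, not Cassandra code; the numbers are the ones from our setup):

```java
public class QuorumMath {
    // With replication factor n, a read quorum r and write quorum w are
    // guaranteed to overlap on at least one replica whenever r + w > n.
    static boolean overlaps(int n, int r, int w) {
        return r + w > n;
    }

    // Writes at quorum w keep succeeding while at most n - w replicas
    // of a key are down.
    static int writeFailuresTolerated(int n, int w) {
        return n - w;
    }

    public static void main(String[] args) {
        // Our setup: replication=3, quorum=2 for both reads and writes.
        System.out.println(overlaps(3, 2, 2));            // true
        System.out.println(writeFailuresTolerated(3, 2)); // 1
    }
}
```

With only one replica failure tolerated per key, losing one node out of six already puts parts of the ring on the edge.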

I felt lucky we dumped it early on. At the time, people were asking us: why did Digg/Facebook/Twitter all jump on Cassandra while you guys ditched it? Maybe you are not competent enough? Turns out later that the reality said otherwise.

Recently we also hear, often: why don’t you use MongoDB? It does everything for you, and xxx is using it. If you pay just a little bit of attention, you will see that MongoDB sacrifices your data consistency for performance, big time, almost like cheating:

Enjoy this one: “/dev/null is web scale”.

Posted in Engineering, Operations | 20 Comments »

redis – cache done right

Posted by EngineSmith on September 11, 2010

We finally got into a situation where we needed to consider a cache system (our system is write-mostly, so caching was not originally a main concern). After some research, we settled on redis instead of memcached. Here are the many things redis has done right:

  • atomic operations – no more complicated locking logic in application code
  • support for lists, sets and other data structures – simpler application code
  • control over your cache’s TTL – plus guaranteed persistence, virtual memory and snapshots (crash recovery)
  • fast, slim and simple – it does one thing very well. The source code relies only on make and gcc, no other dependencies

There is no clustering support in redis itself (or in its Java client lib yet); however, it is not hard to write your own consistent hashing layer to spread keys over several redis instances. And I really hope redis won’t complicate itself in that direction (or will put it in a separate library, just like CouchDB uses Lounge; to me, “do one thing and do it really well” is the best way to go).
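Such a layer fits in one small class. A sketch in the spirit of memcached’s Java client; the node names, virtual-node count and hash function are illustrative assumptions, not our actual code:

```java
import java.util.*;

public class ConsistentHash {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int vnodes;

    public ConsistentHash(Collection<String> nodes, int vnodes) {
        this.vnodes = vnodes;
        for (String n : nodes) add(n);
    }

    public void add(String node) {
        // Place several virtual points per node for an even spread.
        for (int i = 0; i < vnodes; i++) ring.put(hash(node + "#" + i), node);
    }

    public void remove(String node) {
        // Only keys owned by the removed node get remapped.
        for (int i = 0; i < vnodes; i++) ring.remove(hash(node + "#" + i));
    }

    public String nodeFor(String key) {
        // Walk clockwise to the first virtual point at or after the key.
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static int hash(String s) {
        // FNV-1a for brevity; real clients typically use an MD5-based
        // scheme such as Ketama.
        int h = 0x811c9dc5;
        for (byte b : s.getBytes()) { h ^= b; h *= 0x01000193; }
        return h;
    }

    public static void main(String[] args) {
        ConsistentHash ch = new ConsistentHash(
            Arrays.asList("redis-1", "redis-2", "redis-3"), 100);
        System.out.println(ch.nodeFor("user:42"));
    }
}
```

The property that matters operationally: removing one instance remaps only the keys that lived on it, so the rest of the cache stays warm.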

Another candidate we considered was membase. It claims to have done everything (clustering, re-balancing, persistence), which is actually a bit scary. Just like our Cassandra fiasco in May 2010; I will write up that lesson a bit later.

We are going to roll redis into production next week; I will keep you posted on how it goes.

Posted in Operations, Software | 3 Comments »

MMM fail – MySQL Master Master

Posted by EngineSmith on September 11, 2010

We have been using MMM in production for over a year now to manage master-master MySQL clusters. It worked out pretty well for a while, until a couple of recent outages related to it made us realize its limitations.

MMM basically manages VIPs (virtual IPs) and associates roles with the MySQL nodes in a cluster. With two nodes, you configure two VIPs: one for the reader, one for the writer (if you do read-write splitting, which we did before). In case of failure, or if replication delay gets too long, MMM switches the VIP to point at the right host. So what’s the catch?

  • First of all, the VIP is configured on the MySQL node itself. To make the switch, MMM needs to SSH into the node and change its network config to release or take the VIP.
  • For the switch to happen, MMM also expects to issue some MySQL commands to the node (clean-up, set read-only, etc.)

What is wrong with this? MySQL nodes can fail, and when things fail, they can fail in ANY POSSIBLE WAY. We had two outages recently: once, due to a hardware failure, the MySQL node couldn’t accept any SSH connections; another time, the MySQL instance was hosed and simply held the MMM request without responding at all (it would have worked had it actually rejected the request). In both scenarios, MMM basically got stuck (its internal state was all messed up) and didn’t do its job. Of course, the outages happened at 3-5 AM. 😦

What is the lesson? The fail-over mechanism should NEVER rely on operations on the fail-able nodes, and its state should be kept outside them as well. The simpler the fail-over mechanism, the better off you are (the KISS rule). Here are some ideas we are still experimenting with:

  • Don’t use a VIP. Instead, let the application handle fail-over (i.e., configure two physical IP addresses; if one fails, use the other). A monitor script activates/de-activates the network switch port to implement fail-over (a hardware network path is guaranteed to be boolean). Most MySQL server nodes have two network ports: use one for application access and the other for replication/management.
  • Use LVS to manage the VIP. This still seems too complicated, and I am concerned about split-brain scenarios and what kind of availability it actually provides. The worst case is both MySQL nodes taking traffic without you even knowing about it.
  • Use hardware to manage the VIP: the simplest, most efficient and guaranteed way. We are considering an F5 BigIP. The issues are cost and bandwidth limits.
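The first idea above, application-side fail-over, can be sketched in a few lines (the addresses are made up, and the health check here is a plain lambda; a real one would attempt a MySQL connection with a short timeout):

```java
import java.util.*;
import java.util.function.Predicate;

public class FailoverPicker {
    // Try hosts in order and return the first one that passes the check.
    static Optional<String> pick(List<String> hosts, Predicate<String> healthy) {
        for (String h : hosts) {
            if (healthy.test(h)) return Optional.of(h);
        }
        return Optional.empty(); // both down: fail loudly, don't guess
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("10.0.0.1", "10.0.0.2");
        // Pretend the primary is unreachable.
        System.out.println(pick(hosts, h -> h.equals("10.0.0.2")).orElse("none"));
    }
}
```

The point of this design is that nothing has to run ON the failed node: the decision lives entirely on the client side, so a hosed server cannot wedge the fail-over logic the way it wedged MMM.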

We are still investigating; I will keep you posted. By the way, forgot to mention: we currently have 12 nodes in 6 clusters, all of them running on SSD drives, with a peak of 3-4 QPS per cluster.

Posted in Engineering, Operations | 2 Comments »