EngineSmith's Blog

Engineering Craftsman

Archive for September, 2010

Sherweb: worst hosting experience

Posted by EngineSmith on September 30, 2010

IT is hard, getting email/calendar right is super painful, especially for a startup. Lessons learned.

Sherweb, who claims to be the largest Exchange hosting company, dropped its pants completely. After a whole weekend’s of outage and zero explanation, our corporate communication is close to death. With 68 people waiting, we jumped to Google Apps. Not surprisingly, there are not many complaints about their lame service (I guess they shut-up most voices), and their network status is always everything is fine.

  • You can never reach any human on their support phone line
  • You never get any kind of ahead of time notice, or explanation
  • You have tell their customer support that things are not working for you. Oh, you need to provide them a test account yourself to verify that
  • Every Friday night they have a 2-hour maintenance window
  • A 10+ hour outage to them is just, whatever

Luckily we moved away from them. Don’t ever trust them!

Posted in Misc, Operations | 15 Comments »

WTF, $578M for the most expensive public school

Posted by EngineSmith on September 13, 2010

$578M for a public school. Check the luxury.

WTF. Who voted YES to the bonds to build such a public school? I really want to talk to them. No wonder California’s education system is collapsing. My gosh, look at what democracy has done.

“That $578 million cost is more expensive than the Bird’s Nest stadium built for the 2008 Olympic Games in Beijing, China, which cost $500 million.”

Posted in Kids | Tagged: | Leave a Comment »

How to argue against using cloud?

Posted by EngineSmith on September 12, 2010

I have been constantly asked the question: why don’t we use cloud? (Surprisingly also many from business/sales people, Amazon etc. really did a good job in PR because I would never ask my VP of Sales “did you try this sales methodology?”).

I used to argue on many different reasons, like complexity, no guarantee of service, slowness, security etc. Then I got tired, and found out this simple, most effective way, just consider one MySQL instance (we have 8-core, 32GB memory, purely Intel X-25E SSD drives for MySQL, about $8K). On Amazon EC2, this is at least comparable to this:

“High-Memory Double Extra Large Instance 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform”

How much does it cost on EC2 to run it for 2 years? $1/hour * 24 * 365 * 2 = $17,520. EC2 pricing. Also, EC2 doesn’t save your on personnel cost, you still need to have ops engineers (physically deal with your own server will take 2 extra hour than EC2 I guess 🙂

To run my own server in colo for two years, you might add $200/mo * 24 = $4,800 operation costs (rack, power, cooling). That is still only $12,800 total. End of discussion.

Needless to say, I have 4 more cores, and at least 3x better I/O performance, which also means less nodes are required, as well as simpler application logic and operations.

Oh, did I mention that EC2 is a wild zoo? Recently people are telling me that they can’t keep a TCP connection open for more than 10 minutes there. Imagine what that will do to your system.

Posted in Operations | Tagged: | 1 Comment »

Cassandra fiasco – NoSql kills

Posted by EngineSmith on September 12, 2010

We implemented one of our product purely on top Cassandra back in March 2010 with version 0.5.1. It went live in May, and tanked in the first week due to crazy traffic. Three painful, sleepless weeks later, Cassandra finally dropped its pants and started corrupting our data. We shutdown the site, spent three full days, converted backend to MySQL clusters. Since then, our product has grown steadily until now (September 2010), with 8M+ users and 1M+ DAU (daily active user).

I don’t want to restate all the great benefits Cassandra provides, we were so excited by that. Unfortunately, if it is too good to be true, it usually is:

  • We started with a 6-node Cassandra cluster, big mistake. Some people later suggested to start with 50+ nodes. 🙂 You will see the reason why below. Yes, it may be web-scale, but not for a small startup I guess.
  • We set quorum=2, replication=3, which means 5 nodes are minimum. Unfortunately, during the initial weeks, we lost one out of six node, so our ability to stand for node failures are pretty low.
  • Cassandra has a “binary-log” kind of mechanism, it appends to that file(s) until it reaches a limit. Then it needs to compress compact that big file, during which, this node is kind of “unresponsive”. All nodes uses a protocol “Gossip” to communicate, if one node is unresponsive, all its neighbours will think it is dead. (Even without node failure, very often we see discrepancy of the topology from each node’s view).  When this compression compaction happens at high traffic, we were seeing cascading failures across the ring all the time.
  • Later people suggested to use different disk partitions for “binary-log” and data. Well, that will require at least 4 disks on a server (mirrored 2 for each, and probably you will want another pair for OS, I don’t think we are crazy enough to run on single disks), and I wouldn’t call it a “commodity hardware” any more.
  • At the time (May 2010 version 0.5.1) there is only ONE very few I/O parameter you can tune for Cassandra, it makes you doubt: hmm, I wonder why MySQL spent the last 10+ years have those many I/O tuning mechanisms?

I felt lucky we dumped it early on, at the moment people were asking us: why Digg/Facebook/Twitter all jumped on Cassandra and you guys ditched it? Maybe you are not competent enough? Turns out later the reality is:

Recently we also hear often: why don’t you use MongoDB, it does everything for you, and xxx is using it. If you pay just a little bit attention, MongoDB sacrifice your data consistency for performance, big time, almost like cheating:

Enjoy this one, “/dev/null is web scale“.

Posted in Engineering, Operations | Tagged: | 20 Comments »

redis – cache done right

Posted by EngineSmith on September 11, 2010

Finally we got into the situation to consider a cache system (our system is write-most, thus cache was not originally the main concern). After some research, we settled on redis instead of memcached. Here are many things redis has done right:

  • atomic operations – no more complicated locking logic in application code
  • support list, set etc more data structures – simpler application code
  • control your cache’s TTL – guaranteed persistence, virtual memory, snapshot (crash recovery)
  • fast, slim and simple – do one thing very well. The source code only relies on make and gcc, no other dependencies

There is no clustering support in redis itself (or its Java client lib yet), however, it is not hard to write your own consistent hashing algorithm to use several redis instances together. And I really hope redis won’t complicate itself in that direction (or maybe make a separate library for it, just like CouchDB uses Lounge. To me, “do one thing and do it really well” is the best way to go).

Another candidate we considered was membase. It claimed to have done everything (clustering, re-balancing, persistence), which is actually a bit scary. Just like our Cassandra fiasco in May 2010, well, I will write up that lesson a bit later.

We are going to roll redis in production next week, will keep you posted about how it goes.

Posted in Operations, Software | Tagged: | 3 Comments »

Sorting, basics for a programmer?

Posted by EngineSmith on September 11, 2010

One of my favorite interview question is to write the simplest function to sort the characters in a string, no gimmicks, no tricks, just to see if the code is clean, compilable and easy to follow. i.e. a simple bubble sort will do, simple? Don’t laugh at it yet.

Surprisingly, there is a 80%+ failure rate (wrong algorithm, major errors in the code etc.) in the last 5 years within 100+ candidates, especially some very senior, architect level guys. Within the ones who finished it correctly, I probably hired most of them (of course, they have to pass other stuff too).

Several interesting observations:

  • Many senior guys got offended if they can’t get it right. The attitude shown is much more important than the question itself.
  • Most US educated engineers will use “merge sort” first (to be honest, the concept is simple, but the code is long, and many failed at merging two arrays of characters or recursions). Foreign engineers usually pick bubble sort, super simple, you can finish it in less than 10 lines.
  • So far only one guy wrote “quick sort” in front of me. He got it completely right, and can also briefly tell me why it is efficient. I think he is just well prepared, but still showed his efforts.
  • For some, sorting is such an alien concept. I wonder if I pick a guy from the street without any computer science education, he at least may be able to describe how he can sort a bunch of stuff. Many engineers couldn’t even start the thought process.

Here is an interesting thing about sorting, each algorithm has its own unique sounds!

Posted in Engineering, Software | Tagged: | 1 Comment »

MMM fail – MySQL Master Master

Posted by EngineSmith on September 11, 2010

We have been using MMM for over a year now in Production to manage master-master MySQL cluster. It worked out pretty well for a while, until recently we had couple outages related to it and realized its limitations.

MMM basically manages VIPs (virtual IP) and associates roles with MySQL nodes in the cluster. In case of two nodes, you configure two VIPs: one for reader, one for writer (if you do read-write split, which we did before). In case of failure, or replication delay is too long, MMM is going to switch the VIP to point the right host. So what’s the catch?

  • First of all, the VIP is configured on each MySQL node itself. In order to make the switch, MMM needs to SSH to the node, and change its network config to release or take the VIP.
  • In order for the switch to happen, MMM actually expects to issue some MySQL commands to the node (like clean up, or set read-only etc)

What is wrong with those? MySQL node can fail, when things fail, they can fail in ANY POSSIBLE WAYS. We had two outages recently, once due to a hardware failure, the MySQL node can’t accept any SSH connections; another time, the MySQL instance was hosed and it held the MMM request and didn’t respond at all (it would have worked if it actually rejected the request). In both scenarios, MMM basically got stuck (internal states were all messed up) and didn’t do its job. Of course outages happened, 3-5 AM in the morning. 😦

What is the lesson? The fail-over mechanism should NEVER rely on the operations on the fail-able nodes, and states should be kept outside as well. The simpler the fail-over mechanism, the better off you are (KISS rule). Here are some thoughts we are still experimenting:

  • Don’t use VIP. Instead, let application side handle fail-over (i.e. specifying two physical IP addresses, if one fail, use the other). Monitor script activate/de-activate network switch port to implement fail-over (hardware network path is guaranteed to be a boolean). Most MySQL server node has two network ports, use one for application access, another one for replication/management.
  • Using LVS to manage VIP. Seems still too complicated, and I am concerned about the split-brain scenario, and what kind of availability it provides. The worst case is both MySQL nodes are taking traffic and you don’t even know about it.
  • Using hardware to manage VIP, the simplest, efficient and guaranteed way. We are considering using F5 BigIP. The issue is cost and bandwidth limit.

We are still investigating, will keep you posted. By the way, forgot to mention, we currently have 12 nodes in 6 clusters, all of them running on SSD drives, with peak 3-4 QPS per cluster.

Posted in Engineering, Operations | Tagged: | 2 Comments »

MySQL read write split myth, and why I wouldn’t use it

Posted by EngineSmith on September 11, 2010

In MySQL scale-out practices, many people chose to split their read and write to different nodes. This may work well for a read-most environment, but actually significantly painful and wrong in write-most settings, like ours.

  • Two MySQL nodes, master-master replication
  • Only the active node is taking writes (writer)
  • Two readers, round-robin between the two nodes
  • MMM is used to monitor replication delay and handle role switching

This seems to be the common sense approach, make sure only one node is writer to guarantee consistency, while utilize both nodes as readers to share the load. We did that for the last 6 months, and found out in practice this is wrong:

  • Plan your capacity for the worst case scenario. The point of using two nodes is to handle fail-over. When one node dies, all read and write traffic will go to the other node. You should plan to have one node handling all traffic, the read/write split gives you an illusion that you have the capacity until the failover, it is too dangerous.
  • Splitting read/write actually made your application logic super complicated due to replication latency. Regardless of your tolerance level, at some critical point, you have to write some ugly code like this:  a = reader.get(); if (a == null) a = writer.get();
  • MMM is not designed very well to handle role switching during failure cases (I will write another post about it later), using two roles make the situation fairly messy.

So, here is what we did recently, in one cluster always have only one active node taking both reads and writes. MMM only handles fail-over (not replication delay). Then over-sharding on the cluster (prepare to split the shard in case it can’t handle the load). Life is much much simpler now.

Posted in Engineering, Operations | Tagged: , , | Leave a Comment »