Posted by EngineSmith on June 22, 2011
Running your own servers with freaking bad-ass disk I/O can cut down your infrastructure complexity, and it is sometimes a life saver for a startup. We have been using Intel X25-E SSD drives as our only MySQL storage for a year now, and it is simply amazing and cost effective. If you want to make your life a living hell dealing with turtle-speed I/O, tons of EC2 instances and constant failures, try EC2.
Here is a video talking about the big win of SSDs. My favorite quote: “50TB SSD in one machine for $80K, and you can fucking skip sharding”.
Posted in Engineering, Hardware, Operations, Startup | Tagged: database, mysql, performance, ssd | 2 Comments »
Posted by EngineSmith on April 4, 2011
Clipper Card is a new San Francisco Bay Area transportation payment system intended for all the major public transportation systems, including Caltrain, BART (subway), VTA (buses), etc. Well, it has been a complete disaster so far, and I believe it marks the death of public transportation development in the heart of Silicon Valley (quite ironic, right?). The whole product is simply a joke; my experience today should give you something fun to enjoy.
I ride Caltrain daily and buy a monthly pass on my Clipper Card. The rules are:
- You have to tag on and off once, and ONLY on the first day of the month on which you travel, in order to activate your monthly pass. No idea why this is necessary; maybe they just wanted to save one cron job (the technical term for a scheduled job). If you don’t do this, the Caltrain conductor’s reader will show that you do NOT have a valid ticket, and you will either be kicked off the train or given a citation ($250 minimum), even though I think they can see clearly on their device that you have an “inactive” monthly pass on your card.
- If you tag on at your origin station and forget to tag off at your destination station, Clipper will charge you the maximum possible fare from your origin station. Basically they think you are most likely cheating in such cases and should pay the penalty (“you are assumed guilty”).
Okay, if you are still reading, I take it you are well educated enough to understand those rules, as I was until this morning.
- Luckily, even without a reminder, I remembered to tag on at 10:30 AM on 4/4/2011 (since I didn’t go to work on 4/1/2011). By the way, the card reader showed absolutely nothing about my monthly pass; it acted as if it had just deducted the maximum fare of $8.50 from my card, with the remaining balance displayed. The theory is that when I tag off, they will refund my $8.50 and show a very vague “Pass OKAY” message indicating the monthly pass has been activated.
- Sadly, I forgot to tag off (how can I remember to do something I do once in 30 days?). By the time I remembered, it was already 3 PM. Thinking there might be a time window allowed for travel (some say it is 6 hours), I rushed back to the station and tagged.
- The super-intelligent machine showed something meaning “you just opened another trip; $9.00 has been deducted from your card”.
- “WTF!” I stood there stunned, looking around; nobody could help (and I dared not tag the card again to try to cancel the trip, as many of you might be thinking). 🙂
Back at my office, I checked my online account: right, they had already charged me $8.50 in the morning, and now there was a new trip of $9 on top of it. The only hope left was to call their customer support line and talk to a human. It turns out the time window allowed for travel is only 4 hours, and that’s why they charged me (assuming I was cheating, even though I have a monthly pass on the card). “Congratulations though, your monthly pass has been activated.” Those were the exact words from the customer support guy.
I have to call them back tomorrow to get both charges refunded, as a one-time courtesy, said the support guy. I know, they think I am really stupid to “forget” to do such a simple thing once a month.
Frankly, this is the most brain-dead software system I have ever seen. With the help of Clipper Card, the Bay Area’s already terribly-in-deficit public transportation systems may die much more quickly. As a consumer, I don’t really care what kind of complicated problems they are trying to solve; if it makes things harder and messier, it is a fail. From a design perspective, what went wrong?
- The purpose of the system is to simplify people’s lives. You can’t push the burden onto ordinary users to “remember” and “apply” your complicated business flow (if … else … then …) just because you are too lazy to make it simpler. I have a monthly pass; forcing me to remember an exception once per month is totally unacceptable.
- You can’t assume everyone is cheating; they are your valued customers, and they are human. Humans forget and make mistakes all the time. If your system knows I have a monthly pass, why do you still charge me? Guess what: everyone will call your service department for a refund. Do you know how much it costs to take one call? My guess is around $50. And those calls won’t improve your service ratings, since they merely remedy the stupid design flaws in your system in the first place.
- Customer support should NEVER be the only way to solve a problem. Surprisingly, you can only charge up your card at a Walgreens store or online, and you can’t find out what’s going on through either of them. There is not a single device out there that can help you manage your card (like an ATM). If you make a mistake, in any form, you have to call, or risk a citation (which you may have to, since the next train may be an hour away).
By the way, I just remembered another funny experience with the BART ticket machines many years ago (not sure if it is still the case today): I inserted a $20 bill, and instead of directly asking me where I wanted to go or how many tickets I wanted, it gave me a list of options to select from: do you want to buy 5 tickets at $4? Or 4 tickets at $4.50? Etc. Wow, very intelligent machine; I have to be good at math to understand what you meant. Why don’t you listen to what I want instead?
Posted in Engineering, Software | Tagged: bart, caltrain, clippercard, design, ticket, transportation | 2 Comments »
Posted by EngineSmith on February 14, 2011
A luxury car has more than 100M lines of code, while the F-22 Raptor has only 1.7M lines of code. In the Mercedes-Benz S-class, the navigation system alone contains more than 20M lines of code.
From somewhere else I found that the average Ford car contains more than 10M lines of code. I think we are doomed with this trend. Everybody seems to like reinventing the wheel again and again, no matter how lame they are at it.
The same thing is happening in the web world: every several months, some new kids on the block try to solve an old problem with a completely new approach. “Don’t throw the baby out with the bath water”: there is a quote like that from the Drizzle project (the revamp of MySQL), if I remember correctly. So many NoSQL products are trying to replace the existing rock-solid MySQL, backed by remarkable marketing hype (and propaganda campaign machines). Only time will tell, just as it told how MySQL survived all these years.
[update] A colleague found this link, amazing: the F-22 got zapped by the International Date Line.
Posted in Engineering, Software | Tagged: car, drizzle, fighter, mysql, nosql | Leave a Comment »
Posted by EngineSmith on January 20, 2011
This is a pretty big lesson from recent weeks. We are using the MySQL JDBC driver against a 5.1 Percona build. For a simple use case, we decided to use a transaction (though we know transactions don’t scale well). Though a stored procedure could have been used, we chose a trick in the JDBC driver instead: allowMultiQueries. http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html
Basically, the following statements are put together as one iBatis query: begin; insert into A …; update B where …; commit;
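To make the grouping concrete, here is a minimal sketch (a hypothetical helper with made-up table names, not our actual iBatis mapping) of how several statements end up in one multi-statement string that the driver sends in a single round trip when allowMultiQueries is on:

```java
// Sketch: packing several SQL statements into one string, the way our
// grouped iBatis query did. Table and column names here are hypothetical.
public class MultiStatement {
    // Wrap the given statements in an explicit BEGIN/COMMIT pair and join
    // them into a single string the driver will ship in one call.
    static String asOneTransaction(String... stmts) {
        return "BEGIN; " + String.join("; ", stmts) + "; COMMIT;";
    }

    public static void main(String[] args) {
        String sql = asOneTransaction(
            "INSERT INTO orders (user_id) VALUES (42)",
            "UPDATE balances SET amount = amount - 1 WHERE user_id = 42");
        // With allowMultiQueries=true this whole string goes out as one
        // statement, e.g. stmt.execute(sql);
        System.out.println(sql);
    }
}
```

The danger, as we found out, is that the transaction boundary lives inside an opaque string rather than in the connection's state, so anything that disturbs how statements are batched on a pooled connection can silently move that boundary.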
This worked perfectly fine for several months, until one day, after a new release, our MySQL suddenly started having trouble: it began throwing deadlock errors. Tons of random queries were being grouped together into big transactions, causing the deadlocks. We scratched our heads for a long time, since absolutely nothing related to this transaction logic had changed in that release, nor had we touched MySQL or changed the JDBC driver. It seems the transaction boundary was randomly extended beyond the two queries above, and since we use connection pooling (BoneCP), subsequent queries on the same connection were combined into big transactions. This is really terrible!
Eventually we took out the allowMultiQueries trick and did everything in plain two-step SQL (yeah, no transactions). To this day, we still don’t know exactly what triggered the problem, since it had been working fine for several months.
By the way, a side note about transactions. They sound like a perfect solution on paper; in reality, especially in the web world (where you have sharding and many database nodes), they don’t work. One way to look at it: modern hardware and networks are quite reliable nowadays, so compared with the cost of guaranteeing a transaction, it is probably better to spend the time and money on ways to fix things up afterwards (tooling, customer support, reconciliation) if a transaction is interrupted. That will also make your system much simpler and easier to scale.
Forgot to mention: in 2004, PayPal had an outage for almost a week because some smart guys had introduced two-phase commit into their production system. A great idea to guarantee ACID, and a textbook example of the use case (banking transactions). Sadly, in reality, it doesn’t work.
Posted in Engineering, Operations, Software | 3 Comments »
Posted by EngineSmith on October 28, 2010
Several weeks ago I wrote Redis – Cache done right. Since then, we deployed a cluster of 6 Redis nodes in Production. The result: simply awesome!
It has been a smooth ride: billions of cache hits every day, very low CPU load. The application logic is super simple; no fancy locking is required, since lists/sets are native data structures with atomic operations. We implemented our own distributed hash-table algorithm (consistent hashing) to distribute keys across the cluster (borrowed from memcached’s Java client). Our MySQL database load has dropped a lot, and we now also cache lots of Facebook call results (since their API sucks: high failure rate, and 2-10 second average response times).
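For readers who haven't seen consistent hashing before, here is a minimal ring in the spirit of the memcached Java client we borrowed from (an illustrative sketch, not our production code; node names and the virtual-node count are made up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// A minimal consistent-hashing ring: each server gets many "virtual node"
// positions on a 64-bit ring, and a key maps to the first server clockwise.
public class Ring {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VNODES = 160; // virtual nodes per server

    public Ring(String... nodes) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (String node : nodes)
            for (int i = 0; i < VNODES; i++)
                ring.put(hash(md5, node + "#" + i), node);
    }

    // Pick the first node at or after the key's position; wrap around at the end.
    public String nodeFor(String key) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        SortedMap<Long, String> tail = ring.tailMap(hash(md5, key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // Fold the first 8 bytes of the MD5 digest into a long.
    private static long hash(MessageDigest md5, String s) {
        md5.reset();
        byte[] d = md5.digest(s.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
        return h;
    }

    public static void main(String[] args) throws Exception {
        Ring ring = new Ring("redis-1:6379", "redis-2:6379", "redis-3:6379");
        // The same key always lands on the same node, and adding or removing
        // a node only remaps the keys near its ring positions.
        System.out.println(ring.nodeFor("user:12345"));
    }
}
```

The virtual nodes are what keep the key distribution even; with only one position per server, a 6-node ring can end up badly skewed.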
By the way, we haven’t used snapshots, the append-only log, etc. yet. For now it is just a read-only cache; later on we will use that fancy stuff as well.
The only issue we ran into: we forgot to set maxmemory, so Redis happily blew through physical memory, then swap, and crashed the whole server. 🙂 Since CPU load is low, we now actually run 3+ instances on each physical machine to form 3 individual clusters.
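For anyone about to repeat that mistake: capping Redis is two lines in redis.conf (the value here is illustrative; check which eviction policies your Redis version supports):

```
# redis.conf: cap memory so Redis evicts keys instead of swapping the box to death
maxmemory 4gb
# evict least-recently-used keys across the whole keyspace once the cap is hit
maxmemory-policy allkeys-lru
```

With no maxmemory set, Redis will keep allocating until the OS starts swapping, which is exactly the failure mode we hit.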
To sum it up: love this guy’s Twitter.
Posted in Engineering, Operations, Software | Tagged: redis cache scalability | 4 Comments »
Posted by EngineSmith on September 12, 2010
We implemented one of our products purely on top of Cassandra back in March 2010, with version 0.5.1. It went live in May, and tanked in the first week due to crazy traffic. Three painful, sleepless weeks later, Cassandra finally dropped its pants and started corrupting our data. We shut down the site and spent three full days converting the backend to MySQL clusters. Since then, our product has grown steadily until now (September 2010), with 8M+ users and 1M+ DAU (daily active users).
I don’t want to restate all the great benefits Cassandra promises; we were so excited by them. Unfortunately, if it sounds too good to be true, it usually is:
- We started with a 6-node Cassandra cluster, a big mistake. Some people later suggested starting with 50+ nodes. 🙂 You will see why below. Yes, it may be web-scale, but not for a small startup, I guess.
- We set quorum=2, replication=3, which means 5 nodes is the minimum. Unfortunately, during the initial weeks, we lost one of our six nodes, so our ability to withstand node failures was pretty low.
- Cassandra has a “binary-log”-like mechanism: it appends to those files until they reach a size limit, then it needs to compact the big file, during which the node is somewhat “unresponsive”. All nodes use the “Gossip” protocol to communicate, and if one node is unresponsive, all its neighbors think it is dead. (Even without node failures, we very often saw discrepancies in the ring topology from each node’s point of view.) When this compaction kicked in under high traffic, we saw cascading failures across the ring all the time.
- Later, people suggested using separate disk partitions for the “binary log” and the data. Well, that requires at least 4 disks on a server (2 mirrored for each, and you will probably want another pair for the OS; I don’t think we are crazy enough to run on single disks), and I wouldn’t call that “commodity hardware” any more.
- At the time (May 2010, version 0.5.1) there were very few I/O parameters you could tune for Cassandra, which makes you wonder: hmm, why did MySQL spend the last 10+ years building all those I/O tuning mechanisms?
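The quorum arithmetic in the second bullet is easy to sanity-check (a small sketch; the RF/2 + 1 formula is how Cassandra computes a QUORUM of replicas):

```java
// Sanity-checking the quorum numbers: with replication factor 3, quorum is 2,
// so each key survives only one of its replicas being down.
public class Quorum {
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;   // floor(RF/2) + 1
    }

    static int replicaFailuresTolerated(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }

    public static void main(String[] args) {
        System.out.println(quorum(3));                   // 2
        System.out.println(replicaFailuresTolerated(3)); // 1
    }
}
```

So on our six-node ring, quorum reads and writes kept working only as long as no more than one replica of any given key was unresponsive, which is why a single flaky node plus compaction pauses hurt so much.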
I felt lucky we dumped it early on. At the time, people were asking us: why did Digg/Facebook/Twitter all jump on Cassandra while you guys ditched it? Maybe you are not competent enough? Turns out later the reality is:
Recently we have also often heard: why don’t you use MongoDB? It does everything for you, and xxx is using it. If you pay just a little bit of attention, you’ll see that MongoDB sacrifices your data consistency for performance, big time, almost like cheating:
Enjoy this one: “/dev/null is web scale“.
Posted in Engineering, Operations | Tagged: nosql cassandra mongodb | 20 Comments »
Posted by EngineSmith on September 11, 2010
One of my favorite interview questions is to write the simplest function to sort the characters in a string. No gimmicks, no tricks, just to see if the code is clean, compilable and easy to follow; i.e., a simple bubble sort will do. Simple? Don’t laugh at it yet.
Surprisingly, there has been an 80%+ failure rate (wrong algorithm, major errors in the code, etc.) over the last 5 years among 100+ candidates, including some very senior, architect-level guys. Of the ones who finished it correctly, I probably hired most (of course, they had to pass the other stuff too).
Several interesting observations:
- Many senior guys get offended when they can’t get it right. The attitude shown is much more important than the question itself.
- Most US-educated engineers reach for “merge sort” first (to be honest, the concept is simple, but the code is long, and many failed at merging two arrays of characters, or at the recursion). Foreign engineers usually pick bubble sort: super simple, and you can finish it in fewer than 10 lines.
- So far only one guy has written “quick sort” in front of me. He got it completely right, and could also briefly explain why it is efficient. I think he was just well prepared, but it still showed his effort.
- For some, sorting is an utterly alien concept. I bet that if I picked a guy off the street without any computer science education, he could at least describe how he would sort a bunch of stuff. Many engineers couldn’t even start the thought process.
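For reference, the bubble-sort answer I have in mind really does fit in about ten lines (Java here, but any language would do):

```java
// The whole interview question: sort a string's characters with bubble sort.
public class CharSort {
    static String sortChars(String s) {
        char[] a = s.toCharArray();
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j + 1 < a.length - i; j++)
                if (a[j] > a[j + 1]) {               // swap out-of-order neighbors
                    char t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                }
        return new String(a);
    }

    public static void main(String[] args) {
        System.out.println(sortChars("interview")); // prints "eeiinrtvw"
    }
}
```

Nothing clever, which is the point: the question measures whether a candidate can produce small, correct, readable code under no pressure at all.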
Here is an interesting thing about sorting: each algorithm has its own unique sound!
Posted in Engineering, Software | Tagged: sort interview | 1 Comment »
Posted by EngineSmith on September 11, 2010
We have been using MMM (Multi-Master Replication Manager for MySQL) in Production for over a year now to manage our master-master MySQL clusters. It worked out pretty well for a while, until recently a couple of outages related to it made us realize its limitations.
MMM basically manages VIPs (virtual IPs) and associates roles with the MySQL nodes in a cluster. In the two-node case, you configure two VIPs: one for the reader, one for the writer (if you do read-write splitting, which we did before). In case of failure, or when replication delay grows too long, MMM switches the VIP to point at the right host. So what’s the catch?
- First of all, the VIP is configured on each MySQL node itself. To make the switch, MMM needs to SSH to the node and change its network config to release or take over the VIP.
- For the switch to happen, MMM also expects to issue some MySQL commands on the node (clean-up, setting read-only, etc.).
What is wrong with this? MySQL nodes can fail, and when things fail, they can fail in ANY POSSIBLE WAY. We had two outages recently: once, due to a hardware failure, the MySQL node couldn’t accept any SSH connections; another time, the MySQL instance was hosed and held the MMM request without responding at all (things would have worked if it had actually rejected the request). In both scenarios, MMM basically got stuck (its internal state was all messed up) and didn’t do its job. And of course, the outages happened at 3-5 AM. 😦
What is the lesson? A fail-over mechanism should NEVER rely on operations on the fail-able nodes themselves, and its state should be kept outside them as well. The simpler the fail-over mechanism, the better off you are (the KISS rule). Here are some approaches we are still experimenting with:
- Don’t use a VIP at all. Instead, let the application side handle fail-over (i.e., configure two physical IP addresses; if one fails, use the other). A monitor script activates/de-activates the network switch port to implement fail-over (a hardware network path is guaranteed to be a boolean: up or down). Most MySQL server nodes have two network ports: use one for application access, the other for replication/management.
- Use LVS to manage the VIP. This seems too complicated, and I am concerned about the split-brain scenario and what kind of availability it actually provides. The worst case is that both MySQL nodes are taking traffic and you don’t even know it.
- Use hardware to manage the VIP: the simplest, most efficient and most dependable way. We are considering an F5 BigIP. The issues are cost and the bandwidth limit.
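The first option, application-side fail-over, can be sketched like this (a hypothetical helper; the IPs are made up, and the connect function stands in for whatever actually opens a connection, e.g. DriverManager.getConnection):

```java
import java.util.function.Function;

// Sketch of application-side fail-over between two physical IPs:
// try the primary, and on failure fall back to the standby.
public class Failover {
    static <C> C connectWithFailover(String primary, String standby,
                                     Function<String, C> connect) {
        try {
            return connect.apply(primary);
        } catch (RuntimeException primaryDown) {
            // Primary unreachable: use the other physical IP.
            return connect.apply(standby);
        }
    }

    public static void main(String[] args) {
        // Simulate a dead primary: the connect function throws for it.
        String used = connectWithFailover("10.0.0.1", "10.0.0.2", host -> {
            if (host.equals("10.0.0.1")) throw new RuntimeException("down");
            return host;
        });
        System.out.println(used); // prints "10.0.0.2"
    }
}
```

The appeal is that no state lives on the fail-able node: the decision is made entirely by the caller, which is exactly the lesson above.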
We are still investigating, and will keep you posted. By the way, forgot to mention: we currently have 12 nodes in 6 clusters, all running on SSD drives, with a peak of 3-4 QPS per cluster.
Posted in Engineering, Operations | Tagged: mysql mmm failover scalability | 2 Comments »
Posted by EngineSmith on September 11, 2010
In MySQL scale-out practice, many people choose to split reads and writes across different nodes. This may work well in a read-mostly environment, but it is actually significantly painful, and wrong, in a write-heavy setting like ours. Our previous setup:
- Two MySQL nodes, master-master replication
- Only the active node is taking writes (writer)
- Two readers, round-robin between the two nodes
- MMM is used to monitor replication delay and handle role switching
This seems like the common-sense approach: make sure only one node is the writer, to guarantee consistency, while utilizing both nodes as readers to share the load. We did this for the last 6 months, and found out that in practice it is wrong:
- Plan your capacity for the worst-case scenario. The point of using two nodes is to handle fail-over: when one node dies, all read and write traffic goes to the other node. You should plan for one node handling all the traffic; the read/write split gives you the illusion that you have enough capacity, right up until the failover. It is too dangerous.
- Splitting reads and writes actually makes your application logic super complicated, due to replication latency. Regardless of your tolerance level, at some critical point you have to write ugly code like this: a = reader.get(); if (a == null) a = writer.get();
- MMM is not designed well enough to handle role switching during failure cases (I will write another post about that later), and using two roles makes the situation fairly messy.
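That ugly fallback in the second bullet, fleshed out (a sketch with in-memory maps standing in for the replica and master connections; names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// The replication-lag fallback: read from the (possibly lagging) replica,
// and if the row isn't there yet, go to the master.
public class ReadYourWrites {
    final Map<String, String> reader = new HashMap<>(); // lagging replica
    final Map<String, String> writer = new HashMap<>(); // master

    String get(String key) {
        String a = reader.get(key);
        if (a == null) a = writer.get(key); // replica hasn't caught up yet
        return a;
    }

    public static void main(String[] args) {
        ReadYourWrites db = new ReadYourWrites();
        db.writer.put("user:1", "alice");     // write landed on the master...
        System.out.println(db.get("user:1")); // ...replica is behind: "alice"
    }
}
```

Note what this doesn't handle: a row that exists on the replica but is stale rather than missing, which is exactly why this kind of code never stops growing special cases.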
So here is what we did recently: each cluster now always has a single active node taking both reads and writes, and MMM only handles fail-over (not replication delay). On top of that we over-shard across clusters (ready to split a shard in case one can’t handle the load). Life is much, much simpler now.
Posted in Engineering, Operations | Tagged: failover, mmm, mysql | Leave a Comment »
Posted by EngineSmith on August 28, 2009
Let’s face it, developing web applications in Java/JSP is really, really painful, regardless of which framework you work with. You have to restart your application server very often, even for simple Java class or layout/configuration changes. But we are kind of stuck with it: most of our domain logic is already written in Java, so using PHP or Rails would require us to either rewrite the domain logic or expose it as HTTP services. Both sound a bit crazy unless you have a web guru in house.
Finally, Grails comes to the rescue. There are a couple of things that are just amazing about it; you get the best of both worlds: the strictness of a compiled language like Java for domain logic, plus the dynamism and convenience of a scripting language, Groovy. It is truly a godsend:
- The only thing deployed to Production is a web application (a WAR file) containing compiled Java classes. Everything runs as compiled classes; you do NOT need a Grails or Groovy development environment in Production. You get all the benefits of the mature JVM: multi-threading, heap management, garbage collection and great scalability.
- Most things are dynamically reloaded at development time, so you don’t need to restart the server all the time.
- Seamless integration with existing Java domain logic, and Spring is native. IntelliJ IDEA contributed the mighty joint compiler, which compiles Groovy and Java files at the same time, and you can debug Groovy classes inside IDEA.
Furthermore, Grails has tons of conveniences that make developers’ lives super easy (like Rails). For example, there is a wonderful example that takes 10 lines of code to add full-text search for a database entity using Lucene. Those didn’t drive my decision, but the icing on the cake is just so sweet!
Unfortunately, Grails also has so much code-by-convention that it is hard to wrap into an existing big Java project. There are some tricks and hacks we did to make it work; I will cover them in future posts.
Posted in Engineering, Software | Tagged: grails groovy spring web java | 2 Comments »