EngineSmith's Blog

Engineering Craftsman

SSD Rules

Posted by EngineSmith on June 22, 2011

Running your own servers with freaking bad-ass disk I/O can cut your infrastructure complexity, and is sometimes a life saver for a startup. We have been using Intel X25-E SSD drives as our only MySQL storage for a year now, and it is simply amazing and cost effective. If you want to make your life a living hell and deal with turtle-speed I/O, tons of instances and constant failures, try EC2.

Here is a video talking about the big win of SSD. My favorite quote: “50TB SSD in one machine for $80K, and you can fucking skip sharding”.

Posted in Engineering, Hardware, Operations, Startup

“Be sexy, go down”

Posted by EngineSmith on April 18, 2011

This video from the MySQL 2011 conference is hilarious. Sadly, it has been so true in this industry: if you don’t have downtime, people think you are not sexy (or at least not trying to solve hard problems). If you go down all the time, and publicly talk about the “hard” problems you are facing and all the sexy new products you are trying, no matter how ridiculous they are, you win. Scale fail is the new PR.

A successful operations team should be quiet, to the extent that you don’t even realize they exist. Unfortunately, that’s when people start taking them for granted.

Posted in Operations

Clipper Card: nightmare product design

Posted by EngineSmith on April 4, 2011

Clipper Card is a new San Francisco Bay Area transportation payment system intended for all the main public transportation systems, including Caltrain, BART (subway), VTA (buses) etc. Well, it has been a complete disaster so far, and I believe it marks the death of public transportation development in the heart of Silicon Valley (quite ironic, right?). The whole product is simply a joke; my experience today should give you something fun to enjoy.

I ride Caltrain daily and buy a monthly pass on my Clipper Card. The rules are:

  • You have to tag on and off once, ONLY on the first day of the month that you travel, in order to activate your monthly pass. No idea why this is necessary, maybe they just want to save one cron job (technical term for a scheduled job). If you don’t do this, the Caltrain conductor’s reader will say that you do NOT have a valid ticket, and he will either kick you off the train or give you a citation ($250 minimum), even though I think they can see clearly on their device that you have an “inactive” monthly pass on your card.
  • If you tag on at the source station and forget to tag off at the destination station, Clipper Card will charge you the maximum fare possible from your originating station. Basically they think you are most likely cheating in such cases, and should pay the penalty (“you are assumed guilty”).

Okay, if you are still reading, I take it you are well educated enough to understand those rules, as I was until this morning.

  • Luckily, without a reminder, I remembered to tag on at 10:30 AM on 4/4/2011 (since I didn’t go to work on 4/1/2011). By the way, the card reader message showed absolutely nothing about my monthly pass; it acted as if it had just deducted the maximum fare of $8.50 from my card, with the remaining balance displayed. The theory is that when I tag off, they will refund my $8.50 and show a very vague “Pass OKAY” message indicating the monthly pass has been activated.
  • Sadly, I forgot to tag off (how can I remember to do it once in 30 days?). By the time I remembered, it was already 3 PM. Thinking there might be a time window allowed for travel (some say it is 6 hours), I rushed back to the station and tagged.
  • The super intelligent machine showed something meaning “you just opened another trip, $9.00 has been deducted from your card”.
  • “WTF!” I stood there stunned, looking around, and nobody could help (and I dared not tag the card again to try to cancel this trip, like many of you might be thinking).🙂

I went back to my office and checked the online account: right, they had already charged me $8.50 in the morning, and now there was a new trip charged at $9. The only hope was to call their customer support line and talk to a human. It turns out the time window allowed for travel is only 4 hours, and that’s why they charged me (assuming I am cheating, even though I have a monthly pass on the card). “Congratulations though, your monthly pass has been activated.” Those were the exact words from the customer support guy.
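The charging behaviour I ran into can be written down as a toy model. This is only a sketch of what I observed, not Clipper's actual logic: the class name, the starting balance and the message strings are made up for illustration, the 4-hour window is the figure customer support quoted, and the second tag time is approximate.

```python
from datetime import datetime, timedelta

TRAVEL_WINDOW = timedelta(hours=4)  # figure quoted by customer support


class ClipperCard:
    """Toy model (hypothetical) of the tag-on/tag-off charging seen in the post."""

    def __init__(self, balance_cents):
        self.balance_cents = balance_cents
        self.open_trip = None  # (tag-on time, cents held) or None

    def tag(self, when, max_fare_cents):
        """Process one tag at a station whose maximum fare is max_fare_cents."""
        if self.open_trip is not None:
            started, held = self.open_trip
            if when - started <= TRAVEL_WINDOW:
                # Tag-off inside the window: the held fare is refunded.
                self.balance_cents += held
                self.open_trip = None
                return "Pass OKAY"
            # Window expired: the earlier charge stands, and this tag is
            # treated as the start of a brand-new trip.
        self.balance_cents -= max_fare_cents
        self.open_trip = (when, max_fare_cents)
        return "new trip opened"


card = ClipperCard(balance_cents=5000)        # assumed starting balance
card.tag(datetime(2011, 4, 4, 10, 30), 850)   # morning tag-on, $8.50 held
card.tag(datetime(2011, 4, 4, 15, 0), 900)    # about 4.5 hours later, $9.00 more
print(card.balance_cents)
```

In this model a pass holder who is half an hour late ends up $17.50 out of pocket ($8.50 + $9.00), with a second trip still open on the card.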

I have to call them back tomorrow to get both charges refunded, as a one-time courtesy, according to support. I know, they think I am really stupid to “forget” to do such a simple thing once a month.

Frankly, this is the worst-designed software system I have ever seen. With the help of Clipper Card, the Bay Area’s public transportation systems, already terribly in deficit, may die much quicker. As a consumer, I don’t really care what kind of complicated problems they are trying to solve: if it makes things harder and messier, it is a fail. From a design perspective, what went wrong?

  • The purpose of the system is to simplify people’s lives. You can’t push the burden onto ordinary users to “remember” and “apply” your complicated business flow (if …. else … then ….) just because you are too lazy to make it simpler. I have a monthly pass, so making me remember an exception once per month is totally unacceptable.
  • You can’t assume everyone is cheating; they are your valued customers and they are human. Humans forget and make mistakes all the time. If your system knows I have a monthly pass, why do you still charge me? Guess what, everyone will call your service department for a refund. Do you know how much it costs to take one call? My guess is around $50. Also, those calls won’t improve your service ratings, since they merely remedy the stupid design flaws that were in your system in the first place.
  • Customer Support should NEVER be the only way to solve a problem. Surprisingly, you can only charge your card at a Walgreens store or online, and you can’t find out what’s going on through either of them. There is not a single device out there that can help you manage your card (like an ATM). If you make a mistake, in any form, you have to call, or risk a citation (which you may have to do, since the next train may be an hour away).

By the way, I just remembered another funny experience with the BART ticket system many years ago (not sure if it is still the case today): I inserted a $20 bill into the ticket machine, and instead of directly asking me where I wanted to go, or how many tickets I wanted, it gave me a list of options to select from: do you want to buy 5 tickets of $4? Or 4 tickets of $4.5? Etc. Wow, very intelligent machine, I have to be good at math to understand what you meant. However, why don’t you listen to what I want instead?

Posted in Engineering, Software

V is for Victory: Vertica and VoltDB

Posted by EngineSmith on March 25, 2011

Both are Michael Stonebraker’s startups, and we recently adopted them both. The experience so far? Amazing.

We set up a 6-node Vertica cluster in one week, with the data loading process as well as Tableau generating reports. The simplicity and efficiency are just mind-blowing; compared to our previous failed Hadoop-based analytics project, this one is just a breeze. Of course, it has a price tag, but frankly, TOTALLY worth it! Much better than spending several engineers on it for months and still getting a half-baked, super complicated and almost violent Hadoop analytics platform (Hadoop is not for the weak-minded, small-budgeted and resource-limited startup).

  • Last week, by mistake, we loaded 10 billion rows of duplicated data into our Vertica cluster. It was still running, though a bit slow.🙂
  • The rich analytic functions are super powerful. A path analysis (analyzing the page/click flow among all users) takes just 4 seconds over 60M rows.

If you are a startup seriously considering analytics, try Vertica before you waste all your money and resources on Hadoop/ETL/Data Warehouse solutions. They are not bad products, just too complicated. With Vertica’s powerful feature set and linear scalability, you can simplify your data flow significantly. You will realize that the star schema is just over-rated: you can write super complicated but still blazing fast queries over a de-normalized schema. SQL is just lovely (while map-reduce is just painful and awkward; one lesson we learned is: who can verify that the map-reduce code IS doing the right thing anyway?).

By the way, if anybody tries to deal with big data using some solution which can only run on one physical machine, stop the joke.

VoltDB is a proper ACID SQL database on steroids. Who said you have to sacrifice consistency for scalability? We have had enough bad experiences with bad NoSQL products (see my previous posts). VoltDB is simply a godsend. Of course you have to lose something (a schema change needs a restart of the whole cluster, which they are working to improve, and you have to write stored procedures instead of ad-hoc SQL), but I think those are totally reasonable: honestly, who will die if you shut down your site for 10 minutes a week for maintenance?

VoltDB is an in-memory only database (using k-safety and snapshots for redundancy) with linear scalability (proven). We were considering Redis for persistence for a bit (it is also in-memory with replication and snapshots); however, Redis is swinging between supporting clustering (transparent sharding) and disk persistence (to me, a total disaster to go in this direction). Since that is not settled, and manual sharding is just a big no-no, we settled on VoltDB.

We decided to pay for VoltDB support even though the community edition is perfectly enough for our purpose (we do operations through our own scripts instead of the web GUI anyway). Also, we really wanted to contribute to them to keep up the wonderful work.

Posted in Operations, Software

Who is designing your product?

Posted by EngineSmith on March 25, 2011

Ever read a document called a PRD (Product Requirement Document)? In most companies (boring ones at least), it is the Product Manager (PM)’s job to write up this document. Then a group of engineers rush ahead to get the product built according to the “spec”. What is the result? Most likely, the product will be late, ugly and buggy, way different from the PRD, and worst of all, nobody wants to use it.

In my personal experience, most PMs are MBAs who have no experience in any form of product design, not to mention Web or Mobile GUI design. How do you know if they are any good, except by which top business school they are from? Product is NOT business, it is art. My criterion for a good product designer is: he/she has the guts to say NO. Too many times, the PRD doesn’t thoroughly cover all the possible permutations of a scenario, and engineers always like to “over-engineer” and ask questions such as:

  • If a user can delete one message, do you want to allow them to delete more than one message?
  • If a user can create a new “something” here, can they delete/modify it later?

From a technical perspective, they are perfectly great questions. But should a PRODUCT always do everything technically possible? The majority of PMs will answer “good idea, let’s add it since it is useful to the user”. The best ones will answer “no, it is NOT an important feature for the majority of users, AND it makes the product ugly/stupid/complicated”. These PMs are managing the soul of the product, not a SPEC. Example? The original iPhone mail system: no multiple delete. Yes, there were tons of complaints; guess what, it doesn’t matter.

A PM should be the soul of your product, not an executioner. It is a great sign when the founders of a startup are doing the design themselves. The moment they hire a business guy to manage a product, you know they have handed their baby to a complete stranger (who might pretend he/she loves the baby, though in reality it is just a job).

Posted in Design, Startup

Your car has more than 10M lines of code

Posted by EngineSmith on February 14, 2011

A luxury car has more than 100M lines of code, while the F-22 Raptor has only 1.7M. In the Mercedes-Benz S-Class, the navigation system alone contains more than 20M lines of code.

From somewhere else I found that an average car from Ford contains more than 10M lines of code. I think we are doomed with this trend. Everybody seems to like reinventing the wheel again and again, no matter how lame they are at it.

The same thing is happening in the web world: every several months, there are some new kids on the block trying to solve an old problem with a completely new approach. “You don’t throw away the baby with the bath water”: there was some quote like that from the Drizzle project (the revamp of MySQL), if I remember correctly. So many NoSQL products are trying to replace the existing rock-solid MySQL, backed by remarkable marketing hype (and propaganda campaign machines). Only time will tell, just like how MySQL survived so many years.

[update] A colleague found this link, amazing: F22 got zapped by International Date Line.

Posted in Engineering, Software

MySQL JDBC Multiple Query Trap

Posted by EngineSmith on January 20, 2011

This was a pretty big lesson recently. We are using the MySQL JDBC driver against a 5.1 Percona build. For a simple use case, we decided to use a transaction (though we know it doesn’t scale well). Though a stored procedure could have been used, we chose to use a trick in the JDBC driver: allowMultipleQueries. http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html

Basically, the following statements are put together as one iBatis query: begin; insert into A …; update B where ….; end;

This thing worked perfectly fine for several months, until one day, after a new release, our MySQL suddenly started having trouble: it began to show deadlock errors. Tons of random queries were grouped together into big transactions and caused deadlocks. We scratched our heads for a long time, since absolutely nothing related to this transaction logic had changed in that release, and we had touched neither MySQL nor the JDBC driver. It seems the transaction boundary was randomly extended beyond those two queries above, and since we use connection pooling (BoneCP), subsequent queries on the same connection were combined into big transactions. This is really terrible!

Eventually we took out this allowMultipleQueries trick and did everything as two plain SQL steps (yeah, no transactions). To this day, we still don’t know exactly what triggered the problem, since it had been working fine for several months.
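For comparison, here is a minimal sketch of the same insert-plus-update with the transaction boundary made explicit in code, so that a connection never goes back to a pool with a transaction still open. It is not what we ran: it uses Python's built-in sqlite3 as a stand-in for MySQL and JDBC, the tables a and b mirror A and B above with made-up columns, and the function names are hypothetical. Note that, as far as I know, MySQL closes a transaction with COMMIT or ROLLBACK, so every code path should end in one of the two.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with the two demo tables (hypothetical schema)."""
    # isolation_level=None puts sqlite3 in autocommit mode, so the only
    # transactions are the ones opened explicitly with BEGIN.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE a (item TEXT)")
    conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, total INTEGER)")
    conn.execute("INSERT INTO b VALUES (1, 0)")
    return conn


def insert_and_update(conn, item, delta):
    """Insert into a and update b as one unit; always commit or roll back."""
    try:
        conn.execute("BEGIN")
        conn.execute("INSERT INTO a (item) VALUES (?)", (item,))
        cur = conn.execute(
            "UPDATE b SET total = total + ? WHERE id = 1", (delta,)
        )
        if cur.rowcount != 1:
            raise RuntimeError("expected exactly one row in b to be updated")
        conn.execute("COMMIT")
    except Exception:
        # Never leave the connection inside a half-finished transaction.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


conn = make_demo_db()
insert_and_update(conn, "widget", 5)
print(conn.in_transaction)  # should be False once the call returns
```

The point of the try/except is that the boundary is closed on every path, including failures, so nothing that runs later on the same connection can be swallowed into the old transaction.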

By the way, a side note about transactions. They sound like a perfect solution on paper; in reality, especially in the web world (where you have sharding and many database nodes), they don’t work. One way to look at it: modern hardware and networks are quite reliable nowadays, so compared to the cost of ensuring transactions, it is probably better to spend the time/money on ways to fix things (tools, customer support, reconciliation) if a transaction gets interrupted. It will also make your system much simpler and easier to scale.

Forgot to mention: in 2004, PayPal had an outage for almost a week because some smart guys introduced two-phase commit into their production system. A great idea to guarantee ACID, and a textbook example use case (banking transactions). Sadly, in reality, it doesn’t work.

Posted in Engineering, Operations, Software

Spectacular Black Friday

Posted by EngineSmith on November 29, 2010

Haha, the invisible war that happened over the Internet was no less spectacular. I had first-hand experience of this deal that night: Panasonic 42″ 1080p Plasma TV $298 @ buy.com w/free shipping. Lots of lessons.🙂

http://slickdeals.net/forums/showthread.php?sduid=0&t=2416009
http://slickdeals.net/forums/showthread.php?t=2421355
http://slickdeals.net/forums/showthread.php?t=2427091

For those who don’t want to read through the 300+ page threads, here are some highlights:

– The deal lasted only 7 minutes on buy.com. Before that, there were thousands of people clicking Refresh in the browser as well as posting to the forum. Most of them never saw the $298 price. Why? Because the default seller on that page was not buy.com: you had to select buy.com on the side, and had to add the item to the cart regardless of the price shown (it took me 4 minutes to figure this out, but by then it was already too late).

– Several hundred lucky hard-core deal hunters completed the checkout on time. Lots of losers too, for reasons including:
– mis-typed credit card info (hands are shaking)
– changed mind at last step, decided to go back to get one more TV (greedy)
– didn’t have buy.com account set up (newbie)
– or stupidly chatting with customer support asking “where is the TV?” (dumb)

Of course, after 10 seconds, the TV was gone from their shopping carts. Several lucky guys completed the checkout too fast and forgot to review the shipping charge: the default shipping cost $150 (free shipping was a selectable choice).

– 8 hours later, the majority of the people who ordered the TV got an email stating that buy.com had over-sold its inventory, and all their orders were canceled. It ended up that only about 71 TVs were sold.
– People got angry; they dug out all the buy.com executives’ personal info and Facebook pages, and started a war

– 11/26, buy.com started offering $50 gift certificate to those angry shoppers

– 11/27, buy.com backed down, decided to ship all the orders

Posted in Life, Misc

Redis rocks!

Posted by EngineSmith on October 28, 2010

Several weeks ago I wrote Redis – Cache done right. Since then, we deployed a cluster of 6 Redis nodes in Production. The result: simply awesome!

It was a smooth ride: billions of cache hits every day, very low CPU load. The application logic is super simple, with no fancy locking etc. required, since lists and sets are native data structures with atomic operations. We implemented our own distributed hashtable algorithm (consistent hashing) to distribute keys across the cluster (borrowed from memcached’s Java client). Our MySQL database load has dropped a lot, and we now also cache lots of Facebook call results (since their API sucks: high failure rate, and 2-10 seconds average response time).
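For readers who have not seen consistent hashing before, here is a minimal sketch of the idea. It is not our production code (that lives in Java); the class name, the node names, the use of MD5 and the number of virtual points per node are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual points (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node (assumed value)
        self._ring = []           # sorted list of (point hash, node name)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        # Any stable, well-mixed hash works; MD5 is used here for simplicity.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        """Place the node's virtual points on the ring."""
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        """Drop all virtual points that belong to the node."""
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        """Return the node owning the first point at or after the key's hash."""
        if not self._ring:
            raise ValueError("ring is empty")
        # (h,) sorts before any (h, node), so this finds the first point >= h;
        # the modulo wraps around to the start of the ring.
        idx = bisect.bisect_left(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing([f"redis-{i}" for i in range(1, 7)])  # hypothetical node names
print(ring.node_for("user:42"))
```

The property that matters for a cache cluster: when a node is removed, only the keys that lived on that node move elsewhere; every other key keeps its node, so most of the cache stays warm.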

By the way, we don’t use snapshots, the append-only log etc. yet. For now it is just a read-only cache. Later on we will use that fancy stuff as well.

The only issue we ran into was that we forgot to set maxmemory: Redis happily hit the physical memory limit as well as the swap limit, and crashed the whole server.🙂 Since CPU load is low, we actually now run 3+ instances on each physical machine to form 3 individual clusters.

To sum it up, love this guy’s twitter.

Posted in Engineering, Operations, Software

NightclubCity: party game on Facebook

Posted by EngineSmith on October 28, 2010

Awesome promotion video, party on!

Posted in Misc