Monday, October 03, 2011

Scalability

I've been helping various startups, and one thing that impresses me is how often these startups launch on newer infrastructure such as Ruby on Rails, with code written mostly by people who have no formal training in computer science or software engineering. This is a triumph of modern programming tools: I certainly didn't think we'd get to a point where product managers can essentially write the code themselves, and asking these teams to execute in C++ or even Java would have set them back months. Even better, these startups don't run or own their machines, choosing instead to use Amazon Web Services to launch and scale.

That is, until they suddenly hit a scaling inflection point and someone like me gets called in to help with the architecture and scalability problem. This is a good thing, by the way. Too many software engineers fresh out of school get hung up on the latest performance or scalability techniques and use them too aggressively when there's no need. It's far more important to launch a product quickly and get it to the point of product acceptance before you worry about performance. Even then, some of the latest techniques get you nowhere. One of my favorite examples came when I was at Mirapoint. A team was working on the mail transfer agent. This is a relatively straightforward piece of code, but in designing for high performance they had chosen an architecture built on multiple threads, which made debugging hell. Brad Taylor came in, took a look at it, and rewrote the entire thing using a single-threaded select loop. Not only was that single-threaded select loop easier to debug, it actually ran faster, because the processor wasn't doing all those crazy context switches just to get simple things done. In general, when scaling a product, your best bet is to design first for easy debugging and easy replication (so you can scale horizontally) by spinning up new processes rather than throwing the universe into one address space and launching multiple threads. The first person to articulate that philosophy to me was Michael Wolf, at a dinner conversation with me and Steve Grimm.
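For readers who haven't seen one, here is a minimal sketch of what a single-threaded select loop looks like. It is written in Python purely for illustration and is not the Mirapoint code; the names poll_once and serve, the 4096-byte read size, and the handle callback are all assumptions of mine.

```python
import select
import socket


def poll_once(listener, clients, handle, timeout=0.1):
    """Run one pass of the loop: wait for readable sockets, then service each.

    listener is a listening socket, clients is a set of connected sockets,
    and handle is a hypothetical callback mapping request bytes to reply bytes.
    """
    readable, _, _ = select.select([listener, *clients], [], [], timeout)
    for sock in readable:
        if sock is listener:
            # A new connection is waiting; accept it and start watching it.
            conn, _addr = listener.accept()
            conn.setblocking(True)
            clients.add(conn)
            continue
        try:
            data = sock.recv(4096)
            if data:
                # Simplification: a production loop would buffer replies and
                # also select on writability rather than call sendall here.
                sock.sendall(handle(data))
        except OSError:
            data = b""
        if not data:
            # Empty read (or an error) means the peer has gone away.
            clients.discard(sock)
            sock.close()


def serve(host, port, handle):
    """Serve every connection from one thread, one request at a time."""
    with socket.create_server((host, port)) as listener:
        clients = set()
        while True:
            poll_once(listener, clients, handle)
```

Because everything happens on one thread, there are no locks to get wrong, and a debugger or a log file shows events in exactly the order they were handled. To use more cores you start more copies of the process, which is the replication argument above.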

Back to the startup. The problem with launching a startup on AWS and on Ruby on Rails is that when you hit the scalability inflection point, the easy solutions are not available. For instance, one company I helped was hammering its MySQL database with too many requests. While the number of queries was potentially large (and there was a fair bit of writing, so read-caching wasn't helping), the actual database was small. If they had been hosting their own infrastructure instead of running on AWS, a simple, straightforward solution requiring no programming would have been to just install an SSD (or, if you have money burning a hole in your pocket, one of those crazy Fusion-io PCI boards, which bypass the SATA limitations and use the full bandwidth of the PCI bus for I/O). SSDs are expensive, but even $400 or $8000 a pop is cheaper than the time it would take to rework the database into a NoSQL solution. Unfortunately, none of the existing cloud solutions will let you specify SSD-type performance for the machines you request.

Since they were essentially using MySQL as a blob store, they thought about exploring one of the NoSQL solutions. But the amount of data they had was so small that it would have fit into the main memory of a relatively large server, so they could potentially have used a simple shadow hashtable approach: keep the data in an in-memory hashtable and have a background thread write the shadowed hashtable to disk for persistence. Unfortunately, that requires real threads and some simple locking, and the default implementation of Ruby doesn't support kernel threads (though JRuby does). Furthermore, by using Ruby on Rails, they'd written themselves into a corner where it would be difficult to extract the data layer from their code so they could write to a NoSQL database anyway. What's interesting to me is that these were the people smart enough to know that they'd run into scalability limits with their software and infrastructure. There were probably many others who did not and chose to muddle along.
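To make the shadow hashtable idea concrete, here is a rough sketch of one way it could look. It is in Python rather than Ruby, it is not what the company built, and the class name ShadowedStore, the JSON file format, and the flush interval are all assumptions chosen for illustration. It also assumes keys and values are strings.

```python
import json
import os
import threading


class ShadowedStore:
    """In-memory hashtable with a background thread that persists snapshots."""

    def __init__(self, path, interval=1.0):
        self._path = path
        self._lock = threading.Lock()        # guards _data and _dirty
        self._flush_lock = threading.Lock()  # one disk write at a time
        self._data = {}
        self._dirty = False
        self._stop = threading.Event()
        if os.path.exists(path):
            with open(path) as f:
                self._data = json.load(f)
        self._writer = threading.Thread(
            target=self._flush_loop, args=(interval,), daemon=True)
        self._writer.start()

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._dirty = True

    def flush(self):
        with self._flush_lock:
            with self._lock:
                if not self._dirty:
                    return
                # The shadow copy: taken under the lock so it is consistent,
                # then written out without blocking readers and writers.
                shadow = dict(self._data)
                self._dirty = False
            tmp = self._path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(shadow, f)
                f.flush()
                os.fsync(f.fileno())
            # Atomic rename, so a crash never leaves a half-written file.
            os.replace(tmp, self._path)

    def _flush_loop(self, interval):
        # wait() returns False on timeout and True once close() sets the event.
        while not self._stop.wait(interval):
            self.flush()

    def close(self):
        self._stop.set()
        self._writer.join()
        self.flush()
```

The trade-off is durability: anything written since the last flush is lost if the process dies, so the interval is really a statement about how much data you can afford to lose. For a small, write-heavy blob store that may be an acceptable price for taking the database off the request path.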

When I read that Twitter was a big user of Ruby, all their scalability problems finally made sense. At some point, you do have to throw away your prototype and rewrite everything if you want to scale.