Nerdfest: BigTable / HBase is cool

BigTable is insanely simple and powerful, and came out of some pioneering work at Google. The fundamental idea is that you store data as relationships between a key and a value, aka key-value pairs in a hashtable. Any nerd will tell you this is a pretty important construct.

BigTable, or its open source equivalent HBase, gives you a way of using this simple hashtable concept to replace relational databases. Instead of relying on SQL and an engine like MySQL or Oracle, you just store and retrieve data in big hashtables of key-value pairs. Since you're not using a relational database, you simply aren't in a mindset where you want to normalize everything into separate tables of information. Instead, you put all your data into hashes that contain everything you would need for a particular request, which means a bit of duplication of information. The advantage is that reading data becomes brain-dead easy, which is important because on most popular public websites you're reading data 99.99% of the time and writing it 0.01% of the time.
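To make the "everything you need for one request lives in one hash" idea concrete, here is a minimal sketch in Python. It uses a plain dict as a stand-in for the table; the names store, save_post and load_post_page are made up for illustration and are not part of the BigTable or HBase API.

```python
# Hypothetical in-memory stand-in for a key-value table (not the HBase API).
store = {}

def save_post(post_id, title, body, author_name, comments):
    """Write one denormalized row holding everything a post page needs."""
    store[f"post:{post_id}"] = {
        "title": title,
        "body": body,
        "author_name": author_name,  # duplicated from the author's own row
        "comments": list(comments),  # embedded instead of joined at read time
    }

def load_post_page(post_id):
    """Reading is a single key lookup, with no joins."""
    return store[f"post:{post_id}"]

save_post(1, "BigTable is cool", "...", "alice", ["Nice post!", "Agreed."])
page = load_post_page(1)
```

The trade-off shows up on the write side: if alice changes her name, every row that copied it has to be rewritten, which is the price of the cheap read.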

Relational databases, on the other hand, make writes relatively easy (since no data is duplicated), but reads are harder because you have to re-join the tables into something useful. These joins also make life really hard when you have more data than can fit on one computer. In the past, you just had to buy a larger, more expensive, more hardcore server. These days, everyone has settled on the power of having many cheap servers instead. Facebook, for instance, has 10,000 of them! Databases can be 'sharded' to fit across multiple machines, but these techniques can be tricky and are usually custom / homegrown for now. One sharding technique involves duplicating data so that you don't have to join across multiple machines, which is the same concept that BigTable tries to codify into the way it works. No joins = sharding across as many machines as you want.
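Here is a rough sketch of why "no joins" makes sharding easy: if every request is answered by one key, you only need a rule that maps a key to a machine. The example below hashes the key to pick a shard, which is a simplification chosen for brevity (as far as I understand it, BigTable and HBase actually split tables into ranges of sorted row keys). NUM_SHARDS, shard_for, put and get are hypothetical names, and each "machine" is just a dict.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of machines, for illustration only
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict plays one server

def shard_for(key):
    """Map a row key to a shard index with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, row):
    """Store the whole denormalized row on exactly one shard."""
    shards[shard_for(key)][key] = row

def get(key):
    """Fetch a row by touching only the shard that owns its key."""
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "alice", "recent_posts": ["BigTable is cool"]})
row = get("user:42")
```

Because the row already contains everything the request needs, no read ever has to visit a second machine, so adding capacity mostly means adding shards.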

Google has built many of its services on BigTable because it lets whatever they build scale to millions of users at any time. Their new platform App Engine actually ONLY supports BigTable for data storage, which is one reason why many conventional developers are staying away. But I remain curious about the whole thing, and would love to investigate using HBase with Rails. (Just noticed there's an open source project called Ruby Rhino that kicked off just a few months back that is trying to make that a reality. It's bleeding edge, folks!)

A few related links:
Understanding HBase and BigTable
Matching Impedance: When to use HBase