I’ve been encouraged to look at MongoDB (and other NoSQL implementations) because someone seems to have the idea this is some new panacea. I first smelled a rat when I read of a seminar promoting MongoDB by someone who has created a social web site and studied poetry. Now that’s not to detract from the study of poetry. However, if I’m going to be persuaded of NoSQL, it’s going to be through concrete examples with performance stats, and ideally by someone with a PhD in set algebra.
NoSQL does have a place. The birthplace of most of the ideas in current NoSQL implementations is the ‘Amazon Dynamo’ paper released by Amazon defining their requirements for data storage.
Their argument is that for some tasks the relational database concepts of normalization, and especially transactional consistency, are overkill and too expensive. Amazon does not need strictly consistent writes when the same data is being written several times simultaneously and mirrored across lots of cheap kit. The data for a shopping cart does not have to be normalized beyond the level of session and/or user. For Amazon it is acceptable that the cart is read as a blob and updated as a blob by some application which knows the internal structure of that blob.
Of course, just because you use a database engine which supports relations, you don’t have to use them. Relational engines can also store de-normalized data, so this was not the reason for looking for an alternative to an open source relational engine. Amazon also wanted their data replicated across multiple nodes. At the time the cheap databases (MySQL) were not up to the task, so Amazon created their own storage engine (Dynamo). Facebook did likewise (Cassandra).
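To make the point about storing a blob in a relational engine concrete, here is a minimal sketch. It uses SQLite from the Python standard library purely as a stand-in for any relational engine, and the table name `cart`, the JSON layout and the helper functions are all my own inventions for illustration; none of this is how Amazon actually stores carts.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory database with one key/blob table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cart (user_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def load_cart(conn, user_id):
    """Read the whole cart blob for a user; an unknown user gets an empty cart."""
    row = conn.execute(
        "SELECT body FROM cart WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {"items": []}


def save_cart(conn, user_id, cart):
    """Write the whole cart blob back, replacing whatever was stored before."""
    conn.execute(
        "INSERT OR REPLACE INTO cart (user_id, body) VALUES (?, ?)",
        (user_id, json.dumps(cart)),
    )
    conn.commit()


def add_item(conn, user_id, sku, qty):
    """Only the application knows the blob's structure: read, modify, write."""
    cart = load_cart(conn, user_id)
    cart["items"].append({"sku": sku, "qty": qty})
    save_cart(conn, user_id, cart)
```

The database sees nothing but a key and an opaque string, so no joins or foreign keys are involved. What the sketch does not address is the replication across nodes that Amazon also wanted.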
The expensive database engines like Oracle have long supported distributed and grid processing, but at a price Amazon and Facebook clearly did not want to confront. Now that MySQL has caught up, the case for using a special NoSQL engine is not so clear. But that doesn’t stop the NoSQL application vendors raising the hype bar. Here’s a quote from the MongoDB 2.0 press release:
For example, with 1.5 million new classified ads posted every day, Craigslist must archive billions of records in many different formats, and must be able to query and report on these archives at runtime. MySQL’s lack of flexibility became a barrier for continued usage: A simple schema change on their vast archive took months to complete, preventing them from pushing new features.
Now I’ve watched both of the presentations by Jeremy Zawodny, who decided to use MongoDB at Craigslist, and I’d say this is a bit of an over-statement. I think someone reading this statement is likely to believe Craigslist have ditched MySQL in favor of MongoDB, which just isn’t true. According to my understanding of Jeremy’s talks, CL use MySQL to provide the very fast data retrieval they need to support the list, and MongoDB is only used as a glorified backup manager. Their MySQL database, which is still their primary data engine containing millions of active ads, does take time to update if schema changes are needed (which is infrequent), and it takes even longer to apply the same schema change to the copy used to record archive data. The archive is important because it lets a user review past ads and re-instate one if needed (that is, create a record in the primary database). Using Mongo allows them to store each ad in the archive in whatever structure it had at the time, so there’s never a need to update the archive’s schema.
Using MongoDB as a backup store seems like a reasonable use-case, but what I didn’t hear was why they didn’t create a simpler, denormalized schema in MySQL instead. Maybe a case of boys wanting to play with new toys? And maybe more than a little concern about the future of MySQL under the stewardship of Oracle?
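For what it’s worth, here is roughly what I have in mind by a denormalized archive in a relational engine. This is only a sketch under my own assumptions: SQLite from the Python standard library stands in for MySQL, and the tables `ad_archive` and `active_ad`, their columns and the helper functions are hypothetical, not Craigslist’s actual schema. Each archived ad is kept as a JSON document in whatever shape it had at the time, and re-instating it copies the fields the current primary table knows about.

```python
import json
import sqlite3


def open_archive():
    """Create a hypothetical archive table and a hypothetical primary table."""
    conn = sqlite3.connect(":memory:")
    # Archive: fixed columns for lookup, everything else in an opaque payload.
    conn.execute(
        "CREATE TABLE ad_archive ("
        "ad_id INTEGER PRIMARY KEY, archived_on TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    # Primary table: the current, structured schema for active ads.
    conn.execute(
        "CREATE TABLE active_ad ("
        "ad_id INTEGER PRIMARY KEY, title TEXT NOT NULL, price INTEGER, category TEXT)"
    )
    return conn


def archive_ad(conn, ad_id, archived_on, fields):
    """Store the ad exactly as it was; no schema change when fields differ."""
    conn.execute(
        "INSERT INTO ad_archive (ad_id, archived_on, payload) VALUES (?, ?, ?)",
        (ad_id, archived_on, json.dumps(fields)),
    )


def reinstate_ad(conn, ad_id):
    """Copy an archived ad into the primary table; missing fields become NULL."""
    row = conn.execute(
        "SELECT payload FROM ad_archive WHERE ad_id = ?", (ad_id,)
    ).fetchone()
    if row is None:
        raise KeyError(ad_id)
    fields = json.loads(row[0])
    conn.execute(
        "INSERT INTO active_ad (ad_id, title, price, category) VALUES (?, ?, ?, ?)",
        (ad_id, fields["title"], fields.get("price"), fields.get("category")),
    )
    return fields
```

An ad archived before `category` existed sits in the same table as a newer one, and nothing in the archive has to be altered when the primary schema changes. Whether this would hold up at Craigslist’s volume is exactly the question I didn’t hear answered.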