MongoDB + Hadoop = gg

I’m not sure if kids these days still do this, but when I was a gamer and had just beaten someone soundly, I simply typed “gg”. It’s short for “good game” but it really means “I just dominated you” or “game over, loser”. When you use Mongo with Hadoop, it’s kinda like that. With Mongo, you get a flexible, scalable database that excels at real-time processing. More and more startups are using it today and it’s our primary database (here’s why). With Hadoop, you get a distributed processing framework that handles everything you can’t or don’t want to process in real time. It’s even easier to scale than Mongo, Amazon has productized it (Elastic MapReduce), and it’s the Swiss Army knife of Big Data. Here are some reasons why they are a match made in data store heaven:

You Complement Me

Mongo is fast; it’s optimized for speed. Say goodbye to transactions and joins and other features you may not need. You can use it as your primary database to support your application in real time. It’s not so good on large datasets or complex querying. You lost joins, remember? And you want to shard my what? This is where Hadoop comes in. If you’re thinking about doing Big Data analytics, take those Nginx logs and crunch those numbers in your Hadoop cluster. Or if you’re tinkering with the latest machine learning algorithms to predict your users’ preferences–Taste Graph anyone?–it comes in handy.

Two Words: Map and Reduce

Map/Reduce (M/R) is at the core of Hadoop. It allows you to break down complex tasks into manageable chunks of data and processing. Mongo took a page out of Hadoop’s book when it included an implementation of M/R. Mongo’s version is even easier, in my view, because writing the functions in Hadoop’s native Java is usually more confusing than writing JavaScript for Mongo. Add some Ruby and you’ve got dynamic M/R in your Rails app! You can write mappers and reducers in Mongo to validate your Hadoop Java code on a smaller data set. And when you’re ready for the big show, you can fire up your 1000-server cluster to find the question to 42.
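
If you’ve never written a mapper and a reducer, here is a rough sketch of the idea in plain Python. It is not Mongo’s or Hadoop’s API; the function names, the tiny in-memory driver, and the sample log documents are all made up for illustration. The point is just the shape: the mapper emits key/value pairs, the pairs get grouped by key, and the reducer folds each group into one result.

```python
from collections import defaultdict


def map_hit(doc):
    """Map step: emit one (key, value) pair per input document."""
    yield doc["path"], 1


def reduce_hits(key, values):
    """Reduce step: fold all values emitted for one key into a single result."""
    return sum(values)


def run_map_reduce(docs, mapper, reducer):
    """Toy in-memory driver: map, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


# Hypothetical sample documents, e.g. parsed web-server log lines.
docs = [
    {"path": "/home", "status": 200},
    {"path": "/login", "status": 200},
    {"path": "/home", "status": 404},
]

hits = run_map_reduce(docs, map_hit, reduce_hits)
print(hits)
```

On a real cluster the grouping step happens across machines, but the mapper and reducer you write keep this same shape, which is why prototyping them on a small data set first is cheap.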

Have Your Cake and Eat It Too

If you don’t know what to choose for your task, you can always use both at the same time. With a plugin, you can use Mongo as an input or output for Hadoop. It even has some optimizations for splitting the input on every chunk in a sharded environment. We’ve tried this for one of our features and it works very nicely. Eventually, your data requirements may grow such that you’ll have to go fully into Hadoop, but you can get away with this hybrid approach for a long time. If you’re looking to speed up processing time, you could farm out some data to Hadoop, have your cluster crunch the data in bite-size chunks, and do some more processing in your application–all within a Resque job.
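
To make the “bite-size chunks” idea concrete, here is a small plain-Python sketch. It does not use the Mongo plugin, Hadoop, or Resque; `crunch_chunk` is a hypothetical stand-in for whatever a cluster node would compute, and the chunk size is arbitrary. It only shows the pattern of splitting the input, crunching each piece independently, and finishing the work back in the application.

```python
def split_into_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def crunch_chunk(chunk):
    """Hypothetical stand-in for the work one cluster node does on one chunk."""
    return sum(chunk)


def process_in_chunks(items, chunk_size):
    """Crunch every chunk independently, then combine the partial results."""
    partials = [crunch_chunk(chunk) for chunk in split_into_chunks(items, chunk_size)]
    # Final processing step, done back in the application.
    return sum(partials)


values = list(range(1, 11))
total = process_in_chunks(values, chunk_size=3)
print(total)
```

Because each chunk is crunched without looking at the others, the list comprehension could be swapped for real distributed workers without changing the final combine step.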

4 thoughts on “MongoDB + Hadoop = gg”

    • Hi Antoine,

      We have used Mongo with our own cluster both in AWS as well as in our colo. At the time, EMR was just being introduced and we needed more control so we didn’t roll with that. EMR might be ready today, though.

      Jimmy

      • Ok thank you!

        Did you ever find a mismatch between the number of documents you wanted to M/R and the number of documents actually mapped?

        Thank you.
        Antoine

      • There may be different reasons you are seeing this. Mongo’s “count” feature was buggy in some versions. In fact, it was a huge performance and accuracy liability for us. So beware when you’re using that to compare. If you’re using a replica set, you could also have data inconsistencies that are affecting your total. This shouldn’t actually happen, but I think in some versions you can run an M/R while data is being changed and see inconsistencies. Keep in mind that the latest versions should be much more stable than the ones we “grew up” with. If you’re seeing inconsistencies, I suggest trying M/R on a single mongod process with no writes happening. If you can’t fit it onto a single instance, either use a machine with more memory or graduate to Hadoop.

        Good luck!
