MongoDB + Hadoop = gg

I’m not sure if kids these days still do this, but when I was a gamer and I had just beat someone soundly, I simply typed “gg”. It’s short for “good game” but it really means “I just dominated you” or “game over, loser”. When you use Mongo with Hadoop, it’s kinda like that. With Mongo, you get a flexible, scalable database that excels at real-time processing. More and more startups are using it today and it’s our primary database (here’s why). With Hadoop, you get a distributed processing framework that handles everything you can’t or don’t want to process in real-time. It’s even easier to scale than Mongo, Amazon’s productized it (Elastic Map/Reduce), and it’s the swiss army knife of Big Data. Here are some reasons why they are a match made in data store heaven:

You Complement Me

Mongo is fast; it’s optimized for speed. Say goodbye to transactions and joins and other features you may not need. You can use it as your primary database to support your application in real-time. It’s not so good on large datasets or complex querying. You lost joins remember? And you want to shard my what? This is where Hadoop comes in. If you’re thinking about doing Big Data analytics, take those Nginx logs and crunch those numbers in your Hadoop Cluster. Or if you’re tinkering with the latest machine learning algorithms to predict your users’s preferences–Taste Graph anyone?–it comes in handy.

Two Words: Map and Reduce

Map/Reduce (M/R) is at the core of Hadoop. It allows you to break down complex tasks into manageable chunks of data and processing. Mongo took a page out of Hadoop’s book when it included an implementation of M/R. It makes it even easier, in my view, because writing the functions in Hadoop’s native Java is usually more confusing then writing Javascript for Mongo. Add some Ruby and you’ve got dynamic M/R in your Rails app! You can write mappers and reducers in Mongo to validate your Hadoop Java code on a smaller data set. And when you’re ready for the big show, you can fire up your 1000-server cluster to find the question to 42.

Have Your Cake and Eat It Too

If you don’t know what to choose for your task, you can always use both at the same time. With a plugin, you can use mongo as an input or output for Hadoop. It even has some optimizations for splitting the input on every chunk in a sharded environment. We’ve tried this for one of our features and it works very nicely. Eventually, if your data requirements may grow such that you’ll have to go fully into Hadoop, but you can get away with this hybrid approach for a long time. If you’re looking to speed up processing time, you could farm out some data to Hadoop, have your cluster crunch the data in bite-size chunks, and do some more processing in your application–all within a Resque job.

Art Credit

When you should use MongoDB


Having built an enterprise SaaS gamification platform with hundreds of millions of documents and soon-to-be billions as we grow from hundreds of clients to thousands in the next few years, I’ve thought a lot about MongoDB as a primary database. We are pushing it to the limits and are living pretty close to the bleeding edge. Thus far we’ve been pretty happy with the choice but Mongo isn’t for everybody. I get the sense that some people are trying to use it for the wrong reasons and then complain when things don’t work out. Here are some of the reasons why we decided to use MongoDB:

We wanted a dynamic database schema. We are a Behavior Platform. We record arbitrary behaviors for our clients and do interesting things with them. For example, “Joe commented on a article” could easily be “Joe checked-in at a bar in San Francisco called 21st amendment on 3rd Street in the SOMA district”. By offering almost infinite flexibility, we can support a variety of use cases from e-commerce to education. This was our most important requirement. As a bonus, Mongo offers asynchronous indexing and thus we don’t need to do database migrations and all deploys now require zero down-time.

We wanted something that scaled easily. Given that we’re a platform and our data grows with the number of clients we have and over time, we aren’t your ordinary build-it-and-hope-they-will-come website. Our configuration started with Master-Slave, then Replica Sets, and now Sharding. In some of our applications and in certain environments, we still use non-sharded setups. Mongo makes it easier to setup these configurations but there’s still significant time involved to develop and harden your infrastructure. Once it’s setup though, it’s really nice to watch as data gets sharded automatically and rebalanced in realtime. It’s also nice to know that you have multiple redundant servers with automatic failover.

We wanted Map/Reduce. On top of flexibility in storing the data, we wanted flexibility in processing it. Mongo gave us the ability to develop certain features very quickly because our primary database supported this rich framework. We even wrote some early analytics implementations using the native Map/Reduce in Mongo. At a certain scale, doing Map/Reduce on your primary database will dramatically hinder normal performance, but Mongo gave us plenty of time to port our mappers and reducers easily to Hadoop.

There are many reasons to stick with a relational database. If you need transactions or you prefer the comfort of many more years of “bake-in” time, NoSQL will not sit well with you. With Mongo, there will be a smaller community of developers and less tools definitely fewer stack-overflow questions. However, if you’re trying to build an exceptional application or platform with ambitious requirements, Mongo might be the one for you.