By Mitch Pronschinske, Senior Content Analyst at DZone

How DataSift is Datamining 120K Tweets Per Second

11.30.2011
Attention, architectural gurus! Get ready to learn how one company puts together its amazing datamining architecture, and hopefully you'll also walk away with some ideas of your own after reading Todd Hoff's new post on High Scalability.

His post reviews the architecture of DataSift, a realtime Twitter datamining platform.

For the TL;DR crowd, I'll try to summarize further with a simple look at the stack, plus some brief commentary:

  • C++ for high performance components; makes sense
  • PHP for the site, API server, internal web services, and a speedy, custom job queue manager; nice traditional choice
  • Java/Scala for communication with HBase and Map/Reduce jobs; nice to see some Scala, though I was waiting for the Java to show up
  • Ruby for Chef; a great DevOps tool that's written in Ruby, so it makes sense here

Data Stores

  • MySQL (Percona Server) on SSDs; Peter Zaitsev would be proud of the Percona choice
  • HBase cluster with 400TB of storage across about 30 Hadoop nodes; HBase and Hadoop seem to be a common combo for major data operations
  • Memcached; great caching of course
  • Redis is still used for some internal queues, but it's on its way out (a sketch of the queue pattern follows this list)
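Since the Redis bullet above describes a classic internal work queue, here's a minimal sketch of that pattern using the redis-py client. The connection details, queue name, and job payload are all made up for illustration; this is the generic LPUSH/BRPOP work-queue idiom, not DataSift's actual code.

```python
import json
import redis

# Connect to a local Redis instance (host/port are assumptions).
r = redis.Redis(host="localhost", port=6379)

def enqueue(queue, job):
    # LPUSH puts the job at the head of a Redis list.
    r.lpush(queue, json.dumps(job))

def dequeue(queue, timeout=5):
    # BRPOP blocks until a job arrives (or the timeout expires) and
    # pops from the tail, so the list behaves as a FIFO queue.
    item = r.brpop(queue, timeout=timeout)
    return json.loads(item[1]) if item else None

enqueue("jobs:pending", {"id": 1, "action": "reindex"})
print(dequeue("jobs:pending"))
```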

Message Queues

  • ZeroMQ (or ØMQ): fast, lightweight, broker-free messaging; DataSift uses its Pub-Sub, Push-Pull, and Req-Rep patterns. This piece is really interesting, and it's gaining lots of traction at realtime operations like Loggly (see the sketches after this list)
  • Kafka: the persistent, distributed message queue used by LinkedIn. DataSift uses it for high-performance persistent queues; Java and Scala are used to process its data.
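To make the ZeroMQ bullet concrete, here's a minimal Pub-Sub sketch using the pyzmq bindings. The endpoint, topic prefix, and payload are assumptions, and a real deployment would run publisher and subscriber in separate processes:

```python
import time
import zmq

ctx = zmq.Context()

# Publisher: bind a PUB socket directly; no broker process involved.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")

# Subscriber: connect a SUB socket and filter on a topic prefix.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "tweets")

# Give the connection a moment to settle (the "slow joiner" issue)
# before publishing in this single-process demo.
time.sleep(0.2)

pub.send_string('tweets {"id": 1}')
print(sub.recv_string())
```

And for Kafka, a similarly hedged sketch of producing to a topic with the kafka-python client; DataSift's actual producers and consumers were Java/Scala, and the broker address and topic name here are placeholders:

```python
from kafka import KafkaProducer

# Broker address and topic are illustrative, not DataSift's setup.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("interactions", b'{"id": 1}')
producer.flush()  # block until the message is actually sent
```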

CI and Deployment

  • Jenkins pulls code every 5 minutes when there's a change, then automatically runs it through a variety of QA tools; Jenkins is the real deal, it seems (a toy version of this loop follows this list)
  • All projects are built and packaged as RPMs and sent to a dev package repo; good old RPM
  • Chef, mentioned before, automates deployments and infrastructure configuration.
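As a rough illustration of that poll-test-package flow, here's a toy Python loop. The repo path, test target, and spec file are hypothetical, and in practice Jenkins drives all of this rather than a hand-rolled script:

```python
import subprocess
import time

def head(repo):
    # Current commit hash of the checked-out repository.
    return subprocess.check_output(
        ["git", "-C", repo, "rev-parse", "HEAD"], text=True
    ).strip()

def poll_test_package(repo, spec, interval=300):
    last = head(repo)
    while True:
        subprocess.run(["git", "-C", repo, "pull", "--ff-only"], check=True)
        current = head(repo)
        if current != last:
            # Run the test suite, then package the build as an RPM.
            subprocess.run(["make", "-C", repo, "test"], check=True)
            subprocess.run(["rpmbuild", "-bb", spec], check=True)
            last = current
        time.sleep(interval)  # five-minute poll, per the post
```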

Monitoring

  • Lots of tools you'll hear about in the DevOps community here: first, their services emit StatsD events (a minimal emitter is sketched after this list)
  • Those events are combined with more system-level checks and added to Zenoss
  • The data is then visualized in Graphite
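StatsD speaks a tiny plain-text protocol over UDP, so emitting events is nearly trivial. Here's a minimal sketch; the host, port, and metric names are assumptions, not DataSift's configuration:

```python
import socket
import time

STATSD_ADDR = ("localhost", 8125)  # assumed StatsD host/port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(metric, value=1):
    # "name:value|c" is StatsD's counter format.
    sock.sendto(f"{metric}:{value}|c".encode(), STATSD_ADDR)

def timing(metric, ms):
    # "name:value|ms" records a timing sample.
    sock.sendto(f"{metric}:{ms:.1f}|ms".encode(), STATSD_ADDR)

start = time.time()
# ... handle a request here ...
incr("api.requests")
timing("api.latency", (time.time() - start) * 1000)
```

From there, Zenoss and Graphite aggregate and chart whatever the services emit.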

That was nice and quick, but it'd be even nicer to have an illustration of the interactions between these technologies, which is just as important as knowing the contents of the stack. Well, DataSift has provided that too:

The full-sized version is here.

Here is what Todd Hoff says is the point of it all:


  • Democratization of data access. Consumers can do their own data processing and analytics. An individual should, if they wished, be able to determine which breakfast cereal gets the most tweets, process the data, make the charts, and sell it to the brands. With a platform in place that's possible whereas if you had to set up your own tweet processing infrastructure it would be nearly impossible.
  • Software as a service to data. Put in a credit card, buy an hour's or a month's worth of data. Amazon EC2 model. Charge for what you use. No huge sign-up period or fee. Can be a small or large player.
  • You don't really need big data, you need insight. Now what do we do with the data? Tools to create insights yourself. Data is worthless. Data must be used in context.
  • The idea is to create a whole new slew of applications that you couldn't with Twitter’s API.
--Todd Hoff


And this only scratches the surface of this excellent post.  I encourage you to flag this post for more in-depth reading later.  There's tons of good info that I didn't even touch on.

Source: http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html


Comments

Amara Amjad replied on Sun, 2012/03/25 - 12:52am

Thanks, it is an interesting article. I have a couple of queries about the numbers though:

- You said that they process 250+ million tweets per day, which works out to be just under 3k per second. Twitter had a peak of just under 9k tweets per second when the news about Beyonce's baby broke. So where does the peak of 120k tweets per second come from?

- 10c per 1000 tweets seems like an awful lot; does this fee include resale rights as well though? If so, do you have any idea about the fees for full access without resale rights?

Thanks!
