Notes from the event: “Scaling Big Data Mining Infrastructure: The Twitter Experience” by Jimmy Lin

16 Jul

This evening I had the chance to attend an interesting talk in Ghent: “Scaling Big Data Mining Infrastructure: The Twitter Experience”. The presentation, given by Jimmy Lin and kindly hosted at the Massive Media office, was full of energy and insight, and I think the more than 60 people who came from different parts of Belgium would agree with me. With a very strong academic background as well as a great deal of experience in a fast-paced environment such as Twitter, Lin gave a superb presentation that is difficult to forget. I especially appreciated his ability to explain complex topics with very concrete real-world examples and to focus on the core issues that matter most.

I took a few notes and I want to save them here while my memory is still fresh. I’ll simply summarize them as a few bullet points; do not expect full details, great technical accuracy or a nicely flowing narrative (and feel free to correct any errors or misunderstandings):

  • Twitter grew very quickly as a company in just a few years: from 140 people to 1400 people, from tens of servers to tens of thousands of servers, from a few TBs of data to a daily volume of 100 TB and more.
  • Data science and research have their glamorous side: new algorithms, cutting-edge ideas and experiments, exciting results.
  • Data science in the “real world” is not so glamorous: most of the time, a data scientist spends valuable time and resources just finding the relevant data, understanding it, cleaning it and massaging it so that it can finally be fed into an algorithm for classification, clustering, recommendation and so on.
  • When Jimmy Lin joined, the analytics stack was used daily by 5-6 analysts; now, in its current version, it is used daily by dozens of teams of analysts.
  • Storing logs in a traditional RDBMS such as MySQL simply does not work, and this has become common knowledge by now.
  • At the other extreme, you might prefer the flexibility of JSON, but then you will run into a different set of problems.
  • Alternatives such as Thrift provide a very good balance between flexibility and structure. You can use Protocol Buffers instead, but any such alternative is better than JSON, which can bite you in many ways: is the user ID a string, or should it be cast to an integer? How do you represent NULL, as opposed to the actual string “null”? How do you represent lists? How do you debug your program when you get bizarre type errors, or when half of your data is in some JSON format mixed with delimited values that the programmer has long forgotten, forcing you to do source code and data archaeology? Having a DDL (Data Definition Language), with added benefits such as efficient serializers, goes a long way.
  • Use Scribe for logging: it collects and aggregates the log data, which is then written to HDFS for use with Hadoop.
  • They have built an abstraction that serves as a Data Access Layer. The motivation was to be able to describe data locations logically, separating them from physical, hardwired, difficult-to-remember details. With the help of that layer, it became much easier to describe which data to load in Pig scripts.
  • When Twitter started applying machine learning techniques to its data sets, back in 2007-2008, it was like the Stone Age of machine learning: massive amounts of data were retrieved, downsampled so that they could fit on a single server or a laptop, and copied to that computer. There the data was used to train and test a model, predictions were run, and the results were pushed back to the production servers. The downsampling by itself defeated the whole purpose of having big data to begin with. Moreover, this approach did not lend itself to a flexible and scalable analytics workflow at all.
  • One way to build a supervised classification system is to minimize a loss function, and one well-established method for that is gradient descent. Unfortunately, gradient descent does not perform well on Hadoop clusters.
  • Hadoop is a poor fit for iterative algorithms such as gradient descent, since every iteration has to run as a separate job, and it is also highly sensitive to skew, among other disadvantages.
  • A reasonable alternative is to use stochastic gradient descent and do some form of online learning, instead of traditional batch learning.
  • The advantage of this approach is that it constantly updates the model and is compatible with real-time streaming. In a sense, doing machine learning on big data becomes simply writing another user-defined aggregate function in Pig.
  • Ensemble learning can also help a lot. With the current infrastructure and plumbing, it becomes much easier to use the parallel processing capabilities of Pig to run an ensemble of classifiers on the incoming data. (Ensemble learning helps reduce the variance component of the error, but not the bias.)
  • Another advantage of the current scheme is that online learning does not need to see all of the data at once to train the supervised classifier: it processes each piece of data as it arrives and then discards it, serializing only the learned model as output, which can then be used directly for making predictions.
  • Stochastic gradient descent, like its traditional version, runs the risk of getting stuck in local minima, but having big data helps, and online learning combined with an ensemble of classifiers mitigates the risk: a bad classifier will exist only for a very limited amount of time.
  • There are many fancy, state-of-the-art machine learning algorithms out there, but without investing in the infrastructure and plumbing that bring massive amounts of data to those algorithms easily and flexibly, they are not very useful at the end of the day: you need the whole system in order to turn analytics into actionable insights.
  • Once a heavy investment has been made in sophisticated big data infrastructure and plumbing, you have to use what you have instead of wishing for the different system you had in mind as an ideal.
  • The world of HPC (High Performance Computing) is also watching the open source big data ecosystem: people at Lawrence Livermore National Laboratory, who run one of the fastest computers in the world (Sequoia, a 16 petaflops system with about 1.5 million cores), are thinking about what would happen if they ran Hadoop on their system.
  • In the world of HPC, some think that 1 PB of data is big, whereas in the world of analytics people have started to see clusters processing 1 PB of data on a regular basis.
  • There are strong companies building attractive, proprietary big data infrastructure solutions, but the whole field is a very fast-moving target, and in the long run open source solutions will probably be in a much better position. Take that into account when you invest time and resources in designing and building a solution.
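
To make the online learning bullets a bit more concrete, here is a minimal sketch in plain Python of the general idea: stochastic gradient descent that looks at each example once and then discards it, combined with a small ensemble of classifiers. This is not Twitter's implementation and it does not use Pig. The choice of logistic regression, the random routing of examples to ensemble members, the averaging of predictions, the learning rate and the toy data stream are all my own assumptions for the sake of illustration.

```python
import math
import random


class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD (illustrative)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate  # assumed value, not from the talk

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One SGD step on the log loss for a single example; y is 0 or 1."""
        error = self.predict_proba(x) - y
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error


def train_ensemble(stream, n_features, n_models=3, seed=0):
    """Send each incoming example to one randomly chosen model, then drop it.

    Only the models (the learned weights) are kept, never the data.
    The random routing is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    models = [OnlineLogisticRegression(n_features) for _ in range(n_models)]
    for x, y in stream:
        rng.choice(models).update(x, y)
    return models


def ensemble_predict(models, x):
    """Average the members' probabilities and threshold at 0.5."""
    avg = sum(m.predict_proba(x) for m in models) / len(models)
    return 1 if avg >= 0.5 else 0


def synthetic_stream(n, seed=1):
    """Hypothetical toy data: the label is 1 when x0 + x1 > 1."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.random(), rng.random()]
        yield x, (1 if x[0] + x[1] > 1.0 else 0)


if __name__ == "__main__":
    models = train_ensemble(synthetic_stream(5000), n_features=2)
    print(ensemble_predict(models, [0.9, 0.9]))
    print(ensemble_predict(models, [0.1, 0.1]))
```

Note that the training loop never stores the stream: each example updates one model and is gone, so the only output is the list of small model objects, which could be serialized and shipped to wherever predictions are made.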

Posted by on July 16, 2013 in Programlama

