Tackling Big Data Problems at Scale

Posted January 19, 2011 | Best Practices

Serving rich media ads and analytics for mobile devices brings a number of big data challenges. At Medialets, we spend a great deal of time scaling our ability to process data, aggregate metrics, and serve the tens of millions of mobile devices reached by our premium publisher apps, ad network and mediator partners, and analytics users.

Though some of these challenges are unique to how we deliver ads, consume data, and generate analytics, many are shared big data and distributed systems challenges seen by the likes of Google, eBay, Facebook, and many others. At Medialets we often build our own software to solve them, but sometimes the problems are big enough that we turn to systems like Hadoop.

A core objective of our infrastructure team is to keep our data pipeline fully automated and to make data available as quickly as possible to our users through Medialytics, our reporting and ad management interface.

The challenge lies in processing millions of analytic event elements per second, with dozens of permutations and aggregations for each element, and getting this data out expeditiously for reporting. We process the 2-3 terabytes of new data that come into Medialets every day! Updated reports are posted continuously, multiple times per hour.

Taking data in starts with our custom Java-based data tracking servers. We do exactly what we need to capture and consume the data that mobile devices send us, with response times under 2ms and a 3% variance. That scale comes not just from developing to a specific architecture but also from using concurrency patterns such as CAS (compare-and-swap) and other techniques that let us squeeze the most out of the hardware we run on. We are always responsive and highly available to our mobile devices, as evidenced by our Ad Serving tier, which accepts connections and responds to them within 200 microseconds with a 2% variance.
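As a rough illustration of the CAS pattern, here is a minimal, hypothetical sketch of a lock-free counter on a request hot path using Java’s java.util.concurrent.atomic package. The class and method names are assumptions for illustration, not our actual tracking-server code.

    import java.util.concurrent.atomic.AtomicLong;

    public class EventCounter {
        private final AtomicLong received = new AtomicLong(0);

        // Explicit CAS retry loop: read the current value, try to swap in the
        // incremented value, and retry if another thread won the race.
        // No locks are taken, so request threads never block each other here.
        public long recordEvent() {
            while (true) {
                long current = received.get();
                long next = current + 1;
                if (received.compareAndSet(current, next)) {
                    return next;
                }
            }
        }
    }

The point of the retry loop is that a thread losing a race simply tries again instead of blocking, which keeps latency on the hot path predictable.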

From there we start to organize the billions of objects that come in each day and turn them into the data sets we need to “crunch” with Hadoop. We sprinkle in lots of additional metadata from our data warehouse reporting system with each object processed. We have a setup that can handle 10x our current load, but we continue to look for innovative approaches to push some of this functionality upstream.
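To make the shape of this enrichment step concrete, here is a hypothetical sketch of joining a raw event with warehouse metadata before it is handed off to Hadoop. The field names and the in-memory lookup are assumptions for illustration only.

    import java.util.Map;

    // Hypothetical enrichment step: each raw event is joined with metadata
    // from the warehouse (faked here as an in-memory map) before being
    // written out as input for the Hadoop jobs.
    public class EventEnricher {
        private final Map<String, String> campaignNamesByAdId; // loaded from the warehouse

        public EventEnricher(Map<String, String> campaignNamesByAdId) {
            this.campaignNamesByAdId = campaignNamesByAdId;
        }

        public String enrich(String adId, String deviceId, long timestamp) {
            String campaign = campaignNamesByAdId.getOrDefault(adId, "unknown");
            // Tab-separated record, ready to be packed into a sequence file.
            return String.join("\t", adId, campaign, deviceId, Long.toString(timestamp));
        }
    }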

Before data makes it into Hadoop, it passes through an event processing system developed in Ruby and Python. We built this system before Azkaban was released, so we had to do it from the ground up. We began with Ruby, but as we started doing more Python streaming jobs, the next iterations of this tier began to be built out in Python. A lot of Ruby code is still actively developed.

Once the jobs go into our modestly sized and growing Hadoop cluster, we have both Java and Python MapReduce jobs churning away 24×7. We also use some Pig scripts with custom loaders for our sequence files. All of the results from this aggregated data are automated and made available to different reporting systems, with standard reports available in Medialytics.
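For a sense of what one of these aggregation jobs looks like, here is a minimal, hypothetical Java MapReduce example that counts events per ad ID from tab-separated input. It is a sketch in the spirit of the jobs described above, not our production code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EventCountJob {

        // Emits (adId, 1) for every input line; the ad ID is assumed to be
        // the first tab-separated column.
        public static class AdIdMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text adId = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t");
                adId.set(fields[0]);
                context.write(adId, ONE);
            }
        }

        // Sums the counts for each ad ID; also usable as a combiner.
        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable c : counts) {
                    total += c.get();
                }
                context.write(key, new LongWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "event-count");
            job.setJarByClass(EventCountJob.class);
            job.setMapperClass(AdIdMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }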

This is just a taste of the sorts of things we do on the backend. If you’re a systems engineer or a developer with experience in, and/or excitement about, working at scale with big data and distributed systems, we’re hiring! Ping us with your resume at https://medialets.com/who-we-are/careers/open-positions/ so we can follow up.

Joe Stein
Manager, Server Platforms
