Pachyderm and TubeMogul Share Their Big Data Application Platforms and Experience

Russ Savage

Russ Savage is an application engineer at Cask, focusing on building end to end big data applications using the CDAP platform. He previously worked at Elastic as a solutions architect where he built tools to combine internal data sources to discover new insights.

It was so great to see everyone at the Big Data Applications Meetup last week! The meetup was sponsored by Cask, the company making big data applications easy, and by Ampool, and we would like to thank Milind Bhandarkar, the Founder and CEO of Ampool, for supporting this event. For those that couldn’t join us, we hope to see you next time for some great speakers and free food and beer.

We had two great talks, beginning with an introduction to Pachyderm by Joe Doliner, the Co-founder and CEO of the company. Pachyderm is a fresh take on a big data analytics platform, deployed through Kubernetes and Docker. By using a container architecture, Pachyderm is able to provide the broad functionality of Hadoop while maintaining the ease of Docker.

In addition to leveraging containers as a core concept of the platform, Pachyderm also introduces version control for your data based on the semantics of Git. This means you can easily diff multiple versions of data throughout your pipeline, which can save computation time and cost.
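To make the idea of diffing data versions a little more concrete, here is a minimal Python sketch. It is not Pachyderm's API or implementation; the `snapshot` and `diff` helpers and the sample file names are made up for illustration. It simply assumes that each version of a data set can be summarized as a map of file names to content hashes, so that a pipeline could skip files that have not changed.

```python
import hashlib


def snapshot(files):
    """Map each file name to a SHA-256 digest of its bytes (a toy 'commit')."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def diff(old, new):
    """Return the file names added, changed, or removed between two snapshots."""
    added = sorted(n for n in new if n not in old)
    removed = sorted(n for n in old if n not in new)
    changed = sorted(n for n in new if n in old and new[n] != old[n])
    return {"added": added, "changed": changed, "removed": removed}


# Two hypothetical versions of the same data set (illustrative data only).
v1 = {
    "users.csv": b"id,name\n1,ann\n",
    "clicks.csv": b"id,ts\n1,100\n",
}
v2 = {
    "users.csv": b"id,name\n1,ann\n",          # unchanged
    "clicks.csv": b"id,ts\n1,100\n2,105\n",    # one row appended
    "ads.csv": b"id\n7\n",                     # new file
}

delta = diff(snapshot(v1), snapshot(v2))

# Under this sketch's assumption, only added or changed files need reprocessing.
to_reprocess = delta["added"] + delta["changed"]
print(to_reprocess)  # ['ads.csv', 'clicks.csv']
```

In this toy example `users.csv` is identical in both versions, so a downstream step would not need to touch it again, which is one way a diff between versions could translate into saved computation.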

This idea of “Git for your Data” is also useful for team collaboration. Team members can fork from the same “data repo” so that everyone is using the same source of truth when running computations.

The entire platform is open source, so be sure to check out the Pachyderm repo on GitHub and their website to get started.


Our second and final talk was given by Murtaza Doctor, Director of Engineering at TubeMogul, and John Trenkle, Chief Scientist at TubeMogul. They provided a brief but incredibly detailed overview of their big data platform and how online ad serving works. If you’ve ever wondered what happens behind the scenes when an ad shows up in your browser, you should definitely check out the video below.

TubeMogul has been serving online advertising for 10 years and has seen its data grow exponentially. In 2015, the company handled more than 12.6 trillion ad auctions through its system, with up to 55 billion auctions per day in some cases. That massive amount of data means its ecosystem has evolved to handle ever-increasing needs, encompassing data pipelines, machine learning, and of course, a ton of analytics.

As they continue to grow, TubeMogul relies on big data technologies such as Hadoop, Hive, Spark, and Presto and is learning a lot along the way. Check out the full talk below for all the insights and details from their experience.


We hope to see you at the next Big Data Applications Meetup on Wednesday, June 15th. Finally, we are always looking for great speakers, so please reach out to one of the Meetup organizers if you would like to give a talk.
