Cask Expects Google’s Dataflow to Flourish in Apache

Andreas Neumann, Chief Architect, develops big data software at Cask and has previously done so at companies known for massive scale. Prior to Cask, he was Chief Architect for Hadoop at Yahoo!, and before that a research engineer at Yahoo! and a search architect at IBM.

The Big Data community has long been searching for a good abstraction to express data processing pipelines. And now one possible answer to that quest may have emerged.

For ad-hoc querying of data, the standard is clearly SQL, and, not surprisingly, SQL has found its way back into the “NoSQL” world in various incarnations of SQL-on-Hadoop engines. As a result of this resurgence, people have attempted to use SQL to describe data processing pipelines on Hadoop as well.

Such pipelines often consist of a series of processing steps, including but not limited to transformation (projection, conversion, canonicalization, normalization), annotation (tagging, classification, categorization, lookups), grouping and shuffling, and aggregation, possibly over windows of time. Expressed in SQL, such a series of steps results in a deeply nested query that is neither intuitive to read nor easy to maintain. If the pipeline has branches, it becomes even more complex. Clearly, SQL is not the most natural way to express data pipelines.

Many attempts have been made to come up with the right interface to describe processing pipelines, ranging from workflows over MapReduce, through specialized data flow languages such as Pig and Crunch, to general-purpose programming paradigms like Apache Spark, to name just a few batch-oriented approaches. Then there are streaming engines, including the likes of Storm, Spark Streaming, Samza, Flink, and Tigon. Each of these engines has its own strengths and sweet spots. Yet they all suffer from a major drawback: their APIs are tightly coupled to a single execution engine (their own) and typically work only in batch or only in streaming. Porting a pipeline across these paradigms or to a different execution engine almost always requires a complete rewrite with little reuse of code.

For example, suppose you have implemented a streaming pipeline, but now you want to apply that same logic to the last two years of historical data. How would you run the same analytics pipeline over the historical data as over the newly arriving data? Or what if you would like to increase the impact of a batch pipeline by porting it to real time? Wouldn’t it be great if you could take your existing pipeline, flip a switch, and voilà: it runs on a different engine? The solution to this challenge is an API that:

  • is powerful enough to express the commonly used capabilities mentioned above,
  • decouples pipeline composition from the runtime engine, and
  • can express both batch and streaming semantics with equal ease (a minimal sketch of such an API follows this list).
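
For illustration, here is a deliberately simplified, hypothetical sketch of such an API in Java. None of these types come from a real SDK; the point is only that the pipeline definition never mentions the engine that will execute it.

```java
// Hypothetical sketch only -- these types are illustrative, not from any real SDK.
interface PipelineSpec {                 // engine-agnostic description of the steps
  PipelineSpec transform(String step);   // e.g. normalization, annotation
  PipelineSpec groupBy(String key);      // grouping and shuffling
  PipelineSpec aggregate(String fn);     // aggregation, possibly over time windows
}

interface Engine {                       // pluggable runner: batch or streaming
  void run(PipelineSpec spec);
}

class Pipelines {
  // The same spec can be handed to a batch engine today and a streaming engine tomorrow.
  static void execute(PipelineSpec events, Engine engine) {
    engine.run(events.transform("canonicalize").groupBy("userId").aggregate("count"));
  }
}
```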

We at Cask have been thinking about this challenge and have been following Google Cloud Dataflow as a promising approach to solving it. Originally offered in the Google Cloud and bound to execution in that ecosystem, Dataflow has been available as open source on GitHub for over a year. Its programming model unifies batch and streaming: a Java DSL describes a pipeline, which can then be executed by runners in different environments and on different execution engines. In addition to the Google Cloud runner, there are already experimental runners on top of Apache Spark and Apache Flink.
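
As a concrete sketch of that model, here is a minimal word-count pipeline written against the Dataflow Java SDK. Treat it as an illustration rather than official sample code: it assumes the 1.x SDK packages (com.google.cloud.dataflow.sdk.*), the class name and the input and output paths are placeholders of our own, and details may differ as the project evolves. The key point for the discussion above is that nothing in the pipeline code names the engine that will run it.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class MinimalWordCount {
  public static void main(String[] args) {
    // The runner (local, Google Cloud, Spark, Flink, ...) is chosen through the
    // options, typically via a --runner flag; the pipeline code below does not change.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.Read.from("/tmp/input.txt"))           // read lines of text (placeholder path)
     .apply(ParDo.of(new DoFn<String, String>() {         // split each line into words
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("\\W+")) {
           if (!word.isEmpty()) {
             c.output(word);
           }
         }
       }
     }))
     .apply(Count.<String>perElement())                   // count occurrences of each word
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {     // format the counts for output
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("/tmp/wordcounts"));          // write the results (placeholder path)

    p.run();
  }
}
```

Because the runner is selected through the pipeline options rather than in the code, the same program can be executed locally, on the Google Cloud service, or on the experimental Spark and Flink runners mentioned above, with the exact runner names defined by the respective runner projects.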

Today, Google proposed Dataflow for incubation at the Apache Software Foundation, in collaboration with other companies that have contributed to the project, including Talend, Data Artisans, Cloudera, and Cask. This will extend its reach even further and fully open it up for porting to other execution engines. Dataflow is still at an early stage and will need to evolve: it will have to offer more runners, support programming languages other than Java, prove that it can gain traction and adoption outside of the Google Cloud, and compete with similar paradigms such as Cascading. Yet its promise of decoupling APIs from execution and of portability across environments and runners is a great prospect for the Big Data community.

Here at Cask we make Big Data accessible and easy to use, with abstractions that decouple application code from the implementation of storage engines and runtime environments. Dataflow is perfectly aligned with this philosophy, which is why we are excited to participate in this proposal and look forward to collaborating with Google and others in the community.
