Cask Hydrator and the Future of CDAP

Jonathan Gray, Founder & CEO of Cask, is an entrepreneur and software engineer with a background in open source and data. Prior to Cask, he was at Facebook working on projects like Facebook Messages. At the startup Streamy, Jonathan was an early adopter of Hadoop and an HBase committer.

The release of CDAP 3.2 comes at an exciting time of year for Cask, with conference season in full swing, customers pushing into production, and the public release of new product features that set the stage for the future of CDAP.

Cask has always been committed to making it easy for developers and organizations of any kind to build and run big data applications. From simple ETL pipelines to complex anomaly detection systems, whether batch or realtime, CDAP aims to increase the productivity of developers by providing abstraction and integration above the open source infrastructure projects that make up the big data ecosystem. Until now, Cask has been primarily focused on enabling developers to build their own custom solutions using the Apache Hadoop ecosystem of technologies, supported by our close partners Cloudera, Hortonworks and MapR.

Cask Hydrator and our partnership with DataStax each represent a major step forward in the evolution of CDAP. Cask Hydrator is the first example of a new core capability of CDAP called Application Templates, and our integration with Apache Cassandra is the first example of CDAP expanding beyond Hadoop.

App Templates and Cassandra

CDAP abstractions encapsulate complexity, enable reuse and provide portability. This gives a growing number of developers simpler access to big data, and it gives enterprises the time savings and flexibility to deliver solutions quickly, with the peace of mind of being future-proof.

For example, CDAP Datasets standardize core data models like HDFS Files and HBase Tables and have always been part of the platform. CDAP provides implementations to support running on any distribution and allows developers to create new dataset types for higher-level APIs. These domain-specific data APIs provide reusable libraries that shield developers from low-level APIs and insulate applications from infrastructure logic and specific data stores and versions.
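The insulation idea can be illustrated with a minimal sketch. This is not the CDAP Dataset API (which is Java); the `Table` interface, the two in-memory backends and the `record_login` helper are all hypothetical names, chosen only to show application logic written against an abstraction rather than against a specific store.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class Table(ABC):
    """Hypothetical table abstraction; NOT the real CDAP Table API."""

    @abstractmethod
    def put(self, row: str, column: str, value: str) -> None:
        ...

    @abstractmethod
    def get(self, row: str, column: str) -> Optional[str]:
        ...


class NestedDictTable(Table):
    """Stand-in backend that stores each row as its own dict of columns."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, row: str, column: str, value: str) -> None:
        self._rows.setdefault(row, {})[column] = value

    def get(self, row: str, column: str) -> Optional[str]:
        return self._rows.get(row, {}).get(column)


class FlatKeyTable(Table):
    """Second stand-in backend with a different layout: one key per cell."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, str], str] = {}

    def put(self, row: str, column: str, value: str) -> None:
        self._cells[(row, column)] = value

    def get(self, row: str, column: str) -> Optional[str]:
        return self._cells.get((row, column))


def record_login(table: Table, user: str, timestamp: str) -> None:
    """Application logic that depends only on the Table abstraction."""
    table.put(user, "last_login", timestamp)


def last_login(table: Table, user: str) -> Optional[str]:
    """Read back the value without knowing which backend is in use."""
    return table.get(user, "last_login")
```

Swapping `NestedDictTable` for `FlatKeyTable` requires no change to `record_login` or `last_login`, which is the kind of portability the dataset abstraction is meant to provide.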

Today, applications using CDAP are insulated from specific versions and distributions of Apache HBase. As part of our partnership and collaboration with DataStax, we will extend this to support Apache Cassandra and DataStax Enterprise. Both are supported as part of Cask Hydrator today, but work is underway to integrate Cassandra as a native dataset, providing portability of Table-based applications to Cassandra, as well as transactional support via integration with Tephra.

Cask has also worked with customers to develop a number of applications like network analytics and social media sentiment analysis, providing developers with reference applications and starting points for common use cases. However, we still found our customers and the community spending far too much time struggling to write basic apps for “simple” tasks like data ingestion, ETL, data as a service and data quality. Limited to hand-writing Java code and scripting together command-line tools, those tasked with setting up even the simplest of data lakes carry a significant burden.

Application Templates are a new core capability added in CDAP 3.2 that extends the dataset concept from individual data patterns to complete application patterns. Application Templates are based on the concepts of Applications and Plugins. An application can contain any number of programs, such as Spark or MapReduce, and those programs can define and reference the API of a plugin.

Cask Hydrator

As the first example, Cask Hydrator is implemented as an application template for batch and realtime ETL. It defines plugin APIs for source, transform and sink. You can create instances of an ETL pipeline through JSON configuration. New sources, transforms and sinks can be easily developed as plugins in Java.
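The source, transform and sink pattern driven by JSON configuration can be sketched as follows. This is a toy illustration under stated assumptions, not Hydrator itself: real plugins are written in Java, and the registries, the plugin names (`inline`, `rename`, `memory`) and the configuration layout here are hypothetical rather than the actual Hydrator schema.

```python
import json
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]

# Hypothetical in-memory stores that the toy sink writes into.
STORES: Dict[str, List[Record]] = {}


def inline_source(props: dict) -> Iterable[Record]:
    """Toy source: emits the records listed directly in its properties."""
    return [dict(r) for r in props["records"]]


def rename_transform(props: dict) -> Callable[[Record], Record]:
    """Toy transform: renames one field, leaving the other fields untouched."""
    old, new = props["from"], props["to"]

    def apply(record: Record) -> Record:
        out = dict(record)
        if old in out:
            out[new] = out.pop(old)
        return out

    return apply


def memory_sink(props: dict) -> Callable[[Record], None]:
    """Toy sink: appends each record to a named list in STORES."""
    target = STORES.setdefault(props["name"], [])
    return target.append


# Hypothetical plugin registries: plugin name -> factory taking properties.
SOURCES = {"inline": inline_source}
TRANSFORMS = {"rename": rename_transform}
SINKS = {"memory": memory_sink}


def run_pipeline(config_text: str) -> None:
    """Build a pipeline instance from JSON and push every record through it."""
    config = json.loads(config_text)
    source = SOURCES[config["source"]["name"]](config["source"]["properties"])
    transforms = [
        TRANSFORMS[t["name"]](t["properties"])
        for t in config.get("transforms", [])
    ]
    sink = SINKS[config["sink"]["name"]](config["sink"]["properties"])
    for record in source:
        for transform in transforms:
            record = transform(record)
        sink(record)


# Illustrative configuration only; the field layout is an assumption.
CONFIG = """
{
  "source": {"name": "inline",
             "properties": {"records": [{"usr": "ada"}, {"usr": "bob"}]}},
  "transforms": [{"name": "rename",
                  "properties": {"from": "usr", "to": "user"}}],
  "sink": {"name": "memory", "properties": {"name": "users"}}
}
"""
```

Calling `run_pipeline(CONFIG)` leaves the renamed records in `STORES["users"]`. Adding a new source, transform or sink in this sketch means registering one more factory, with no change to `run_pipeline`.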

Cask Hydrator is open source and intended for developers, data miners and data scientists. It provides simplified ingestion and ETL for the rapid enablement of Hadoop Data Lakes. In addition to the extensible template and plugin framework, Cask Hydrator includes a rich and customizable user interface for self-service definition, deployment and monitoring of ETL pipelines. We maintain an open source repo for sources, transforms and sinks and currently provide support for a wide variety of sources and sinks including HDFS, HBase, Amazon S3, traditional RDBMS and EDWs, Kafka, Cassandra and Elasticsearch.

We believe Cask Hydrator is a much-needed addition to the Hadoop community as a completely open source, fully extensible, Java-based, Hadoop-native ETL framework with a user interface on top that puts it into the hands of many more users. Utilizing the application template framework, Cask will continue to develop applications like Hydrator to radically simplify and accelerate common use cases. In addition, advanced developers can now build complex applications and expose simple flex points through plugins, which enables a much broader set of developers to extend those applications.

Customer Examples

Cask Hydrator has already been in the hands of customers, and some close watchers of CDAP may have noticed an early public release available in CDAP 3.1.1 without the branding. We already have a number of projects well underway.

The primary usage so far for Cask Hydrator has been to give existing big data infrastructure teams a flexible and extensible ETL framework that can be easily operationalized and supported in production. Currently people are struggling with manual processes based on things like MapReduce code, Oozie workflows and Sqoop imports cobbled together with scripts and cron. Cask Hydrator enables these teams to build their transforms into reusable plugins, utilize existing database and NoSQL connectors to load data from external sources, and manage and operationalize those jobs with the power of CDAP.

Several customers with large data science teams are working with Cask Hydrator to enable better self-service access to internal data lakes. Currently data scientists must either file tickets and wait for others to load datasets for them, or take on the burden (see above) themselves to get it done. By exposing the Cask Hydrator user interface, data scientists can point and click to configure and deploy ETL jobs based on the options made available by the infrastructure team.

Finally, one of the examples I’m most excited about is with an advanced customer who is using Cask Hydrator to fundamentally alter their internal process. A large user of Kafka, this customer provides a simple way for any individual team to write data into a Kafka topic and have that data land in Hadoop. Historically the events written into Kafka had to be in a specific, canonical format with certain fields defined and cleansed, to ensure simple downstream integration with the analytics pipelines. However, this polluted applications with specifics of the analytics system, and the use of multiple languages across the organization further complicated the standardization of events. With Cask Hydrator enabling each individual team to maintain its own ETL pipelines, teams can now write to Kafka using any format they choose and perform the necessary data preparation using pre-built transforms in Cask Hydrator.
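As a rough sketch of that arrangement, each team can own a small chain of reusable transforms that maps its native event format to the canonical one. Everything here is hypothetical: the team names, the field names and the helper functions are invented for illustration, and neither Kafka nor Hydrator is actually involved.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Transform = Callable[[Event], Event]


def rename_field(old: str, new: str) -> Transform:
    """Reusable transform: move a value from one field name to another."""
    def apply(event: Event) -> Event:
        out = dict(event)
        if old in out:
            out[new] = out.pop(old)
        return out
    return apply


def lowercase_field(name: str) -> Transform:
    """Reusable transform: normalize a field's value to lowercase text."""
    def apply(event: Event) -> Event:
        out = dict(event)
        if name in out:
            out[name] = str(out[name]).lower()
        return out
    return apply


def require_fields(*names: str) -> Transform:
    """Reusable transform: reject events missing canonical fields."""
    def apply(event: Event) -> Event:
        missing = [n for n in names if n not in event]
        if missing:
            raise ValueError(f"missing canonical fields: {missing}")
        return event
    return apply


def compose(steps: List[Transform]) -> Transform:
    """Chain transforms so that they run in the order given."""
    def apply(event: Event) -> Event:
        for step in steps:
            event = step(event)
        return event
    return apply


# Hypothetical per-team pipelines built from the shared transforms above.
CANONICAL = ("user_id", "timestamp")
TEAM_PIPELINES: Dict[str, Transform] = {
    "checkout": compose([
        rename_field("uid", "user_id"),
        rename_field("ts", "timestamp"),
        require_fields(*CANONICAL),
    ]),
    "search": compose([
        rename_field("userId", "user_id"),
        rename_field("time", "timestamp"),
        lowercase_field("user_id"),
        require_fields(*CANONICAL),
    ]),
}


def to_canonical(team: str, event: Event) -> Event:
    """Apply the owning team's pipeline to one raw event."""
    return TEAM_PIPELINES[team](event)
```

In this sketch the producing applications keep their own field names, and the knowledge of the canonical format lives only in the per-team pipeline definitions.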

The Future is Apps, The Future is Diverse

Application Templates and our expansion beyond Hadoop are indicative of two of the things we believe strongly at Cask.

The first is that the future of big data is applications. When analytics turn into applications they drive tremendous business value. By continuing to focus on enabling more developers and organizations to develop more applications faster, we hope to accelerate the progress of the ecosystem and create disruptive business value. Application templates, and frameworks like Cask Hydrator, can serve to further reduce the burden on developers and provide far greater reuse and collaboration within and between organizations.

The second is that the future of big data is diverse. Open source drives innovation in software today and has led to an unheard-of pace of new projects and technologies, each different in its own way, all created to solve real problems but not always created for enterprises and not always integrated into existing systems. There is a need for a standardized layer to insulate applications and organizations from the myriad open source infrastructure systems that exist today, and the many more to come.

If you’re attending Strata + Hadoop World 2015 in New York City next week, come check us out at kiosk 751 all week, and see my talks on CDAP Use Cases and on Cask Hydrator.