Building the Cask Hydrator Backend

Albert Shau is a software engineer at Cask, where he is working to simplify data application development. Prior to Cask, he developed recommendation systems at Yahoo! and search systems at Box.

We introduced Cask Hydrator to provide an easy way for users to build a data lake. Users can create ETL pipelines by simply choosing a source, one or more sinks, and optional transforms. When we were designing Hydrator, we wanted to make sure it was easy to use: users should be able to configure a pipeline without writing any code. We also wanted the system to be extensible, so that users could plug in their own custom sources, sinks, and transforms. In this blog post, we will cover the technical aspects of the Hydrator infrastructure. More specifically, we will look at how Hydrator uses the artifact, application configuration, and plugin features newly added in CDAP 3.2.

Creating an ETL Pipeline

When users create an ETL pipeline using Hydrator, they are essentially using the UI to deploy a new application in CDAP. A CDAP application is a deployable unit of code and configuration that is composed of CDAP programs and datasets. An application consists of two main elements – the code (the JAR file that contains CDAP programs and Datasets) and the configuration. Prior to 3.2, both code and configuration were supplied during application deployment time. In order to make Hydrator easy to use, we did not want to require users to deploy code each time they create a pipeline. We wanted users to be able to reuse the same code, but supply different configurations to create different pipelines.

In order to allow code reuse, we promoted the JAR file — or artifact — to a first-class citizen. In CDAP 3.2, users first add artifacts, then create applications by simply specifying an artifact name and version. This lets multiple applications use the same code, and lets users create applications without deploying code. It also enables explicit version management for applications.

[Figure: creating a pipeline]

For Hydrator, we have included two ETL artifacts out of the box — one for batch pipelines and one for real-time pipelines. When you use Hydrator to create an ETL pipeline, you are really telling CDAP to create an application using one of these two ETL artifacts, and using some application configuration that specifies the pipeline structure.

Application Configuration

Also new in CDAP 3.2 is Application Configuration, which allows users to create multiple applications that use the same artifact but are configured to behave differently from each other. In older versions of CDAP, each application you created needed to declare in Java code which programs and datasets it used. For example, if your application wrote to a Table dataset named “users”, your application class would hardcode that name:

import co.cask.cdap.api.app.AbstractApplication;
import co.cask.cdap.api.dataset.table.Table;

public class ETLApp extends AbstractApplication {
  @Override
  public void configure() {
    // The dataset name is hardcoded; every deployment writes to "users".
    createDataset("users", Table.class);
    addMapReduce(new ETLMapReduce());
  }
}

If you wanted to create another application that did exactly the same thing but wrote to a different Table, you would have had to edit the code. For Hydrator, this is clearly unacceptable: we don’t want users to write code for each new ETL pipeline. To solve this problem, we introduced application configuration. This new feature lets you supply configuration when you create an application. The Java code can then read that configuration to decide which programs and datasets the application uses:

import co.cask.cdap.api.Config;
import co.cask.cdap.api.app.AbstractApplication;
import co.cask.cdap.api.dataset.table.Table;

public class ETLApp extends AbstractApplication<ETLConfig> {
  public static class ETLConfig extends Config {
    // The value of this field is injected by CDAP to be whatever
    // was passed in by the user.
    private String name;
  }

  @Override
  public void configure() {
    ETLConfig config = getConfig();
    // The dataset name now comes from the user-supplied configuration.
    createDataset(config.name, Table.class);
    addMapReduce(new ETLMapReduce());
  }
}
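To see how such injection can work in principle, here is a minimal, self-contained sketch using plain Java reflection. This is an illustration of the mechanism only, not CDAP’s actual implementation; the `ConfigInjectionSketch` and `inject` names are hypothetical.

```java
import java.lang.reflect.Field;
import java.util.Map;

public class ConfigInjectionSketch {
  // Hypothetical stand-in for a Config subclass with an injectable field.
  public static class ETLConfig {
    private String name;
    public String getName() { return name; }
  }

  // Illustrative only: copy each user-supplied value onto the matching
  // declared field, roughly what a framework does when it "injects" config.
  public static <T> T inject(T config, Map<String, Object> values) throws Exception {
    for (Field field : config.getClass().getDeclaredFields()) {
      if (values.containsKey(field.getName())) {
        field.setAccessible(true);
        field.set(config, values.get(field.getName()));
      }
    }
    return config;
  }

  public static void main(String[] args) throws Exception {
    ETLConfig config = inject(new ETLConfig(), Map.of("name", "users"));
    System.out.println(config.getName()); // prints "users"
  }
}
```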

Our ETL artifacts use Application Configuration to specify which source, transforms, and sinks to use, as well as additional settings for each of them. For example, the Twitter source reads the configuration to determine which credentials to use when reading from Twitter, while the Database source reads it to determine which host, port, and table to read from.
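Concretely, creating a pipeline amounts to handing CDAP a configuration along these lines. This is only a sketch: the plugin names and properties shown here are illustrative, and the exact schema is defined by the ETL artifacts themselves.

```json
{
  "artifact": { "name": "cdap-etl-batch", "version": "3.2.0" },
  "config": {
    "source": {
      "name": "Database",
      "properties": { "connectionString": "jdbc:mysql://localhost:3306/prod", "tableName": "users" }
    },
    "transforms": [],
    "sinks": [
      { "name": "Table", "properties": { "name": "users_copy" } }
    ]
  }
}
```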

Plugin Framework

Hydrator ships with a wide range (over thirty) of built-in sources, transforms and sinks. If those are not enough, there is also a way to add your own using the plugin framework. A plugin is simply a Java class that can be used by an application. Most of the time, a plugin is an implementation of an interface or abstract class defined in the application class. For Hydrator, the built-in ETL artifacts expose source, transform and sink interfaces that users can implement to extend Hydrator’s functionality.

[Figure: the plugin framework]

To create a custom source, a user would write Java code to implement that source, package it in a JAR file, then add that JAR as an artifact in CDAP. More detailed examples can be found in the CDAP plugins repository. An application could then be configured to use that new source alongside all the others already present in the system. In fact, the built-in sources, transforms, and sinks are all implemented as plugins. In this way, the plugin system lets you create extensible types of applications.
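The pattern can be sketched in plain Java. The `Transform` interface below is a hypothetical stand-in for the interfaces the ETL artifacts actually expose (see the CDAP plugins repository for the real APIs); the point is simply that a plugin is an ordinary class implementing an interface published by the application.

```java
import java.util.Locale;

public class PluginSketch {
  // Hypothetical stand-in for a transform interface exposed by an ETL artifact.
  public interface Transform {
    String transform(String record);
  }

  // A user-supplied plugin: implement the interface, package it in a JAR,
  // and add the JAR as an artifact so pipelines can reference it by name.
  public static class UppercaseTransform implements Transform {
    @Override
    public String transform(String record) {
      return record.toUpperCase(Locale.ROOT);
    }
  }

  public static void main(String[] args) {
    Transform t = new UppercaseTransform();
    System.out.println(t.transform("hello")); // prints "HELLO"
  }
}
```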

Conclusion

From a design perspective, it was very important to us that nothing ETL-specific snuck into the CDAP APIs. We introduced artifacts, application configuration, and plugins not only because Hydrator requires them, but because we believe they are useful features for all types of applications. Try out Hydrator and these new features, and let us know what you think!
