Announcing CDAP 3.2 – Hydrator and much more!

Bhooshan Mogal

Bhooshan Mogal is a Software Engineer at Cask, where he is working on making data application development fun and simple. Before Cask, he worked on a unified storage abstraction for Hadoop at Pivotal and personalization systems at Yahoo.

We are excited to announce the Cask Data Application Platform (CDAP) 3.2 release.

This release brings many enhancements to existing CDAP features as well as lays the foundation for upcoming, advanced features—all designed to further simplify data application development.

Cask Hydrator

CDAP 3.2 introduces Cask Hydrator—a highly functional framework and UI to support self-service batch and real-time data ingestion, and ETL. Hydrator provides CDAP users a code-free way to configure, deploy, and operationalize ingestion pipelines from different types of data sources.

This release also adds multiple new features to the existing ETL framework in CDAP, including writing to multiple sinks and an extensible data validation layer. Building on enhancements to the existing ETL template, CDAP 3.2 extends the concept of Application Templates—first introduced in CDAP 3.0—to all applications. To achieve this, we’ve introduced Artifacts: an artifact is application code that can be extended with configuration and plugins. For example, an artifact can consist of a flow writing to a dataset, and multiple applications, each configured to write to a different dataset, can be created from that artifact. Without artifacts, users would have needed to write redundant application code with the dataset name hardcoded instead of configured at application creation time.
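To make the idea concrete, here is a rough sketch of what creating an application from an artifact could look like as a configuration. The artifact, plugin, and dataset names below are illustrative placeholders, not verbatim from this post—consult the CDAP 3.2 documentation for the authoritative format.

```json
{
  "artifact": {
    "name": "cdap-etl-batch",
    "version": "3.2.0",
    "scope": "SYSTEM"
  },
  "config": {
    "schedule": "0 * * * *",
    "source": {
      "name": "Stream",
      "properties": { "name": "purchases", "duration": "1h", "format": "csv" }
    },
    "sinks": [
      { "name": "Table", "properties": { "name": "purchasesByCustomer" } },
      { "name": "TPFSAvro", "properties": { "name": "purchasesArchive" } }
    ],
    "transforms": []
  }
}
```

The same artifact could be reused with a different `config` block to create a second pipeline writing to different sinks, which is exactly the redundancy that artifacts remove.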

We have also added a Data Quality Application as the third built-in CDAP Application after ETL Batch and ETL Real-time. The Data Quality application helps users monitor the quality of data consumed or produced by other applications.

Metadata, Data Discovery, Auditing and Lineage

CDAP 3.2 includes features for capturing and curating metadata. With this release, users can annotate CDAP entities (applications, programs, datasets, and streams) with both tags (labels) and properties (key-value pairs).
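As a sketch of how such annotation could be driven over CDAP’s REST interface, a client might compose tag and property requests along the following lines. The endpoint layout, host, and entity names here are assumptions for illustration, not taken from this announcement—see the CDAP 3.2 docs for the authoritative paths.

```python
# Illustrative sketch: composing metadata-annotation requests for CDAP
# entities. Endpoint paths and names are assumptions, not authoritative.
import json

BASE = "http://localhost:10000/v3/namespaces/default"

def tags_request(entity_type, entity_id, tags):
    """Build the (url, body) pair for annotating an entity with tags (labels)."""
    url = "%s/%s/%s/metadata/tags" % (BASE, entity_type, entity_id)
    return url, json.dumps(tags)

def properties_request(entity_type, entity_id, props):
    """Build the (url, body) pair for annotating an entity with key-value properties."""
    url = "%s/%s/%s/metadata/properties" % (BASE, entity_type, entity_id)
    return url, json.dumps(props)

# Tag a dataset, and attach an owner property to an application.
tag_url, tag_body = tags_request("datasets", "purchases", ["pii", "gold"])
prop_url, prop_body = properties_request("apps", "PurchaseApp", {"owner": "etl-team"})
```

The resulting `(url, body)` pairs would be sent as HTTP POSTs with any standard client; the sketch stops short of the network call so the shape of the request stays in focus.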

Data discovery allows users to search for CDAP entities based on metadata. To support this, this release adds prefix-based search to CDAP for the first time, paving the way for extended search capabilities, including integration with full-text search engines, in upcoming releases.

CDAP 3.2 automatically captures an audit trail of program and dataset interactions. From this, CDAP can generate lineage showing how datasets and programs are related to each other. This allows users to understand precisely how a dataset was modified or read from in a given time interval, by whom, and with what parameters. This information is extremely useful when tracking trusted or sensitive datasets, for example to identify which processes or data may have been impacted by an anomalous dataset.

Views

CDAP 3.2 introduces Views. Designed to simplify schema-on-read, Views provide a read-only view of a stream, with a specific read format. Read formats consist of a schema and a format (including CSV, TSV, Avro, amongst others). In future releases of CDAP, this capability will be extended further to allow users to create their own custom parsers.

When CDAP Explore is enabled, a Hive table is automatically created for you to perform ad-hoc queries on a view. In 3.2, Views are only available on Streams. However, later versions of CDAP will add support for Datasets.
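To illustrate what a read format pairs together, here is a rough sketch of the body of a view-creation request: a format name plus a schema describing how each stream event should be parsed. The field names and exact JSON layout are hypothetical, not taken from this post.

```json
{
  "format": {
    "name": "csv",
    "schema": {
      "type": "record",
      "name": "event",
      "fields": [
        { "name": "ts", "type": "long" },
        { "name": "customer", "type": "string" },
        { "name": "amount", "type": "double" }
      ]
    }
  }
}
```

Two views over the same stream could carry different formats or schemas, giving two schema-on-read interpretations of the same underlying data.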

Datasets and MapReduce

This release features a major improvement to Datasets and MapReduce programs: users can now write to multiple datasets, as well as to multiple partitions of a dataset, from a single MapReduce program. This capability powers the multiple-sinks support in the ETL framework mentioned earlier, and also serves general-purpose use cases that require writing to different datasets or partitions based on the data being written.

Hadoop Distribution Support

Lastly, CDAP 3.2 supports the latest versions of Apache HBase™ (1.1) and the Hortonworks Data Platform (2.3). It also integrates with Apache Ambari, enabling Hortonworks customers to use Ambari to install CDAP.

Download CDAP 3.2 today and take it for a spin! You can also help us improve the platform by reaching out to the community with comments, feedback, and suggestions, or by creating and following JIRA issues and submitting pull requests.

  • Deepak Dixit

    While using the normal stream client, I have observed that large files take a long time to upload as datasets. Would Hydrator help with this? Or is there any way to speed up the ingestion process?

    • Nitin

      Hi Deepak,

      Do you want to write data to a Stream, or are you just trying to get data from HDFS or a local filesystem into a Dataset?

      Thanks,
      Nitin

      • Deepak Dixit

        I am getting data from local filesystem.

        • Nitin

          If you are using Standalone, then yes. Hydrator has a Stream source in the batch ETL pipeline that would allow you to read it in parallel.
