CDAP 4.1 – More Enterprise-Grade Hardening, Pre-Built Solutions and Enhanced UX

Nishith Nand

Nishith Nand is a software engineer at Cask, where he builds the platform for the next generation of data applications. Prior to Cask, he built high-performance, large-scale distributed systems at Pepperdata and Hedvig.

We are happy to announce the release of Cask Data Application Platform (CDAP) version 4.1. This new release brings with it some major enhancements and significant new capabilities in the platform, as well as new, ready-to-use solutions offered via Cask Market.

CDAP 4.1 improves security by allowing fine-grained secure impersonation. It introduces replication so that administrators can set up Hot-Cold replication for CDAP. This release also brings a completely redesigned Log Saver, which is pluggable, resilient, and more memory efficient. CDAP 4.1 also includes a long list of UI/UX improvements.

Cask Market now features three new packaged solutions designed to address common but complex big data problems: EDW Offload, an Event-Condition-Action (ECA) Framework for IoT, and HEDIS Healthcare Reporting. It also offers several new plugins, including Amazon DynamoDB, Real-time CDAP streaming source, Date transform, and Fast Filter.

Here are some of the new features in more detail.

Fine-Grained Secure Impersonation

Secure impersonation was introduced in CDAP version 3.5. It allowed system administrators to configure a user at the namespace level. With namespace-level impersonation, every namespace has a single Kerberos principal that all programs in that namespace run as and access resources as. This is useful, but too restrictive for some use cases.

Release 4.1 introduces secure impersonation at the application level. Application-level impersonation allows every application to specify a single Kerberos principal that all programs in that application run as and access resources as. Any streams or datasets created by the application are owned by that user. Streams and datasets created outside of an application can be created with a principal; otherwise, they are owned by the Kerberos principal defined for the namespace in which they are created. The owner information is pushed down to HDFS and HBase, which are the underlying storage providers.
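The ownership rule described above boils down to a simple fallback. Here is a minimal Python sketch of that rule, for illustration only: it is not CDAP code, and the function name `resolve_owner` and the principal strings are hypothetical.

```python
from typing import Optional


def resolve_owner(namespace_principal: str,
                  specified_principal: Optional[str] = None) -> str:
    """Return the Kerberos principal that owns a stream or dataset.

    Hypothetical helper, not a CDAP API. `specified_principal` stands for
    either the principal of the application that creates the entity, or a
    principal given when the entity is created outside of an application.
    """
    if specified_principal:
        # A more specific principal was provided, so it owns the entity.
        return specified_principal
    # Otherwise fall back to the principal configured for the namespace.
    return namespace_principal


# Example principals are made up for illustration.
app_owned = resolve_owner("ns_user@EXAMPLE.COM", "app_user@EXAMPLE.COM")
ns_owned = resolve_owner("ns_user@EXAMPLE.COM")
```

In this sketch `app_owned` resolves to the application principal, while `ns_owned` falls back to the namespace principal.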

Hot-Cold Replication

To support Hot-Cold Replication, CDAP 4.1 includes a Service Provider Interface (SPI) for all Apache HBase DDL operations (e.g. create table, create namespace, drop table) that happen through CDAP. Users can implement this SPI to plug in their own logic during the creation of HBase tables. This allows users to create tables in multiple clusters and set up replication between them. CDAP also provides a replication status tool that tracks the state of the master and slave clusters.

Program Resiliency

CDAP programs are now more resilient to the unavailability of underlying services such as HDFS, YARN, and HBase. Cluster administrators can configure retry policies for different program types, then override those policies at the namespace, application, program, or run level.
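One way to picture the override chain is as a layered lookup where the most specific level wins. The Python sketch below illustrates that idea only; the setting names (`max_retries`, `base_delay_ms`) and the function are hypothetical and are not CDAP's actual property names or API.

```python
from typing import Dict

# Levels ordered from least to most specific; later levels override earlier ones.
LEVELS = ("cluster", "namespace", "application", "program", "run")


def resolve_retry_policy(settings: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Merge retry settings so that the most specific level wins.

    `settings` maps a level name to the retry settings defined at that level.
    Levels that define nothing can simply be left out.
    """
    resolved: Dict[str, int] = {}
    for level in LEVELS:
        # Settings from a more specific level replace the inherited values.
        resolved.update(settings.get(level, {}))
    return resolved


policy = resolve_retry_policy({
    "cluster": {"max_retries": 3, "base_delay_ms": 100},  # administrator defaults
    "application": {"max_retries": 10},                   # override for one application
    "run": {"base_delay_ms": 500},                        # override for a single run
})
```

Here `policy` ends up with `max_retries` of 10 from the application level and `base_delay_ms` of 500 from the run level.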

Improved User Experience

The CDAP UI in 4.1 provides an integrated experience that unifies visual pipelines (ex-Hydrator) and metadata management (ex-Tracker) with the rest of CDAP. In addition, it significantly simplifies navigation with new and revamped entity detail pages, Calls to Action, and a Just Added section. It also provides a new capability in visual pipelines to create dedicated error processing flows.

It also includes a beta version of Data Preparation that allows users to transform data visually using a set of directives.

Enhanced Logging Framework

Log Saver has been completely redesigned for CDAP 4.1. It is now pluggable to allow for new logging extensions.

CDAP’s logging architecture follows the Logback appender API, which allows users to develop custom appenders based on their needs and make them available to Log Saver.

Furthermore, we have improved memory efficiency: the new architecture adjusts the rate at which it retrieves messages from Apache Kafka based on the processing rate and memory usage.
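This post does not describe the exact algorithm, but the general idea of adapting the fetch rate to processing speed and memory pressure can be sketched as follows. This is a simplified illustration in Python, not Log Saver's implementation; the function name, thresholds, and halve-or-double policy are all assumptions.

```python
def next_fetch_size(current: int,
                    processed_last_round: int,
                    memory_used_fraction: float,
                    min_size: int = 100,
                    max_size: int = 10_000,
                    memory_high_watermark: float = 0.8) -> int:
    """Choose how many messages to request in the next fetch.

    Assumed policy: back off when memory is tight or the consumer could not
    keep up with the previous batch, otherwise ramp up.
    """
    falling_behind = processed_last_round < current
    memory_tight = memory_used_fraction >= memory_high_watermark
    if falling_behind or memory_tight:
        proposed = current // 2   # slow down to relieve pressure
    else:
        proposed = current * 2    # there is headroom, so fetch more
    # Keep the result within the configured bounds.
    return max(min_size, min(max_size, proposed))
```

With these assumed defaults, a consumer that processed a full batch of 1000 messages at 50% memory usage would ask for 2000 next, while the same consumer at 90% memory usage would drop to 500.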

The new design is also more resilient: if Log Saver is started on a backup cluster, it can efficiently determine which log event to resume processing from.

The logging location has been changed to improve security, and issues around log cleanup have been fixed.

New Packaged Solutions in Cask Market

With CDAP 4.1, the following pre-built, packaged Hadoop solutions are available through Cask Market:

EDW Offload helps customers increase business agility and drive significant cost savings by providing modern, self-service tools for moving expensive data warehouse storage and compute to Hadoop, while managing increasing data volume and variety. The Cask solution includes pre-built pipelines, drivers, transformation logic, and best practices to simplify moving data and workloads to Hadoop.

Event-Condition-Action (ECA) Framework for IoT is a pre-built, code-free ECA framework for IoT applications.

It provides the following features:

  • Continuous ingestion of any kind of data
  • Real-time parsing of events
  • Creation of a dynamic, parallelizable, and distributed rules engine (code-free)
  • Pluggable actions such as a reliable notification service

This solution is Spark-native (leveraging Spark Streaming as the real-time engine), is easily configurable (no restarts required), and offers REST APIs for easily building applications with a custom UI.

HEDIS Healthcare Reporting enables healthcare organizations to calculate HEDIS measures faster and more frequently for increased compliance and performance. HEDIS Reporting on CDAP, which includes ingestion pipelines, Spark jobs, data API services, and a web-based user interface, accelerates and automates the data collection and calculations required for HEDIS measures.

New Pipeline Plugins in Cask Market

With the CDAP 4.1 release, Cask Market now offers a number of new plugins, including Amazon DynamoDB, Real-time CDAP streaming source, Date transform, and Fast Filter. The Fast Filter transform makes it easy to filter the data flowing through a pipeline without affecting throughput. The Date transform allows the user to choose from various date formats and Unix epoch times. The new DynamoDB source and sink make it possible to fetch data directly from DynamoDB, Amazon’s NoSQL database service, into CDAP pipelines and vice versa. Finally, the Real-time CDAP stream source plugin allows users to ingest data into CDAP pipelines in real time from CDAP streams, which are the easiest way to put data into CDAP.

Transaction Pruning

The Transaction Service keeps track of all invalid transactions so as to exclude their writes from all future reads. Over time, this invalid list can grow and lead to performance degradation. From CDAP 4.1 onwards, CDAP supports automated pruning of the invalid transactions list.
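To make the idea concrete, here is a toy Python model of an invalid list. It is a sketch under stated assumptions, not the Transaction Service's implementation: cells are modeled as `(transaction_id, value)` pairs, and an invalid transaction is assumed to be safe to prune once none of its writes remain in storage.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, str]  # (id of the writing transaction, value)


def visible_values(cells: List[Cell], invalid: Set[int]) -> List[str]:
    """Return the values a reader sees: writes from invalid transactions are skipped."""
    return [value for tx_id, value in cells if tx_id not in invalid]


def prune_invalid(invalid: Set[int], cells: List[Cell]) -> Set[int]:
    """Keep only the invalid transaction ids that still have writes in storage.

    Assumption for this sketch: once every write of an invalid transaction
    has been removed, readers no longer need to filter on its id.
    """
    still_present = {tx_id for tx_id, _ in cells}
    return invalid & still_present


cells = [(1, "a"), (2, "b"), (3, "c")]
invalid = {2, 5}                          # transaction 5 has no writes left in storage

seen = visible_values(cells, invalid)     # the write of transaction 2 is hidden
pruned = prune_invalid(invalid, cells)    # transaction 5 drops out of the list
```

Every read pays the cost of checking the invalid list, which is why keeping that list short matters as a cluster ages.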

Download CDAP 4.1 and give it a spin. We actively welcome questions, comments and suggestions. Our user group is a great place to engage with the Cask team and the entire CDAP community.