Cask Blog

Cask Tracker Enhanced: Metadata Taxonomy and Data Usage Analytics in CDAP 3.5

Yue Gao and Riwaz Poudyal

Cask Tracker is a self-service CDAP Extension that automatically captures rich metadata and provides users with visibility into how data is flowing into, out of, and within a Data Lake. Tracker was first introduced in CDAP v3.4. Tracker v0.2 has just been released along with CDAP 3.5 and packs a ton of new features. Dataset … Read more


Long Running Tests on CDAP

Vinisha Vyasa

At Cask we are huge proponents of test automation. We routinely run unit, integration, and performance tests to ensure the correctness and stability of our software. While these tests are great at catching a lot of issues early on, there are some aspects of distributed systems that are hard to capture with these tests – … Read more


A Hydrator Python Transform for Python nerds like you and me!

John Jackson

Before every CDAP release, we at Cask conduct an internal hackathon to use CDAP and work on interesting features. A few Cask engineers got together and, wanting to open up the capabilities of Cask Hydrator beyond Java developers, decided to build a transformation that uses user-written Python. Beginning with CDAP release 3.2, the CDAP UI … Read more


Running Legacy MapReduce Jobs in CDAP

Rohit Sinha

The Cask Data Application Platform is an integrated developer platform for the Hadoop ecosystem. With CDAP, developers can address a broader set of batch and real-time use-cases with easy-to-use abstractions. Developers can write MapReduce programs using CDAP and deploy them as CDAP applications easily, as explained in this guide. Running MapReduce programs inside CDAP has … Read more


Stream Views in CDAP

Alvin Wang

In a previous blog post, we outlined how schema-on-read works with streams. Schema-on-read allows users to decouple data ingestion from exploration. In this post, we will see how users can attach multiple views to the same stream using a feature called stream views. Stream views provide a way to read from the same stream … Read more


SockJS + $resource = Awesomeness!

Ajai Narayanan

The Cask Data Application Platform (CDAP) is an open-source platform to build and deploy data applications on Apache Hadoop™. As of version 3.0, it includes a slick new user interface to help users deploy, manage and monitor their data applications. This UI provides real-time updates from the CDAP backend. Problem Statement: Initiating too many HTTP … Read more


Caskalytics: Multi-log Analytics Application

Derek Tzeng & Jay Jin

This summer, we joined Cask as interns to work on the Cask Data Application Platform (CDAP). Our project, internally codenamed Caskalytics, was to create an internal Operational Data Lake. In reality, a data lake is just a concept focused on storing data from disparate sources (real-time or batch, structured, unstructured or semi-structured) in a single big data … Read more


A Look at Automating Cluster Creation in the Cloud with Coopr

David Bajot

Coopr is a cluster provisioning system designed to fully facilitate cluster lifecycle management in public and private clouds. In this blog, we will take an inside look at what happens when Coopr provisions a cluster. Deploying clusters can be time-consuming. For many system deployments, this work can be accomplished with a configuration management tool such … Read more


CDAP 3.0 – From Zero to App in 5 minutes

The Cask Data Application Platform (CDAP) was created with the intent of empowering all developers to build data applications. It was, is and always will be a developer platform – a platform with the mission to provide developers with simple access to powerful technology. CDAP has proven to significantly lower the barriers to building Hadoop … Read more