Cask Blog

Tephra: A Transaction engine for HBase – moves to Apache Incubation!

Cask Data Application Platform (CDAP) simplifies Big Data application development by abstracting many of Hadoop’s complexities and enabling developers to use familiar skills. We found that one of the best ways to simplify distributed programs is to have exactly-once processing semantics. Having exactly-once processing makes it easy to reason about the state of the system … Read more


A Data Quality Application Template for CDAP

Shilpa Subrahmanyam

One of Cask’s core goals is making a reasonably experienced Java developer’s life much easier when building Hadoop applications. My summer project was aligned with the company’s effort to take this to the next level by lowering the barrier to entry for using Hadoop even further: Java proficiency not required. I spent my summer writing … Read more



AeroCask – Real-time Flight Data Analytics using CDAP

One of the many things that I love about Cask is the hackathons before every release. They are not only a way for us to dog-food new features in the CDAP platform but also an opportunity to let your imagination run loose and implement an integration with another system, or develop an interesting … Read more


CDAP 3.1 adds MapR support, Spark integration, enhanced Datasets and much more!

Shankar Selvam

We are excited to announce the release of the Cask Data Application Platform (CDAP) v3.1.0. In this release we have added support for MapR, which gives users more distro choice when using CDAP. Furthermore, this release expands our footprint to support CDH 5.4, HDP 2.2 and Apache Hadoop with HBase 1.0 and Hive 1.1. … Read more


What is Hadoop, anyway?

Recently my co-worker Derek posted an article about which versions of Hadoop infrastructure components are included in the various distributions. One of the reactions was a tweet questioning whether such things as HBase and Spark should be considered part of core Hadoop. Even though this tweet was purely about core Hadoop, it made me think … Read more


Multitenancy for Hadoop: Namespaces – Part II

Bhooshan Mogal

We introduced the concept of namespaces and how they help bring multitenancy to Apache Hadoop in a previous blog. We also briefly introduced the use of namespaces in CDAP, leaving out the implementation details. In this blog we’ll discuss some of the requirements that influenced the design of namespaces in CDAP, as well as … Read more


Hadoop Components Versions in Distros Matrix

The Apache Hadoop ecosystem is always evolving, with the major distributions constantly upgrading their included core Hadoop components. This can present a challenge when building any application which runs on top of Hadoop. When developing our open-source application framework, CDAP, we strive to maintain compatibility with all major Hadoop distributions. Building on our previous reference … Read more


Scalable Distributed Transactional Queues on HBase

A real-time stream processing framework usually involves two fundamental constructs: processors and queues. A processor reads events from a queue, executes user code to process them, and optionally writes events to another queue for additional downstream processors to consume. Queues are provided and managed by the framework. Queues transfer data and act as a … Read more