Cask Blog

CDAP Workflows: In Comparison with Apache Oozie

Bhooshan Mogal

Apache Oozie is a workflow scheduler system to manage Apache Hadoop™ jobs. It is one of the most popular open-source workflow scheduler systems for Hadoop. Cask Data Application Platform (CDAP) is an open-source platform to build and deploy data applications on Hadoop. CDAP provides abstractions on top of Hadoop that enable developers to rapidly build, … Read more


AeroCask – Real-time Flight Data Analytics using CDAP

One of the many things that I love about Cask is the hackathons before every release. They are not only a way for us to dog-food new features in the CDAP platform but also an opportunity to let your imagination run loose and implement an integration with another system, or develop an interesting … Read more


What is Hadoop, anyway?

Recently, my co-worker Derek posted an article about which versions of Hadoop infrastructure components are included in the various distributions. One of the reactions was a tweet questioning whether such things as HBase and Spark should be considered part of core Hadoop. Even though this tweet was purely about core Hadoop, it made me think … Read more


A Look at Automating Cluster Creation in the Cloud with Coopr

David Bajot

Coopr is a cluster provisioning system designed to fully facilitate cluster lifecycle management in public and private clouds. In this blog, we will take an inside look at what happens when Coopr provisions a cluster. Deploying clusters can be time-consuming. For many system deployments, this work can be accomplished with a configuration management tool such … Read more


Multitenancy for Hadoop: Namespaces – Part II

Bhooshan Mogal

In a previous blog, we introduced the concept of namespaces and how they help bring multitenancy to Apache Hadoop. We also briefly introduced the use of namespaces in CDAP, leaving out the implementation details. In this blog we’ll discuss some of the requirements that influenced the design of namespaces in CDAP, as well as … Read more


Hadoop Components Versions in Distros Matrix

The Apache Hadoop ecosystem is always evolving, with the major distributions constantly upgrading their included core Hadoop components. This can present a challenge when building any application which runs on top of Hadoop. When developing our open-source application framework, CDAP, we strive to maintain compatibility with all major Hadoop distributions. Building on our previous reference … Read more


Efficient Use of Hadoop Cluster with YARN Capacity Scheduler

As organizations see an increase in Hadoop adoption, there is a spike in both the number of jobs that are run on a Hadoop cluster and the number of tenants utilizing the cluster. Effectively utilizing a Hadoop cluster becomes important from an administration perspective. Consolidating data and allowing multiple tenants to share a … Read more