Cask Blog

CDAP 3.1 adds MapR support, Spark integration, enhanced Datasets and much more!

Shankar Selvam

We are excited to announce the release of the Cask Data Application Platform (CDAP) v3.1.0. In this release we have added support for MapR, which gives users more choice of distribution when using CDAP. Furthermore, this release expands our footprint to support CDH 5.4, HDP 2.2 and Apache Hadoop with HBase 1.0 and Hive 1.1. … Read more


What is Hadoop, anyway?

Recently my co-worker Derek posted an article about which versions of Hadoop infrastructure components are included in the various distributions. One of the reactions was a tweet questioning whether such things as HBase and Spark should be considered part of core Hadoop. Even though this tweet was purely about core Hadoop, it made me think … Read more


Multitenancy for Hadoop: Namespaces – Part II

Bhooshan Mogal

We introduced the concept of namespaces and how they help bring multitenancy to Apache Hadoop in a previous blog post. We also briefly introduced the use of namespaces in CDAP, leaving out the implementation details. In this blog we’ll discuss some of the requirements that influenced the design of namespaces in CDAP, as well as … Read more


Hadoop Components Versions in Distros Matrix

The Apache Hadoop ecosystem is always evolving, with the major distributions constantly upgrading their included core Hadoop components. This can present a challenge when building any application which runs on top of Hadoop. When developing our open-source application framework, CDAP, we strive to maintain compatibility with all major Hadoop distributions. Building on our previous reference … Read more


Weblog Analytics on Apache Hadoop™

Hadoop provides specialized tools and technologies that can be used for transporting and processing huge amounts of weblog data. In this blog, we’ll explore the end-to-end process of aggregating logs, processing them and generating analytics on Hadoop to gain insights about how users interact with your website. With the digitization of the world, generating knowledge … Read more


Multitenancy for Hadoop: Namespaces

Bhooshan Mogal

As a data processing platform, Hadoop’s popularity today is often attributed to its cost-effectiveness, derived equally from the usage of commodity hardware and from the ability to co-locate work on shared compute and storage resources. Sharing resources allows organizations to maximize the throughput and utilization of a small number of large clusters instead of managing a large … Read more


Data-driven job scheduling in Hadoop

Julien Guery

Triggering the processing of data in Hadoop—as soon as enough new data is available—helps optimize many incremental data processing use-cases, but is not trivial to implement. The ability to schedule a job (such as MapReduce or Spark) to run as soon as there’s a certain amount of unprocessed data available—for instance, in a set of … Read more



How we built it: Making Hadoop data exploration easier with Ad-hoc SQL Queries

Please note: Continuuity is now known as Cask, and Continuuity Reactor is now known as the Cask Data Application Platform (CDAP). We are excited to introduce a new feature added in the latest 2.3 release of Continuuity Reactor – ad-hoc querying of Datasets. Datasets are high-level abstractions over common data patterns. Reactor Datasets provide a … Read more


Running Presto over Apache Twill

Alvin Wang

Please note: Continuuity is now known as Cask, and Continuuity Reactor is now known as the Cask Data Application Platform (CDAP). We open-sourced Apache Twill with the goal of enabling developers to easily harness the power of YARN using a simple programming framework and reusable components for building distributed applications. Twill hides the complexity of … Read more