Cask Blog

Deploying CDAP packages from source via Coopr

Developing features for CDAP follows a workflow similar to that of many other projects. Developers have their local checkout of the source, make modifications in a feature branch, build and test locally on their development machines, push their branch, and submit a pull request for code review. During this process, developers build CDAP clusters (for testing) … Read more


Weblog Analytics on Apache Hadoop™

Hadoop provides specialized tools and technologies that can be used for transporting and processing huge amounts of weblog data. In this post, we’ll explore the end-to-end process of aggregating logs, processing them, and generating analytics on Hadoop to gain insights about how users interact with your website. With the digitization of the world, generating knowledge … Read more


Hadoop Vendor OS Support Matrix

Developing our open source data application platform, CDAP, which runs on top of Apache™ Hadoop®, can be a challenging task. It requires testing many different configurations, on multiple vendors of Hadoop, and on many different distributions of Linux. Setting up and testing all of these configurations can be extremely difficult without a simple reference of supported Linux distributions … Read more


Multitenancy for Hadoop: Namespaces

Bhooshan Mogal

As a data processing platform, Hadoop’s popularity today is often attributed to its cost-effectiveness, derived equally from the usage of commodity hardware and from the ability to co-locate work on shared compute and storage resources. Sharing resources allows organizations to maximize the throughput and utilization of a small number of large clusters instead of managing a large … Read more


Data-driven job scheduling in Hadoop

Julien Guery

Triggering the processing of data in Hadoop—as soon as enough new data is available—helps optimize many incremental data processing use-cases, but is not trivial to implement. The ability to schedule a job (such as MapReduce or Spark) to run as soon as there’s a certain amount of unprocessed data available—for instance, in a set of … Read more


CDAP v2.8.0 is out in the wild

I am very happy to announce that the latest release of our flagship product – the Cask Data Application Platform (CDAP) – v2.8.0 is now available for everyone to download. This release has a bunch of cool features that our customers, partners and the community want: Namespaces (provides application and data isolation that enables multi-tenancy) … Read more



How we built it: designing a globally consistent transaction engine

At Cask, we are committed to contributing back to the open source community. One of our latest open-sourced projects is Tephra, a system that adds complete transaction support to Apache HBase™. As an XA-style transaction system, Tephra is designed to be agnostic to the underlying data stores, so its usage is not limited to HBase. … Read more


Strata + Hadoop World NYC 2014 Recap: Four Trends in Hadoop

The Cask team had a great and productive time at Strata + Hadoop World earlier this month in New York City! We are very optimistic about the robust growth in Hadoop adoption, increased participation from a broad range of developers and companies in many industries, and continued maturation in the early days of this technology. As … Read more


Introducing Tigon: Real-time streaming for the real world

In collaboration with AT&T Labs, today we are releasing version 0.2.0 of the open source Tigon project, a real-time streaming analytics framework for Hadoop based on technology contributed by both companies. By combining AT&T’s low-latency and declarative language support with our durable, high-throughput computing capabilities and procedural language support, Tigon provides developers with a new … Read more