Efficient Use of Hadoop Cluster with YARN Capacity Scheduler

As organizations see an increase in Hadoop adoption, there is a spike in both the number of jobs run on a Hadoop cluster and the number of tenants using it. Using the cluster effectively becomes important from an administration perspective. Consolidating data and allowing multiple tenants to share a …


Deploying CDAP packages from source via Coopr

Developing features for CDAP follows a workflow similar to that of many other projects. Developers keep a local checkout of the source, make modifications in a feature branch, build and test locally on their development machines, push their branch, and submit a pull request for code review. During this process, developers build CDAP clusters (for testing) …



Weblog Analytics on Apache Hadoop™

Hadoop provides specialized tools and technologies that can be used for transporting and processing huge amounts of weblog data. In this blog post, we’ll explore the end-to-end process of aggregating logs, processing them, and generating analytics on Hadoop to gain insights about how users interact with your website. With the digitization of the world, generating knowledge …


Hadoop Vendor OS Support Matrix

Developing our open source data application platform, CDAP, which runs on top of Apache™ Hadoop®, can be a challenging task. It requires testing many different configurations, on multiple Hadoop vendor distributions and on many different Linux distributions. Setting up and testing all of these configurations can be extremely difficult without a simple reference of supported Linux distributions …


Multitenancy for Hadoop: Namespaces

Bhooshan Mogal

As a data processing platform, Hadoop’s popularity today is often attributed to its cost-effectiveness, derived equally from the use of commodity hardware and from the ability to co-locate work on shared compute and storage resources. Sharing resources allows organizations to maximize the throughput and utilization of a small number of large clusters instead of managing a large …