Integrating CDAP with Microsoft Azure HDInsight

Derek Wood is a DevOps Engineer at Cask, where he is building tools to manage and operate the next generation of Big Data applications. Prior to Cask, Derek ran large-scale distributed systems at Wells Fargo and at Yahoo!, where he was the senior engineering lead for the CORE content personalization platform.

We recently announced the integration of CDAP with the Microsoft Azure HDInsight platform. This post will give a behind-the-scenes look at this integration.

First, a bit about the integration itself. Azure HDInsight is an Apache Hadoop and Spark distribution powered by the cloud. This means it handles any amount of data, scaling from terabytes to petabytes on demand. You can spin up any number of nodes at any time, and pay only for the compute and storage that you use. The HDInsight Application Platform is an easy way to distribute, discover, and install solutions or applications built for the Apache Hadoop ecosystem. CDAP is among the first applications available for HDInsight “HBase” clusters.

To see a demo of how to accelerate time to value from data using CDAP on Azure HDInsight, please join the webinar titled: “Building Modern Data Apps on Microsoft Azure HDInsight” on November 9, featuring Nitin Motgi, CTO and co-founder at Cask, and Pranav Rastogi, Principal Program Manager at Microsoft Azure HDInsight.

This post highlights some of the key takeaways that could be useful for building applications for the HDInsight Application Platform.

Takeaways

Azure HDInsight separates compute and storage. Data processed by HDInsight can be stored in Azure Storage (HDFS-compatible) or Azure Data Lake Store; see the Azure HDInsight documentation for more information. This separation allows us to delete clusters when they are not in use without impacting the data, thus reducing cost. HDInsight exposes Azure Storage through the WASB (wasb://) protocol scheme. Although WASB is HDFS-compatible, we had to update CDAP to support the wasb scheme when accessing the Hadoop File System.
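As a quick illustration, once the wasb scheme is supported, data in Azure Storage can be addressed like any other Hadoop file system path. The account and container names below are placeholders:

```bash
# Listing data through the wasb:// scheme (account/container names are placeholders).
hadoop fs -ls wasb://mycontainer@myaccount.blob.core.windows.net/user/cdap/

# On HDInsight, fs.defaultFS typically points at the cluster's default WASB
# container, so unqualified paths resolve to Azure Storage as well:
hadoop fs -ls /user/cdap/
```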

Our initial tests determined that the standard block blobs used by WASB do not support the hsync() API, which is used both by CDAP’s transaction system, Apache Tephra (incubating), and by its real-time ingestion using Streams. Fortunately, Azure also offers page blobs, which are optimized for random read/write access and which HDInsight already uses for HBase’s own write-ahead log. hsync() support for page blobs had also just been added, which made them the clear choice for us. Working together with Microsoft, we were able to make progress quickly.
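For reference, the hadoop-azure WASB driver selects which directories are stored as page blobs via its fs.azure.page.blob.dir setting. The sketch below shows one way to inspect that setting on a cluster node; the sample value is purely illustrative:

```bash
# Print the directories the WASB driver stores as page blobs (which support
# hsync()), as configured in core-site.xml.
hdfs getconf -confKey fs.azure.page.blob.dir
# e.g. /hbase/WALs,/hbase/oldWALs   <-- illustrative output only
```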

The next task was writing the automation to provision CDAP. Azure’s wealth of deployment options can be a bit overwhelming at first: solution templates, Resource Manager vs. Classic deployments, and so on. Fortunately, there were a few relevant examples to use, particularly the Hue example. Using this as a guide, we were able to quickly establish the framework for the automation.

In the latest version of the HDInsight platform, each application is given its own edge node, fully configured with an Ambari client to interact with the cluster. During deployment, you can specify a sequence of one or more scripts to be run on the edge node to install your application.
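As a rough sketch of how such a deployment is kicked off, assume a Resource Manager template (azuredeploy.json here is a placeholder name) that defines the application, its edge node, and the install script actions. The resource group, cluster name, and parameter names below are also placeholders:

```bash
# Deploy a solution template that adds an application (edge node plus install
# scripts) to a resource group; all names and files shown are placeholders.
az group deployment create \
  --resource-group my-hdinsight-rg \
  --template-file azuredeploy.json \
  --parameters clusterName=my-hbase-cluster
```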

The script to install and configure CDAP is relatively straightforward. We decided to leverage a lot of our existing automation and use Chef for the heavy lifting. The script clones the CDAP repo, sets up a Chef environment, generates a configuration for the CDAP cookbook, and executes the Chef run. The generation of the CDAP configuration reuses some common functions from the CDAP init framework to parse the Hadoop client configurations and determine the ZooKeeper quorum. It also reuses some of the build scripts for the CDAP VM to set up the Chef environment. Having reusable automation throughout the CDAP ecosystem got us up and running quickly.
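The snippet below is a much-simplified sketch of what such an edge-node script does, not the actual CDAP install script; the repository URL, attribute layout, and recipe name are illustrative assumptions:

```bash
#!/usr/bin/env bash
# Simplified sketch of a CDAP edge-node install script (illustrative only).
set -euo pipefail

# Install the Chef client via the omnibus installer.
curl -L https://omnitruck.chef.io/install.sh | sudo bash

# Fetch the CDAP automation (repository URL shown for illustration).
git clone https://github.com/caskdata/cdap.git /tmp/cdap

# Derive the ZooKeeper quorum from the cluster's HBase client configuration.
ZK_QUORUM=$(xmllint --xpath \
  "string(//property[name='hbase.zookeeper.quorum']/value)" \
  /etc/hbase/conf/hbase-site.xml)

# Render minimal attributes for the CDAP cookbook (attribute names assumed).
cat > /tmp/cdap-attributes.json <<EOF
{ "cdap": { "cdap_site": { "zookeeper.quorum": "${ZK_QUORUM}/cdap" } } }
EOF

# Run Chef with the CDAP cookbook to install, configure, and start CDAP
# (recipe name is illustrative).
sudo chef-solo -j /tmp/cdap-attributes.json -o 'recipe[cdap::fullstack]'
```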

Customizing CDAP on HDInsight

Currently, an HDInsight cluster with the default configuration provides a YARN capacity of 22 GB and 32 vcores, of which CDAP itself uses 12 GB and 8 vcores. It would be great if users could customize these settings via the Azure UI when provisioning; until then, one needs to log in, customize /etc/cdap/conf/cdap-site.xml, and restart CDAP. Another nice-to-have would be the ability to notify users that their cluster will be restarted when CDAP is added to an existing cluster.
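The manual workaround today looks roughly like the following sketch. The service names follow the standard CDAP packages, but the exact memory and vcore properties to adjust should be taken from the CDAP documentation:

```bash
# Adjust CDAP's YARN footprint on the edge node, then restart its services.
sudo vi /etc/cdap/conf/cdap-site.xml   # lower the relevant memory/vcore settings (see CDAP docs)
for svc in cdap-master cdap-router cdap-kafka-server cdap-ui; do
  sudo service "$svc" restart
done
```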

Conclusion

Overall, we are proud to offer such an easy way to provision a CDAP instance on Azure HDInsight. The process was a bit overwhelming at first, with all the different options available, but Microsoft gave us great support, including an in-person meeting with us. Since we started this integration, the Microsoft team has been actively improving the framework. Please try it out and share your experience.

For more information on how to use CDAP on HDInsight, please read the blog titled: “Using CDAP on HDInsight” by Bharath Sreenivas, Software Development Engineer at Microsoft.