How-to: Install and Use Cask Data Application Platform in Your Enterprise Data Hub

Albert Shau is a software engineer at Cask, where he is working to simplify data application development. Prior to Cask, he developed recommendation systems at Yahoo! and search systems at Box.

Cloudera customers can now install, launch, and monitor CDAP directly from Cloudera Manager.

Today we are very happy to introduce the integration of the Cloudera Enterprise Data Hub Edition (EDH) and the Cask Data Application Platform (CDAP). CDAP is an integrated platform for developers and organizations to build, deploy, and manage data applications on Hadoop. This initial integration enables CDAP to be installed, configured, and managed from within Cloudera Manager, a component of EDH. Furthermore, it simplifies data ingestion for a variety of data sources and enables interactive queries via Impala. Starting today, you can download and install CDAP directly from Cloudera’s downloads page.


In this post, you’ll learn how to get started by installing CDAP using Cloudera Manager. We have also created a video of this integration to accompany this post.


Installing CDAP with Cloudera Manager

To install CDAP on a cluster managed by Cloudera Manager, we have provided a Custom Service Descriptor (CSD). The CSD gives Cloudera Manager the information it needs on where to download CDAP from and how to configure and run CDAP services. To install the CDAP CSD, first drop the downloaded JAR (from Cloudera’s Downloads page or Cask’s Resources page) into the /opt/cloudera/csd directory on the Cloudera Manager server, and then restart Cloudera Manager. Cloudera’s full documentation on installing CSDs can be found here.
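For example, on the Cloudera Manager server the installation typically looks like the following (the JAR file name below is illustrative; use the file you actually downloaded):

$ sudo cp cdap-csd-2.7.0.jar /opt/cloudera/csd/
$ sudo service cloudera-scm-server restart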

Once the CSD is installed, the first thing you will want to do is download the CDAP parcel from the Hosts → Parcels page. Note that by default, the CSD adds a Remote Parcel Repository URL for the latest version of CDAP at “http://repository.cask.co/parcels/cdap/latest/”. If desired, you can specify a particular version of CDAP, for example “http://repository.cask.co/parcels/cdap/2.7/”.


With a CDAP Parcel Repository URL configured, you will now see the CDAP parcel available for download on the Parcels page. From there, you can download, distribute, and activate the CDAP parcel on your cluster hosts.


Once the parcel is installed, the next step is to configure and start the CDAP services. Before doing so, note that there are a few additional external requirements for running CDAP on your cluster; please refer to the prerequisites section. Once all the prerequisites are satisfied, you can begin adding the CDAP service via the “Add Service” wizard.


The “Add Service” wizard will guide you through selecting the hosts you want to run CDAP on and customizing the configuration. A few things to note during the wizard:

  • CDAP consists of the Gateway/Router, Kafka, Master, and Web-App roles, plus an optional Auth role. These can all be thought of as “master” services. We recommend installing all of these roles together on a host, with multiple hosts for redundancy. Additionally, there is a client “Gateway” role that can be installed on any host where you want to run CDAP client tools (such as cdap-cli).
  • There is an optional “Explore” capability for ad-hoc querying via Apache Hive. If you plan on using this, be sure to select the service dependency set containing Hive and check the “Explore Enabled” option on the configuration page.
  • If you are installing CDAP on a Kerberos-enabled cluster, you must select the “Kerberos Auth Enabled” checkbox on the configuration page.


Finally, sit back and watch as Cloudera Manager spins up your CDAP services! Once it completes, check out the CDAP Console from the “Quick Links” on the CDAP service overview page. For more details on the Cloudera Manager and CDAP integration, please see the documentation.
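Before moving on, you can optionally verify from the command line that the CDAP Router is up. The check below assumes the default router port of 11015 (used throughout this post) and the Router’s ping endpoint; a healthy Router typically responds with an OK message:

$ curl http://<hostname>:11015/ping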


Ingesting data and exploring it with Impala

Streams are the primary means of bringing data from external systems into CDAP in real time. They are ordered, time-partitioned sequences of data, usable for both real-time and batch collection and consumption of data. Using the CDAP Command Line Interface (CLI), you can easily create streams.
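The CLI ships with the client Gateway role mentioned earlier. On a parcel-based install, the launch script typically lives under the parcel directory; the exact path below is an assumption and may vary by version:

$ /opt/cloudera/parcels/CDAP/bin/cdap-cli.sh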


First, connect to your CDAP instance using the CLI:

> connect <hostname>:11015


Next, create a Stream:

> create stream trades


You can then add events to a Stream one by one:

> send stream trades 'NFLX,441.07,50'
> send stream trades 'AAPL,118.63,100'
> send stream trades 'GOOG,528.48,10'


Alternatively, you can add the entire contents of a file:

> load stream trades /my/path/trades.csv
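A file loaded this way is expected to contain one Stream event per line. For illustration, a minimal hypothetical trades.csv matching the events above might look like:

NFLX,441.07,50
AAPL,118.63,100
GOOG,528.48,10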


Or you can use other available tools or APIs to ingest data in real time or in batch. For more information on other ways of ingesting data into CDAP, please refer to the documentation.
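For instance, Stream events can also be sent over HTTP. The sketch below assumes the CDAP v2 REST endpoint and the default router port; check the ingestion documentation for the exact API of your CDAP version (the event shown is illustrative):

$ curl -X POST -d 'FB,74.47,200' http://<hostname>:11015/v2/streams/trades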


You can now examine the contents of your stream by executing a SQL query:

> execute 'select * from cdap_stream_trades limit 5'
+==================================================================================================================+
| cdap_stream_trades.ts: BIGINT | cdap_stream_trades.headers: map<string,string> | cdap_stream_trades.body: STRING |
+==================================================================================================================+
| 1422924257559                 | {}                                             | NFLX,441.07,50                  |
| 1422924261588                 | {}                                             | AAPL,118.63,100                 |
| 1422924265441                 | {}                                             | GOOG,528.48,10                  |
| 1422924291009                 | {"content.type":"text/csv"}                    | GOOG,538.53,18230               |
| 1422924291009                 | {"content.type":"text/csv"}                    | GOOG,538.17,100                 |
+==================================================================================================================+


You can also attach a schema to your stream to enable more powerful queries:

> set stream format trades csv 'ticker string, price double, trades int'
> execute 'select ticker, sum(price * trades) / 1000000 as millions from cdap_stream_trades group by ticker order by millions desc'
+=====================================+
| ticker: STRING | millions: DOUBLE   |
+=====================================+
| AAPL           | 3121.8966341143905 |
| NFLX           | 866.0789117408007  |
| GOOG           | 469.01340359839986 |
+=====================================+


On one of our test clusters, the query above took just about two minutes to complete.

Data in CDAP is integrated with Apache Hive, and the query above translates into a Hive query. As such, it launches two MapReduce jobs to calculate the query results, which is why it takes minutes instead of seconds. To cut down query time, you can use Impala to query the data instead of Hive. Since Streams are written in a custom format, they cannot be queried directly through Impala. Instead, you can create an Adapter that regularly reads Stream events and writes those events into files on HDFS that can then be queried by Impala. You can also do this through the CLI:

> create stream-conversion adapter ticks_adapter on trades frequency 10m format csv schema "ticker string, price double, trades int"


This command creates an Adapter that runs every ten minutes. Each run spawns a MapReduce job that reads the events added to the Stream in the previous ten minutes, writes each event to Avro-encoded files in a file set on HDFS, and registers a new partition in the Hive Metastore. We can then query the contents using Impala. On your cluster, use the Impala shell to connect to Impala:

$ impala-shell -i <impala-host>
> invalidate metadata;
> select ticker, sum(price * trades) / 1000000 as millions from cdap_user_trades_converted group by ticker order by millions desc;
+--------+-------------------+
| ticker | millions          |
+--------+-------------------+
| AAPL   | 3121.88477111439  |
| NFLX   | 866.0568582408006 |
| GOOG   | 469.0081187983999 |
+--------+-------------------+
Fetched 3 row(s) in 1.03s


Since we are using Impala, no MapReduce jobs are launched and the query comes back in a second.
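As the Adapter continues to run, a new partition should be registered roughly every ten minutes. One way to confirm this from the same Impala shell, assuming the converted table name shown above, is to list the table’s partitions (you may need to run invalidate metadata again first so Impala picks up newly registered partitions):

> show partitions cdap_user_trades_converted;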

Now that you have data in CDAP and are able to explore it, you can use CDAP to build real-time and/or batch data applications. For more information on how to build data applications using CDAP, please visit http://docs.cask.co.

