Python for Big Data Programming

Tony Duarte

Tony Duarte is the Training Specialist at Cask -- where he defines, writes, and delivers courses on the Cask Data Application Platform (CDAP). Tony has spent over a decade working with Big Data - using Hadoop and Spark at Cloudera, Hortonworks, and MapR. Before becoming a Big Data Trainer, Tony was a Senior Kernel Engineer doing Operating Systems development and Database Kernel development.

Tony Duarte


Drag ‘n Drop Data Processing Pipeline in CDAP

The Application Development Framework in CDAP lets Developers build entire Applications for Big Data. But perhaps the most exciting development here at Cask is our new plugin architecture, which has allowed us to build a Drag ‘n Drop UI for Apache Hadoop and Apache Spark. Your Spark or Hadoop code becomes a plugin – and that means that you can simply plug-in your code to a Hadoop job that we start, or even plug-in your code to a Spark DStream that CDAP will launch for you.  And this new capability means that you can design your plugins as part of a microservice architecture – independently scalable, stoppable, and replaceable.

Code Walk-Through – A Simple Python Plugin

The diagram above shows how the GUI of CDAP represents a pipeline which uses the Python Evaluator plugin.  In this pipeline, we read some HDFS files produced by a prior Spark job, process them with Python, and then write them to HDFS. Other than the Python code that I typed in, this was created entirely by Drag ‘n Drop.

This is the code I typed into the GUI panel for the Python Evaluator Plugin:

(1)  CDAP has defined a standard contract for all your Python plugins. You receive, in sequence, each record being processed by the mapper. That’s because your Python code is actually co-located with the mapper and wired inside it as it processes each key,value pair. The record is presented to you as a Python dictionary, and will have a structure which corresponds to your configurable data schema. By default, a mapper has offset and body – corresponding to LongWritable and Text of TextInputFormat of a Hadoop job. The contract will be the same if you use this code in a Spark job. For Spark, we use your Python code as a closure to operate on a Spark RDD.  For either Spark or Hadoop, the structure of your data is configurable using a simple GUI configuration pane.

(2)  You also receive an emitter object, to which you can assign a different structure to emit for your returned records. In this sample code, we use the same schema for incoming and outgoing records. The context object you receive allows you to access existing Framework data structures.

(3)  We ship Jython to the YARN cluster, and it will be able to use the Standard Library which ships with Jython. Right now, we don’t have support for dynamically shipping other python libraries, but we plan to add it in the future..

(4)  The CDAP Development Framework redirects standard-out to log files on an edge node. They can be viewed from the GUI, and many people examine the cdap.log file directly when using the SDK – since it is easy to use an editor to find your logs while you are developing.

(5)  This shows the offset (the default Hadoop LongWritable key field) being accessed in a Python dictionary.  The three X’s in the debug output are there just to make them easy to find with an editor search. (The logs can grow quite large – especially if you are on a production cluster.)

(6)  For a MapReduce job the emitter returns control back to the mapper. For a Spark job the emitter is the return from the end of the closure CDAP creates with your Python code. In either case, you use the emitter method to return your processed data.

Deep Dive into the Pipeline – Apache Hadoop Execution

The pipeline executing the code launches several MapReduce jobs on your YARN cluster – assuming that you selected the MapReduce engine rather than the Spark engine to execute your Python pipeline.  CDAP will schedule a MapReduce job for each plugin you’ve dragged onto the Studio GUI.  Each set of MapReduce jobs associated with a plugin is individually scalable from the GUI, making your application adaptable as data volume changes over time.

For Apache Hadoop, CDAP co-locates your Jython code with the mapper on the cluster

What you see in the diagram is a MapReduce job being used to read your input files. Another MapReduce job runs your Python, and another MapReduce job saves your data to HDFS. In practice, the number of mappers created by default is equal to the number of input splits – as is standard for Hadoop.  So you really have multiple mappers at each stage of processing. In general, the CDAP Framework writes out intermediate data to HDFS – specifically, using partitioned files so that multiple readers and writers can scale as necessary. When possible, CDAP will try to minimize the amount of intermediate storage required, though this optimization has been ignored in this example for illustrative purposes. The code in the associated plugins is shipped to the nodes on your cluster – so everything is fully parallelized and local. The design is entirely scalable.

Deep Dive into the Pipeline – Apache Spark Execution

When you deploy your pipeline you are given a choice – use MapReduce or Spark to run the pipeline. The same code you write can be used by either one. If you notice, you wrote the Python code against a specific contract – and CDAP keeps that contract whether you are running with MapReduce, Spark RDDs, or Spark DStreams. CDAP is the first and only framework to provide this kind of flexibility.

For Apache Spark, CDAP runs on a cluster edge node to schedule and monitor jobs

The diagram shows that CDAP runs a daemon on your cluster edge node. That is where CDAP launches your Spark driver. The Spark driver works as it normally would – sending functions as closures to the Executors on your cluster. CDAP takes care of bundling the plugin code – your Jython – and transmitting that to each Executor. CDAP also takes care of packaging your Jython as a closure to operate against a Spark RDD constructed by CDAP.

Your Jython code has effectively become a Spark microservice – an independent service which can be scaled, rebuilt, redeployed, and re-architected as needed. For developers, that means that they can compile a new service and hand it off to Operations – and they can drag and drop it into the GUI.  Here at Cask, we like to think that we are creating the microservice architecture for the future of Big Data – seamlessly integrating Apache Spark, Apache Hadoop, Apache Hive, Apache HBase,YARN and other Big Data technologies, so that developers can focus on solving the business problem – while CDAP seamlessly handles the integration and deployment issues for them.

I encourage you to download the latest CDAP and give it a spin. We actively welcome questions, comments and suggestions. Our user group is a great place to engage with the Cask team and the entire CDAP community.

  • Alan Sanie

    Slick technology and a great article. Thanks.

  • Manzoor

    Hadoop provides platform to store large volumes of data on distributed
    file system which is reliable, flexible, economical and scalable
    solution. There are multiple solutions available to analyse this huge
    data like Mapredue, Hive and Pig to uncover correlations and patterns
    that provides insights on making better business decisions.

    Big data hadoop is an emerging technology and the job opportunity in this field is quite high . so master professional knowledge in hadoop platform to boost your career. Get more details about big data hadoop training and certification process at

<< Return to Cask Blog