Efficient Use of Hadoop Cluster with YARN Capacity Scheduler

Sreevatsan Raman is the Head of Engineering at Cask where he is driving the company's engineering initiatives. Prior to Cask, Sree designed and implemented big data infrastructure at Klout and Yahoo!

As organizations see an increase in Hadoop adoption, there is a spike in both the number of jobs that are run on a Hadoop cluster, as well as the number of tenants utilizing the cluster. Effectively utilizing a Hadoop cluster becomes important from an administration perspective. Consolidating data and allowing multiple tenants to share a cluster reduces the operational cost and is resource efficient compared to managing multiple smaller clusters.

With the introduction of YARN to the Hadoop ecosystem, a variety of applications can be run on a Hadoop cluster. Allocating cluster resources for different applications in a multi-tenant cluster is achieved using resource schedulers – Capacity Scheduler and Fair Scheduler. In this post, we will take a look at the Capacity Scheduler and how it enables sharing cluster resources.

Capacity Scheduler

The Capacity Scheduler is designed to share the cluster resources using queues. The capacity of each queue specifies a percentage of cluster resources that are available for applications submitted to the queue. The queues can be set up in a hierarchy that reflects the resource requirements of an organization. A queue named root is at the top of the hierarchy.

For example, if an admin wants to share a Hadoop cluster between analytics and personalization dev teams equally, the admin can configure two different queues (analytics and personalization) to use a percentage of the cluster resources. Each team would submit their YARN application using its own queue.

To configure the queues between analytics and personalization, use the following setting (conf/capacity-scheduler.xml).

<property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>analytics,personalization</value>
    <description> Configure queues for dev and qa </description>
</property>

Once the queues are set up, configuring capacity is pretty straightforward. Update the Capacity Scheduler config to match the resources needs. In the sample config below, the cluster capacity is split 50-50 between the two queues (conf/capacity-scheduler.xml)

<property>
    <name>yarn.scheduler.capacity.root.analytics.capacity</name>
    <value>50</value>
    <description>Queue capacity for analytics team.</description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.personalization.capacity</name>
    <value>50</value>
    <description>Target Queue capacity for personalization.</description>
</property>

The configuration above specifies minimum compute capacity for analytics and personalization queues with 50% each. If the cluster is not utilized at max capacity, any idle capacity can be taken by applications submitted to either queue.

Once the Capacity Scheduler is configured, the queue property can be set to submit jobs in the specific queue.

// Create the configuration object
Configuration conf = new Configuration();
// Set the property
conf.set(“mapreduce.job.queuename”,  “analytics”);
// Submit
JobID id = AnalyticsMapReduceJob.submit(“My analytics job”, conf);

Now let’s take a look at another example of how Capacity Scheduler is used in CDAP, an open-source data application platform that runs on Hadoop, to share resources of the cluster between different tenants.

Example: queue per namespace in multi-tenant CDAP

Namespaces in CDAP allow users to isolate data and applications on a shared CDAP cluster. The idea is for each tenant in a multi-tenant environment to create their own namespace to isolate their data and programs.

In CDAP each namespace can be configured to submit programs to different queues. To do that you need to add a namespace level property scheduler.queue.name using the REST API:

PUT host:port/v3/namespaces/personalization/properties {“config”: {“scheduler.queue.name”: “root.personalization”}}

After setting the property for the personalization namespace, all the jobs that are run in personalization namespace will be submitted to the personalization queue. The compute resources in YARN will be guaranteed as per queue configuration to all the programs in the namespace.

 

Interested in tackling similar problems to help more people develop applications on Hadoop? Join our mission!

  • Ratnesh

    Thanks for your post! Hadoop can be deployed on a variety of scales. There quirements at each of these will be different. Hadoop has a largenumber of tunable parameters that can be used to influence itsoperation. Furthermore, there are a number of other technologies whichcan be deployed with Hadoop for additional capab

<< Return to Cask Blog