Simplifying YARN – Introducing Weave to the Apache Hadoop Community

Nitin Motgi is co-founder of Cask, where he leads engineering. Prior to Cask, Nitin was at Yahoo! working on a large-scale content optimization system externally known as C.O.R.E. He previously held senior engineering roles at Altera and FedEx.

This post is the first in series that introduces Weave, Continuuity’s recent contribution to the Apache Hadoop community.  Through this blog series, we will share our experiences with YARN, the motivation for building Weave, and provide an overview of the Weave architecture along with use cases that show the problems Weave can help you solve.

The Power of YARN

At Continuuity, Apache YARN is an integral part of our products because it offers broad capabilities and a long-term vision for supporting a diverse set of applications and processing patterns. One such product is BigFlow, our real-time, distributed, stream-processing engine. YARN is used to deploy BigFlow applications and support lifecycle management and runtime elasticity. As we have built on YARN, we have come to realize that it is extremely powerful, however leveraging its full capacity is challenging. YARN can be difficult to get started with, cumbersome to test and debug, and complex for building new non-MapReduce applications and frameworks.

YARN is a new framework for writing distributed frameworks and applications. YARN stands for “Yet Another Resource Negotiator” and began as a re-architecting effort for solving issues observed with the previous Hadoop Map-Reduce framework on large distributed clusters – particularly with JobTracker (popularly known as JT). The old version of JobTracker was responsible for two major functions on a distributed cluster:

  • Cluster Resource Management and
  • Application Lifecycle Management for all applications running on the cluster

Because JT was responsible for all applications, the cluster was inherently limited, resulting in wasted resources or limited scalability.

This issue drove the community to re-think and design a better approach for JT responsibilities with the goal of achieving higher resource utilization, improved scalability and a more flexible architecture. In doing so, they decided to have a framework that separated the above-mentioned critical functions. In the new system, cluster resource management is YARN’s purpose and application lifecycle management is moved into user space with support from YARN-provided APIs. Moving application lifecycle management into user space opened the door for supporting alternative application frameworks and data processing paradigms. Now, clusters can be used to concurrently run jobs that access and process data using computational paradigms that are non-MapReduce, like MPI and Streaming. The most popular framework to date built on YARN is MRv2. MRv2 is a framework that allows running MapReduce jobs. There have also been other efforts like Hamster (Open MPI) that are bringing other popular frameworks to Hadoop through YARN.

The Challenge with YARN

YARN is powerful but not always easy to use. Setup is simple but YARN is not easy to build new frameworks or applications with. Building directly against the YARN API is difficult as it is intentionally low-level and verbose. Out of the box, YARN lacks application developer friendly APIs, application state management, service discovery, request routing, application monitoring, application logging and a simple testing framework. Members of the community have identified these gaps, which may hinder wider adoption of YARN. As we ourselves recognized these gaps while building our products on YARN, we decided that for the better of YARN and the community we needed to build Weave.

Weave – The Vision

The vision for Weave is to enable developers to easily harness the power of YARN, with a simple programming model and reusable components for building distributed frameworks or applications. Weave hides the complexity of YARN with a programming model that closely resembles the Java thread model. Weave supports running and managing applications with robust application lifecycle management support, support for long running processes, centralization of logs and metrics on the client, service registration and discovery capabilities. Additionally, Weave applications can be easily tested and run locally in threads or on the cluster. Weave also comes with a built-in Application Master to support simple applications.

We are happy to announce that we are open sourcing Weave under the Apache 2.0 license. Visit http://weave.continuuity.com to find out more.

In next two blogs posts, we will dive into the Weave architecture and use-cases.

Leave a comment