How we built it: Making Hadoop data exploration easier with Ad-hoc SQL Queries

Poorna Chandra is a software engineer at Cask where he is responsible for building software fueling the next generation of data applications. Prior to Cask, he developed big data infrastructure at Greenplum and Yahoo!

Please note: Continuuity is now known as Cask, and Continuuity Reactor is now known as the Cask Data Application Platform (CDAP).

We are excited to introduce a new feature added in the latest 2.3 release of Continuuity Reactor – ad-hoc querying of Datasets. Datasets are high-level abstractions over common data patterns. Reactor Datasets provide a consistent view of your data, whether it is processed in batch or in real time. In addition to scans and RPCs on Datasets, there is also a need to explore data through an easy-to-use SQL query interface. This allows you to run ad-hoc interactive queries on Datasets, generate reports on them, and integrate Reactor with the BI tool of your choice. In this blog, we will talk about how this works and how we expose an easy-to-use interface for querying data.

In keeping with our mission of providing simple access to powerful technology, we set out to enable ad-hoc querying of Datasets without adding extra overhead for the user. We accomplished this by integrating the underlying technologies that enable declarative querying into Reactor. One of our key design goals was to introduce this feature without requiring our users to manage additional services or frameworks outside of Reactor. Below we share how we built this system.

Hive as the SQL engine

SQL is a widely adopted standard for ad-hoc querying, and it’s a great way to make it easy for users to interact with Datasets. While there are a number of existing SQL engines in the Hadoop ecosystem, we decided to go with Apache Hive because one of its core components, Hive Metastore (which stores metadata about Hive tables), is also used by other existing SQL engines like Presto and Impala. This means that later integrations of those engines will be easier.

Datasets are non-native, external tables from Hive’s point of view. Hive has a well-defined interface for interacting with non-native tables via Storage Handlers. To integrate Hive with Datasets, we implemented a custom Dataset Storage Handler to manage the interactions. The Dataset Storage Handler is composed of:

  • a DatasetInputFormat that knows how to retrieve Dataset Records from Datasets, and
  • a DatasetSerDe that converts the records into objects that Hive can work with.
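To make the wiring concrete, here is a heavily simplified sketch of how such a storage handler plugs into Hive. It is an illustration under assumptions rather than Reactor's actual source: only the class names above come from the design described here, the method bodies and the property key are hypothetical, and the exact HiveStorageHandler/SerDe signatures vary across Hive versions (this roughly follows Hive 0.12).

```java
// Simplified sketch of a custom storage handler for Dataset-backed, non-native tables.
// DatasetInputFormat and DatasetSerDe are the components described above (skeletons omitted).
// Method bodies and the property key are illustrative assumptions, not Reactor's actual code.
import java.util.Map;
import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
import org.apache.hadoop.hive.ql.plan.TableDesc;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.mapred.InputFormat;

public class DatasetStorageHandler extends DefaultStorageHandler {

  @Override
  public Class<? extends InputFormat> getInputFormatClass() {
    // Knows how to create splits and record readers over a Dataset.
    return DatasetInputFormat.class;
  }

  @Override
  public Class<? extends SerDe> getSerDeClass() {
    // Converts Dataset records into objects that Hive can inspect.
    return DatasetSerDe.class;
  }

  @Override
  public void configureInputJobProperties(TableDesc tableDesc, Map<String, String> jobProperties) {
    // Hand the Dataset name (stored as a table property at table-creation time)
    // to the InputFormat via the job properties. The property key is hypothetical.
    jobProperties.put("reactor.dataset.name",
                      tableDesc.getProperties().getProperty("reactor.dataset.name"));
  }
}
```

A Dataset is then exposed to Hive as an external table registered in the Hive Metastore and declared with a STORED BY clause pointing at this handler, so a plain SELECT over that table ends up reading the Dataset through DatasetInputFormat and DatasetSerDe.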

The data flow between Hive and Datasets is depicted below:

Hive as a Service in Reactor

Typically, a Hive installation consists of a Hive Metastore server that contains the metadata and a Hive Server that runs the user queries. In our case, we can connect to an existing Hive Metastore for metadata. However, connecting to an existing Hive server has the following limitations:

  • Hive Server does not have interfaces that allow access to the complete query life cycle. This limits transaction management while dealing with Datasets.
  • Hive Server runs user code in server space, which makes the service vulnerable to bad code.
  • Hive Server’s HTTP interface does not support security and multi-tenancy.

To overcome these limitations, we created a new HTTP-based Hive Server that addresses the above issues by wrapping Hive's CLIService and integrating it with Reactor Datasets and Tephra, Reactor's transaction engine. We chose to run this Hive Server in YARN containers for proper isolation of user code and resources, as well as for scalability.
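To show the shape of that wrapper, here is a heavily simplified sketch of submitting a query through Hive's embedded CLIService. It is an illustration under assumptions rather than Reactor's actual service: the wrapper class, its constructor, and the transaction steps in the comments are hypothetical, and CLIService signatures differ across Hive versions (this roughly follows the 0.12/0.13 API).

```java
// Minimal sketch of embedding Hive's CLIService behind our own service instead of HiveServer2.
// Only the CLIService calls mirror Hive's API; the surrounding class is a hypothetical stand-in.
import java.util.Collections;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.service.cli.CLIService;
import org.apache.hive.service.cli.OperationHandle;
import org.apache.hive.service.cli.RowSet;
import org.apache.hive.service.cli.SessionHandle;

public class ExploreQueryExecutor {            // hypothetical wrapper class
  private final CLIService cliService;

  public ExploreQueryExecutor(HiveConf hiveConf) {
    // Run CLIService embedded in our own (YARN-hosted) server process.
    this.cliService = new CLIService();
    this.cliService.init(hiveConf);
    this.cliService.start();
  }

  public RowSet execute(String statement) throws Exception {
    // 1. (Reactor-specific, not shown) start a Tephra transaction and expose its
    //    state to DatasetInputFormat through the session or query configuration.
    SessionHandle session =
        cliService.openSession("reactor", "", Collections.<String, String>emptyMap());
    try {
      // 2. Run the user's SQL; owning the full life cycle lets us commit or
      //    invalidate the transaction once the operation completes.
      OperationHandle op =
          cliService.executeStatement(session, statement, Collections.<String, String>emptyMap());
      return cliService.fetchResults(op);
    } finally {
      cliService.closeSession(session);
    }
  }
}
```

Because the wrapper sees the whole query life cycle, the transaction that the Dataset reads run under can be started before the statement executes and resolved once the results are fetched, and running the service in YARN containers keeps user code and resource usage isolated from the rest of Reactor.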

Hive-Reactor integration challenges

Since the Hive jars bundle disparate libraries like Protocol Buffers and Guava, and Reactor uses different versions of these libraries, we ran into recurring classloading issues. We had to make sure that the classpaths of the different components that run Hive in Reactor listed their jars in a particular order. We fixed the classpath order in each of these places (a short configuration sketch follows the list):

  • In Reactor Master: since we only launch Hive service from Reactor Master, it was sufficient to isolate Hive jars in a separate classloader to avoid any library version conflicts.
  • In the YARN container running Hive service: we set the classpath of the container in the right order during its initialization since we have control over it.
  • In the mapreduce jobs launched by Hive in the YARN cluster: we set the Hadoop configuration setting mapreduce.job.user.classpath.first to true so that Reactor jars contained in the hive.aux.jars configuration would come earlier than Hive jars in the classpath of those mapreduce jobs.
  • In the local mapreduce jobs launched in the same YARN container as the Hive service: we changed the way Hadoop classpath was set by modifying the HADOOP_CLASSPATH environment variable. We also set the environment variable HADOOP_USER_CLASSPATH_FIRST to true so that the HADOOP_CLASSPATH content would come earlier than Hive jars in the classpath of the local mapreduce jobs.
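For reference, the last two items boil down to one job-configuration flag and two container environment variables. The snippet below is a minimal, hedged illustration of where those settings live; the jar path is a placeholder and the surrounding YARN container-launch code is omitted.

```java
// Illustrative sketch of the classpath-ordering knobs described above.
// Paths are placeholders; how the environment map reaches the container is omitted.
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class ClasspathOrdering {
  public static void main(String[] args) {
    // MapReduce jobs launched by Hive on the cluster: put the user (Reactor) jars
    // ahead of the framework jars on the task classpath.
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.job.user.classpath.first", true);

    // Local MapReduce jobs inside the Hive-service container: make the
    // HADOOP_CLASSPATH entries win over Hive's own jars.
    Map<String, String> containerEnv = new HashMap<String, String>();
    containerEnv.put("HADOOP_CLASSPATH", "/path/to/reactor/jars/*");  // placeholder path
    containerEnv.put("HADOOP_USER_CLASSPATH_FIRST", "true");
    // containerEnv would then be passed into the YARN ContainerLaunchContext environment.
  }
}
```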

Future work

We are working hard to enable a more fun and productive user experience for data exploration in Hadoop. One of the features that we plan on introducing in the future is a JDBC driver to connect third-party BI tools to Reactor Datasets. This will make our platform even more accessible to all users who want to work with data in Hadoop.

To try out our latest features, including ad-hoc SQL queries, download the Continuuity Reactor 2.3 SDK and check out the developer documentation to get started.

And if this work sounds exciting to you, check out our careers page and send us your resume!
