Stream Views in CDAP

Alvin Wang

Alvin Wang is a software engineer at Cask where he is building software fueling the next generation of Big Data applications. Prior to Cask, Alvin developed real-time processing systems at Electronic Arts and completed engineering internships at Facebook, AppliedMicro, and Cisco Systems.

Alvin Wang

In a previous blog post, we outlined how schema-on-read works with streams. Schema-on-read features allows users to decouple data ingestion from exploration. In this post, we will see how users can attach multiple views on the same stream using a feature called stream views.

Stream views provide a way to read from the same stream using different schemas. This is useful when you want to interpret the same raw data in multiple ways. For example:

  1. An admin wants to expose different parts of the same data to different stakeholders
  2. An admin wants to expose different views of the same data to different stakeholders
  3. A stakeholder wants to only consume a small part of a data stream for performance reasons, to avoid parsing entire rows of data

Streams

To explain stream views, we will first provide some background about streams. Streams are the primary means of bringing data from external systems into CDAP in real-time. They are ordered and time-partitioned sequences of data, usable for both real-time and batch collection and consumption of data.

A stream can be associated with a schema which will create an external Hive table that is associated with the stream. Thereafter, when exploring a stream that has a schema, you will see the stream events interpreted into more useful, schema-based rows.

Stream Views

A stream view is a read-only, schema-based, view of a stream. This feature is intended to replace the single schema associated with a stream, effectively allowing for multiple schemas to be associated with a single stream. When a stream view is created, CDAP creates a Hive table that reads from the associated stream and uses the schema belonging to the stream view.

Stream Views Without CDAP

To support stream views without CDAP, assuming that a system similar to CDAP streams is already implemented, you would first need to implement a schema system that supports Avro, Grok, CSV, and TSV formats. Then you would need to implement and run a service that manages stream views. The service would need to manage and store stream view metadata, including the associated stream and schema. Upon creating or modifying a stream view, there would need to be code in your service that creates a Hive table that supports reading from a stream using a specific schema. This would mean creating a custom implementation of Hive’s DefaultStorageHandler that supports your stream implementation.

After you’re done implementing the stream view service, you would likely need some way to run multiple instances of your service, provide a way to manage the lifecycle of your service, and expose your service logs through a common endpoint.

Implementation

To implement stream views, we started with a store for stream views, which would be used to store metadata about stream views, and an implementation that uses a system dataset:

public interface ViewStore {
  boolean createOrUpdate(Id.Stream.View viewId, ViewSpecification config);
  boolean exists(Id.Stream.View viewId);
  void delete(Id.Stream.View viewId) throws NotFoundException;
  List<Id.Stream.View> list(Id.Stream streamId);
  ViewDetail get(Id.Stream.View viewId) throws NotFoundException;
}
 
public final class MDSViewStore implements ViewStore {
  ..
}

I then created an HTTP handler and a view admin for managing stream views:

@Path(Constants.Gateway.API_VERSION_3 + "/namespaces/{namespace}")
public class StreamViewHttpHandler extends AbstractHttpHandler {
 
  @PUT
  @Path("/streams/{stream}/views/{view}")
  public void createOrUpdate() {
    ..
  }
 
  @DELETE
  @Path("/streams/{stream}/views/{view}")
  public void delete() {
    ..
  }
 
  @GET
  @Path("/streams/{stream}/views")
  public void list() {
    ..
  }
 
  @GET
  @Path("/streams/{stream}/views/{view}")
  public void get() {
    ..
  }
}
 
public class ViewAdmin {
  private final ViewStore store;
 
  public boolean createOrUpdate(Id.Stream.View viewId, ViewSpecification spec) {
    ..
  }
 
  public void delete(Id.Stream.View viewId) {
    ..
  }
 
  public List<Id.Stream.View> list(Id.Stream streamId) {
    ..
  }
 
  public ViewSpecification get(Id.Stream.View viewId) throws NotFoundException {
    ..
  }
}

In the implementation of ViewAdmin.createOrUpdate(), in addition to recording the metadata of the stream view, we make a request to one of our internal HTTP services to create or recreate the Hive table associated with the stream view.

Future of Streams and Stream Views

In future work, we plan to focus on (1) allowing flows and MapReduce programs to read structured data from a stream or stream view based on the associated schema; (2) supporting user-defined plugins for stream formats; and (3) exposing more complex features targeted toward stream views, such as joining multiple columns into one, or splitting one column into multiple.

Summary

You have now learned how stream views work: They enable reading raw stream data with multiple schema interpretations and are used to provide focused views on the same raw data.

Learn more about CDAP: http://cdap.io/

 

<< Return to Cask Blog