Cask Data Application Platform (CDAP) simplifies Big Data application development by abstracting many of Hadoop’s complexities and enabling developers to use familiar skills. We found that one of the best ways to simplify distributed programs is to have exactly-once processing semantics. Having exactly-once processing makes it easy to reason about the state of the system and simplifies application development. Developers can concentrate on writing code that matters to their use case rather than worrying about distributed coordination.
CDAP uses Apache HBase™ extensively as the default storage engine for application data and metadata. While HBase is massively scalable and highly concurrent, it limits its consistency guarantees to achieve that performance: it offers strong consistency only for updates on a single row or for batch updates within a single region. Implementing exactly-once semantics on HBase without consistency guarantees for updates across regions, across tables, or across RPC calls is highly challenging and error-prone. Apache Tephra fills this gap by providing ACID transactions on HBase.
Apache Tephra is a transaction engine for Apache HBase and other distributed stores that supports multi-versioning and rollback. Tephra implements transactions with snapshot isolation, using HBase’s native data versioning to provide multi-version concurrency control (MVCC) for transactional reads and writes. With MVCC, each transaction sees its own consistent “snapshot” of the data, which isolates concurrent transactions from one another. Combined with conflict detection and handling, MVCC enables optimistic concurrency control.
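To make the idea concrete, here is a minimal in-memory sketch of snapshot isolation with optimistic conflict detection. It is not Tephra's API or implementation: the names `ToyTxStore`, `Tx`, and `ConflictError` are hypothetical, a plain dictionary of versions stands in for HBase's cell versioning, and the conflict rule (abort if an overlapping write was committed after the transaction started) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field


class ConflictError(Exception):
    """Raised when a commit overlaps with a write committed after the tx began."""


@dataclass
class Tx:
    tx_id: int            # also used as the version of every value this tx writes
    excluded: frozenset   # ids of transactions still in progress when this tx began
    writes: set = field(default_factory=set)


class ToyTxStore:
    """Hypothetical toy store; a stand-in for a multi-versioned table."""

    def __init__(self):
        self._next_id = 1
        self._in_progress = set()
        self._cells = {}     # key -> {version: value}
        self._commits = []   # (commit_id, keys written) for finished transactions

    def begin(self):
        tx = Tx(self._next_id, frozenset(self._in_progress))
        self._next_id += 1
        self._in_progress.add(tx.tx_id)
        return tx

    def get(self, tx, key):
        versions = self._cells.get(key, {})
        # Visible: the tx's own write, or an older version whose writer had
        # already finished when this tx began (so it is not in `excluded`).
        visible = [v for v in versions
                   if v == tx.tx_id or (v < tx.tx_id and v not in tx.excluded)]
        return versions[max(visible)] if visible else None

    def put(self, tx, key, value):
        self._cells.setdefault(key, {})[tx.tx_id] = value
        tx.writes.add(key)

    def commit(self, tx):
        # Optimistic check: nothing is locked up front; conflicts surface here.
        for commit_id, keys in self._commits:
            if commit_id > tx.tx_id and keys & tx.writes:
                self.rollback(tx)
                raise ConflictError(f"tx {tx.tx_id} conflicts on {sorted(keys & tx.writes)}")
        self._commits.append((self._next_id, frozenset(tx.writes)))
        self._next_id += 1
        self._in_progress.discard(tx.tx_id)

    def rollback(self, tx):
        # Undo by deleting the versions this tx wrote.
        for key in tx.writes:
            self._cells[key].pop(tx.tx_id, None)
        self._in_progress.discard(tx.tx_id)


store = ToyTxStore()
setup = store.begin()
store.put(setup, "balance", 100)
store.commit(setup)

t1, t2 = store.begin(), store.begin()
store.put(t1, "balance", 150)
seen_before_commit = store.get(t2, "balance")   # t1 is uncommitted
store.commit(t1)
seen_after_commit = store.get(t2, "balance")    # t2 keeps its snapshot

store.put(t2, "balance", 120)
try:
    store.commit(t2)
    conflict_detected = False
except ConflictError:
    conflict_detected = True

final_value = store.get(store.begin(), "balance")
```

In this sketch, `t2` keeps reading the value from its snapshot even after `t1` commits, and its own write to the same key is rejected and rolled back at commit time. A real engine also has to persist transaction state and survive failures, which this toy ignores.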
Early on, we realized that Tephra has applications beyond core CDAP and decided to make it an open source project under the Apache License 2.0. As we saw adoption from projects like Apache Phoenix, it was only natural to move Tephra to the Apache Software Foundation (ASF). Being part of the Apache ecosystem will help in building a diverse developer community. It will also make it easier for other Apache projects to adopt Tephra.
Tephra is being used in production at many companies today as part of CDAP, and we are confident it will see wider adoption as part of Apache Phoenix. We are working on the next set of features that will make Tephra more scalable and also improve its operational aspects. If this has piqued your interest, and you wish to use or contribute to Tephra, here are some resources to get you started:
- The Tephra GitHub repository contains both the documentation and the source code.
- Subscribe to the Tephra mailing list. The list is archived here.
- A talk by James Taylor on integrating Apache Phoenix with Tephra.
- A few blog posts about Tephra.
If you don’t find all the information you are looking for, send an email to the mailing list and we will be happy to help you get started. We always welcome new contributors.
We will be at HBaseCon on May 24th, 2016, where you can find us at the Cask booth (#E3). Come talk to us if you want more information on Tephra, or on Big Data application development in general. We will also give a short talk on Tephra at HBaseCon as part of the broader Apache Phoenix talk, “Apache Phoenix: Use Cases and New Features”. In addition, we will be speaking at the first PhoenixCon on May 25th, 2016 in San Francisco, where we will go into more detail on the internals of Tephra. We hope to see you at both events!