Cask Tracker Enhanced: Metadata Taxonomy and Data Usage Analytics in CDAP 3.5

Yue Gao and Riwaz Poudyal

Yue Gao and Riwaz Poudyal are summer interns at Cask. Yue is a master’s student at the University of California, San Diego, and Riwaz is a rising junior at Williams College.

Cask Tracker is a self-service CDAP Extension that automatically captures rich metadata and provides users with visibility into how data is flowing into, out of, and within a Data Lake. Tracker was first introduced in CDAP v3.4. Tracker v0.2 has just been released along with CDAP 3.5 and packs a ton of new features.

Dataset Usage Analytics

Once you have more than a handful of datasets in your cluster, it becomes difficult to find the one you’re looking for. By capturing metadata automatically, Tracker simplifies data management. With the new 3.5 release, Tracker helps answer important questions: what data is being accessed, which programs are using it, and what is popular on your cluster. These analytics show exactly which datasets are accessed the most and list the applications and programs that access them. The same metrics can be viewed for individual datasets and streams by clicking the Usage tab in the Tracker UI. The Usage tab also includes an Audit Log histogram, which lets you quickly see how active a dataset has been over a period of time, based on the number of audit messages received.
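To make the idea behind the Audit Log histogram concrete, here is a minimal Python sketch of counting audit messages per time bucket. It is not Tracker's implementation: the function name `audit_histogram`, the one-day default bucket, and the use of plain `datetime` timestamps are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def audit_histogram(timestamps, bucket=timedelta(days=1)):
    """Count audit messages per fixed-width time bucket (illustrative only).

    `timestamps` is an iterable of naive datetime objects, one per audit
    message. Returns a list of (bucket_start, count) pairs sorted by time.
    """
    epoch = datetime(1970, 1, 1)
    counts = Counter()
    for ts in timestamps:
        # Floor each timestamp to the start of the bucket it falls in.
        index = (ts - epoch) // bucket
        counts[epoch + index * bucket] += 1
    return sorted(counts.items())


# Hypothetical audit message times for a single dataset.
messages = [
    datetime(2016, 8, 1, 9, 0),
    datetime(2016, 8, 1, 17, 30),
    datetime(2016, 8, 3, 12, 0),
]
for bucket_start, count in audit_histogram(messages):
    print(bucket_start.date(), count)
```

Buckets with no messages are simply absent from the result, so a chart built on top of this would need to fill in zero counts for quiet days.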

image00

Tracker Meter (Beta)

Tracker Meter lets users quickly evaluate a dataset’s popularity by looking at its Tracker Score. Tracker Score is calculated using three important metrics:

  1. How popular a dataset is
  2. How often a dataset is read from
  3. The time since the last read from the dataset

This feature is still in beta, and the algorithm used to calculate the score will be refined based on user feedback.
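The post does not describe how the three metrics are combined, so the following Python sketch is purely illustrative and should not be read as the Tracker Score algorithm. It assumes the popularity and read-frequency inputs are already normalized to the range 0 to 1, and the weights, the 30-day half-life, and the function name `illustrative_score` are all hypothetical.

```python
def illustrative_score(popularity, read_frequency, days_since_last_read,
                       weights=(0.4, 0.4, 0.2), half_life_days=30.0):
    """Combine three usage signals into a 0-100 score (hypothetical formula).

    popularity, read_frequency: assumed to be pre-normalized to [0, 1].
    days_since_last_read: non-negative number of days.
    """
    if not (0.0 <= popularity <= 1.0 and 0.0 <= read_frequency <= 1.0):
        raise ValueError("popularity and read_frequency must be in [0, 1]")
    if days_since_last_read < 0:
        raise ValueError("days_since_last_read must be non-negative")

    # Recency decays exponentially: it halves every `half_life_days`.
    recency = 0.5 ** (days_since_last_read / half_life_days)

    w_pop, w_read, w_rec = weights
    weighted = w_pop * popularity + w_read * read_frequency + w_rec * recency
    return 100.0 * weighted / sum(weights)


# A dataset read recently outranks an otherwise identical stale one.
print(illustrative_score(0.5, 0.5, days_since_last_read=0))
print(illustrative_score(0.5, 0.5, days_since_last_read=60))
```

A weighted sum with a decaying recency term is only one plausible shape for such a score. Its main appeal here is that each of the three listed metrics maps to one visible term.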

image02

Metadata Management

Tracker has always had the ability to view metadata for your datasets, but new in this release is the ability to edit that metadata. This includes adding and removing user tags as well as updating user properties for a dataset. Now teams can easily update that information without jumping between the CDAP UI and Tracker.

Preferred Tags

Along with adding standard user tags, the new version of Tracker allows users to create Preferred Tags. This special designation only exists inside Tracker and allows teams to standardize their tagging efforts with a common set of terms. Preferred tags have higher priority in searches and tag lists, so users will always see the preferred tag first.

This feature will help promote consistency in tag usage across the CDAP system, enabling easier searching and organization. Users can also conveniently load long lists of preferred tags through the UI, allowing teams to get set up quickly. All of this can be accessed through the Tags menu at the top of the screen.
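The "preferred tags first" ordering can be sketched in a few lines of Python. This is an assumption about how such a ranking might work, not Tracker's code: the function name `order_tags` and the case-insensitive, alphabetical tie-breaking are choices made for this example.

```python
def order_tags(tags, preferred):
    """Return unique tags with preferred ones first (illustrative only).

    Within each group, tags are sorted alphabetically, ignoring case.
    """
    preferred_lower = {tag.lower() for tag in preferred}
    return sorted(
        set(tags),
        # False sorts before True, so preferred tags come first.
        key=lambda tag: (tag.lower() not in preferred_lower, tag.lower()),
    )


# Hypothetical tags on a dataset and a team's preferred list.
print(order_tags(["sales", "pii", "finance", "raw"], preferred={"pii", "raw"}))
```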

image01

Dataset Preview

Also new in this release, users can preview the data inside their datasets from Tracker. There is no need to write and execute queries in order to see the records inside a dataset. Up to 500 records will be shown in the Tracker UI, and if you want to perform more complicated queries, you can use the Jump button to quickly navigate to the explore UI inside CDAP.

We think these new features will greatly increase your productivity and help you find the information you need quickly and simply. Tracker’s goal is to provide a simple and convenient way to manage and visualize the datasets in your cluster. We look forward to adding more features in future updates.

Download the latest CDAP, which is 100% open source, and give these new features a spin. Do reach out to us with any questions and comments you may have!
