Building a Data Lake on Google Cloud Platform with CDAP

Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company's long-term technology and for driving engineering initiatives and collaboration.

Prior to Cask, Nitin was at Yahoo! working on a large-scale content optimization system externally known as C.O.R.E.

It is no secret that traditional platforms for data analysis, like data warehouses, are difficult and expensive to scale to meet current demands for data storage and compute. And purpose-built platforms designed to process big data often require significant up-front and ongoing investment when deployed on-premises. Cloud computing, by contrast, is the perfect vehicle for scaling to accommodate such large volumes of data economically. While the economics are right, enterprises migrating their on-premises data warehouses, or building a new warehouse or data lake in the cloud, face many challenges along the way: architecting the network, securing critical data, finding the right skill sets to work with the chosen cloud technologies, and figuring out the right set of tools and technologies to create operational workflows that load, transform and blend data.

Technology is nothing. What’s important is that you have a faith in people, that they’re basically good and smart, and if you give them tools, they’ll do wonderful things with them. — Steve Jobs

Businesses are more dependent on data than ever before. Having the right toolsets empowering the right people makes data readily available for better and faster decision-making. One of the most important steps in migrating a data warehouse to the cloud, or building one there, is choosing toolsets that make integration simple and allow teams to focus on solving their business challenges rather than on infrastructure and technology.

In this blog post, we will talk about how CDAP (Cask Data Application Platform) seamlessly integrates with Google Cloud Platform (GCP) technologies to build a data lake in the cloud. We will look at how CDAP helps data management professionals maximize the value of their investments in GCP by integrating more data, in service of their business objective of migrating to or building a data lake on GCP.

CDAP Pipelines (workflows) provide a data orchestration capability for moving, transforming, blending and enriching data. CDAP Pipelines manage the scheduling, orchestration, and monitoring of all pipeline activities, and handle failure scenarios. They offer a collection of hundreds of pre-built connectors, simplified stream processing on top of open-source streaming engines, and new out-of-the-box connectivity to BigTable, BigQuery, Google Cloud Storage, Google PubSub and other GCP technologies. This enables users to integrate nearly any data, anywhere, in a Google Cloud environment.

Governance is an important requirement of any data lake or data warehouse, whether it is deployed on-premises or in the cloud. The ability to automatically capture and index technical, business and operational metadata for any pipelines built within CDAP makes it easy to discover datasets, perform impact analysis, trace the lineage of a dataset, and create audit trails.

So, let’s look at some of the capabilities recently added in CDAP to integrate with Google Cloud Platform technologies.

Data Prep and Pipeline Integration with Google Cloud Storage

At Cask, we believe that seamless workflows optimizing macro user-flows provide a complete and enjoyable experience when working with complex technologies, and let users focus on their business use cases rather than on infrastructure. We have observed firsthand with our customers that doing so yields higher efficiency, reduced operating cost, less user frustration and, ultimately, the democratization of access to data, which unlocks greater value from the data faster. In the spirit of achieving higher efficiency, we decided to first integrate CDAP’s Data Prep capability with Google Cloud Storage.

Google Cloud Storage (GCS) is unified object storage that supports a wide variety of unstructured data in the areas of content distribution, backup and archiving, disaster recovery, and big data analytics, among others. You can use CDAP Pipelines to move and synchronize data into and out of GCS for analytics, applications, and a broad range of use cases. With CDAP, you can quickly and reliably streamline workflows and operations or use the same flow to move your customer or vendor data from Amazon S3, Azure ADLS or WASB into Google Cloud Storage.

CDAP Pipelines provide plugins for integrating with GCS natively, irrespective of whether you are working with structured or unstructured data. They also provide seamless integration with CDAP Data Prep capabilities and make it easy to create a GCS connection to your project, browse GCS, and immediately wrangle your data without having to use code or move to another console.

Watch the screencast below to see how CDAP Data Prep and CDAP Pipelines integrate with GCS.


gcs-connection-mp4-screenshot


This flow from start (configuring GCS) to finish (pipeline deployed) takes around two minutes to build, and not a single line of code was written.

In addition to integration with CDAP Data Prep, the following CDAP plugins are available to work with GCS:

  • GCS Text File Source – A source plugin that allows users to read plain text files stored on GCS. Files can be CSV, tab-delimited, line-separated JSON, fixed-length, etc.
  • GCS Binary File Source – A source plugin that allows users to read files stored on GCS as blobs, such as XML, Avro, Protobuf, image, and audio files.

CDAP Data Prep automatically determines the file type and uses the right source depending on the file extension and the content type of the file. Below is a simple pipeline and configuration associated with GCS Text File Source for your reference.

GCS-Pipeline
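For readers who want to see what such a read looks like outside of CDAP, here is a minimal Python sketch using a recent version of the google-cloud-storage client. The bucket and object names are illustrative assumptions, not part of the pipeline above.

from io import StringIO
import csv

from google.cloud import storage

# Hypothetical bucket and object names; this sketches the kind of read
# a GCS text source performs, independent of CDAP.
client = storage.Client()
blob = client.bucket("my-gcs-bucket").blob("raw/customers.csv")

text = blob.download_as_text()            # fetch the object as a string
for record in csv.DictReader(StringIO(text)):
    print(record)                          # one dict per CSV row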

Google BigQuery Integration

Another important component of the Google Cloud Platform is Google BigQuery, a serverless, fully-managed, petabyte-scale data warehouse that empowers enterprises to execute all of their data warehousing operations with high concurrency. With CDAP’s native Google BigQuery connector, Spark, Spark Streaming, and MapReduce jobs can rapidly load massive amounts of data into BigQuery. CDAP’s support for nested and complex schemas allows diverse data types to be analyzed in BigQuery efficiently. The schemas of the dataset tables are seamlessly made available to users while configuring the plugins, and new tables can be created within datasets without additional effort.

BigQuery-Pipeline

The above pipeline reads the New York Trips Dataset (available as a public dataset on Google BigQuery), performs some transformations and calculations on the cluster, and writes the results back into Google BigQuery. This example may not reflect a real use case, since BigQuery SQL alone could do the same work; it is shown for demonstration purposes only, to illustrate that sources and sinks for Google BigQuery are available to read from and write to.

These BigQuery plugins simplify importing metadata from BigQuery and automatically create tables with the right schema based on the pipeline schema.
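For comparison, the same kind of summarize-and-write-back job could be expressed directly against BigQuery with its Python client. The sketch below is illustrative only: the destination table, public-dataset table, and column names are assumptions, and this is not how the CDAP plugins are implemented.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table; table and column names below are
# illustrative and should be adjusted to the dataset actually used.
job_config = bigquery.QueryJobConfig(
    destination="my-project.analytics.trip_summary",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
    SELECT passenger_count, COUNT(*) AS trips, AVG(trip_distance) AS avg_distance
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2016`
    GROUP BY passenger_count
"""
client.query(sql, job_config=job_config).result()  # waits for completion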

Google PubSub Integration

Google PubSub is a fully-managed real-time messaging service that lets you ingest data from sensors, logs, and clickstreams into your data lake. CDAP’s support for Spark Streaming, Kafka, MQTT, and native connectivity to Google PubSub makes it easy to combine historical data with real-time data for a complete 360-degree view of your customers, and to move data between on-premises systems and the cloud.

Following is a simple real-time CDAP Data Pipeline that pushes data from on-premises Kafka up to Google PubSub. Once published, the data is immediately available to be consumed for further transformation and processing.


Google-Publisher-from-Kafka
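For context, a hand-rolled equivalent of that forwarding step, reading from Kafka and publishing to PubSub in Python, might look roughly like the following sketch; the Kafka broker, topic, and project names are hypothetical, and CDAP performs the same movement without any code.

from kafka import KafkaConsumer            # kafka-python
from google.cloud import pubsub_v1

# Hypothetical names: an on-premises Kafka topic "transactions" reachable at
# kafka-onprem:9092, forwarded to a Pub/Sub topic of the same name.
consumer = KafkaConsumer("transactions", bootstrap_servers=["kafka-onprem:9092"])
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "transactions")

for message in consumer:
    # Forward each Kafka record's payload to Pub/Sub unchanged.
    publisher.publish(topic_path, data=message.value).result()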

Use cases

EDW Offload | Oracle CDC to Google BigTable

Over the past decades, enterprises have installed appliances and other pre-configured hardware for data warehousing. The goal of these solutions, which often required heavy investment in proprietary technology, was to make it easier to manage and analyze data. However, recent advancements in open source technology that provide less expensive ways to store and process massive amounts of data have broken down the enterprise walls, allowing enterprises to question the cost of expensive hardware. This time, instead of replacing legacy systems with new hardware, enterprises are looking to move to the cloud to build their data lakes when it makes sense for them. But the right tooling is needed to support the many possible use cases of a data warehouse in the cloud. Four things are needed to efficiently and reliably offload data from an on-premises data warehouse to the cloud:

  • Ease of loading data and of keeping it updated,
    • The ability to migrate all data from a warehouse to the cloud in a one-time operation
    • The ability to continually keep data in sync between an on-premises warehouse and a cloud warehouse (lift and shift is not always viable, as some applications may still rely on the on-premises warehouse)
  • Query tools that support fast queries on small as well as large datasets,
  • Support for high concurrency without degradation in performance, and
  • A custom reporting and dashboard tool.

Google BigTable in combination with Google BigQuery supports bulk loads and upserts, along with the ability to query the loaded data at scale. For reporting and dashboards, Google Data Studio or other popular BI tools can be used in combination with Google BigQuery to satisfy many reporting needs.

Now, the main problem is how an enterprise can efficiently offload data from its on-premises warehouses into BigTable and keep the data in BigTable in sync. To support the EDW Offload to BigTable use case, CDAP provides capabilities to perform Change Data Capture (CDC) on relational databases, along with data pipelines and plugins for consuming the change data events and updating the corresponding Google BigTable instance to keep the data in sync. The change data capture solutions can use one of three approaches for capturing changes in the source databases:

  1. Reading the transactional log of the source database via Oracle GoldenGate,
  2. Querying REDO logs via Oracle LogMiner, or
  3. Using Change Tracking to track changes in SQL Server

The first approach reads the database transactional logs and publishes all the DDL and DML operations into Kafka or Google PubSub. A real-time CDAP Data Pipeline consumes these changesets from Kafka or Google PubSub, normalizes them, and performs the corresponding insert, update, and delete operations on BigTable using the CDC BigTable Sink plugin.


CDC-BigTable-Architecture

Following is a pipeline that reads the changesets from a streaming source and writes them to BigTable, recreating all of the table updates and keeping the tables in sync.

BigTable-CDC-Pipeline
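To make the sink's behavior more concrete, here is a rough Python sketch of applying a normalized change event to BigTable with the google-cloud-bigtable client. The project, instance, table, column family, and event format are assumptions for illustration only, not the CDC BigTable Sink plugin's actual implementation.

from google.cloud import bigtable

# Hypothetical project, instance, table, and column-family names.
client = bigtable.Client(project="my-project", admin=False)
table = client.instance("cdc-instance").table("customers")

def apply_change(event):
    """Apply one normalized change event (insert/update/delete) to BigTable."""
    row = table.direct_row(event["primary_key"].encode("utf-8"))
    if event["op"] == "DELETE":
        row.delete()                       # remove the whole row
    else:
        # Inserts and updates are both upserts in BigTable.
        for column, value in event["columns"].items():
            row.set_cell("cf", column.encode("utf-8"), str(value).encode("utf-8"))
    row.commit()

apply_change({"op": "UPDATE", "primary_key": "cust-42",
              "columns": {"city": "Palo Alto"}})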

To query the data from BigQuery, add the BigTable tables as external tables. More information on how to do this is available here.

BigTable-BigQuery-SQL
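Once the external table is defined, it can be queried like any other table. Below is a minimal sketch using the BigQuery Python client; the external table name is a hypothetical placeholder for whatever name you registered.

from google.cloud import bigquery

client = bigquery.Client()

# "warehouse.customers_external" is a hypothetical external table that has
# already been defined over the BigTable table, as described above.
rows = client.query(
    "SELECT * FROM `my-project.warehouse.customers_external` LIMIT 10"
).result()
for row in rows:
    print(row)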

CDAP CDC is a proprietary add-on. Please contact us if you are interested in evaluating it.

Moving between Clouds | Amazon to Google and vice-versa

There are multiple reasons why an enterprise might decide to migrate from one public cloud platform to another, or to use more than one cloud provider. One reason might be that a different public cloud provider offers better pricing, or a better match in terms of services offered. Another common case is an enterprise that has recently gone through a merger, where the acquirer already has a preferred public cloud provider. Regardless of the reason, one way to ease migration or to support more than one cloud is to start with a multi-cloud data management platform that integrates with multiple cloud environments. By using a multi-cloud data management solution such as CDAP, you can create an abstraction that hides the underlying cloud differences and allows simple migration of workflows and data. Adopting such a platform from the get-go is extremely valuable in a hybrid cloud environment, where you may be managing on-premises, (hosted) private, and public clouds.

Building workflows that can efficiently and reliably migrate data from one public cloud store to another is simple with CDAP Pipelines. Following is an example that shows how data from Amazon S3 can be migrated into GCS and, along the way, transformed and stored in Google BigQuery.

Amazon-S3-to-GCS-And-BigQuery

After the pipeline is executed, the results are available both on GCS and within BigQuery.

S3-to-BigQuery-Transactions
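For a sense of what the migration does under the hood, a hand-rolled single-object copy from S3 to GCS in Python might look like the sketch below; the bucket and object names are hypothetical, and CDAP Pipelines perform this kind of movement at scale without any code.

import boto3
from google.cloud import storage

# Hypothetical bucket and object names on both sides.
s3 = boto3.client("s3")
gcs_bucket = storage.Client().bucket("my-gcs-bucket")

# Stage the S3 object locally, then upload it to GCS.
local_path = "/tmp/transactions.csv"
s3.download_file("my-s3-bucket", "exports/transactions.csv", local_path)
gcs_bucket.blob("imports/transactions.csv").upload_from_filename(local_path)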

AI Integration | Translating audio files using Google Speech Translator

Transcription is the best way to convert recorded audio into highly accurate, searchable and readable text. Being able to index and search through audio content helps your users find relevant content; it can be used to boost organic traffic, improve accessibility, and enhance your AI so you can provide better service to your customers.

Let’s say your company offers customer support services, and you record random customer conversations to get better insight into how representatives handle calls and to improve the quality of service. The first step in improving the service is to transcribe the recorded audio files into digitized, readable text. The text can then go through various AI / ML workflows to determine the mood of the call, customer sentiment, resolution latency, and more.

Google Cloud Speech API uses powerful neural network models to convert audio to text. It recognizes over 110 languages and variants, to support your global user base.
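For context, a single direct call to the Speech API with a recent version of the google-cloud-speech Python client looks roughly like the sketch below; the GCS path, encoding, and sample rate are illustrative assumptions that must match your recordings. CDAP's integration, described next, removes the need to write and operate this kind of code at scale.

from google.cloud import speech

client = speech.SpeechClient()

# Hypothetical GCS path; encoding and sample rate must match the recording.
audio = speech.RecognitionAudio(uri="gs://my-gcs-bucket/audio/raw/audio.raw")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    best = result.alternatives[0]
    print(best.confidence, best.transcript)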

Google Cloud Platform technologies and CDAP together provide users with an integrated, scalable and code-free way to transcribe massive amounts of recorded audio files. This integration allows users to build pipelines that can be scheduled and monitored with ease, for any production deployment, in minutes to hours rather than weeks or months.

Below is a simple CDAP Pipeline that takes raw audio files stored on Google Cloud Storage, passes them through the Google Speech Translator plugin, and writes the transcribed text to another location on Google Cloud Storage.

Google-Speech-Translator-Pipeline

The Google Speech Translator CDAP plugin is ready to go with minor configuration, depending on the type of files being recorded. In the example above, the transcription applied to the raw audio file generates JSON output that describes the file that was transcribed, along with the computed confidence for the transcription.

{
  "path": "/audio/raw/audio.raw",
  "speeches": [
    {
      "confidence": 0.9876289963722229,
      "transcript": "how old is the Brooklyn Bridge"
    }
  ]
}

Conclusion

Based on our own experience using GCP for cloud-based software development, and on feedback from customers and prospects, we believe Google Cloud Platform offers tremendous value when it comes to reliability, scalability, and cost. Combining the large-scale infrastructure provided by GCP with Cask’s big data integration solutions allows you to focus more on your data and use cases than on infrastructure and technology, while providing the operational support needed to efficiently manage data analytics projects on GCP.

Our current level of integration with GCP is just a start; future work will focus on integration with Stackdriver for logs and metrics, Google Container Engine, Apache Beam, and much more.

If you are using or evaluating GCP and are looking for ways to improve your integration on GCP, you may want to try out CDAP on Dataproc and install the GCP connectors from Cask Market today. If you are specifically looking for Data Prep integration with GCP, please check back soon; it will be available in the upcoming CDAP v4.3.2 release, available later this month.
