AeroCask – Real-time Flight Data Analytics using CDAP

Gokul Gunaskeran is a software engineer at Cask where he is building software to enable the next generation of data applications. Prior to Cask, he worked on Architecture Performance and Workload Analysis at Oracle.

One of the many things I love about Cask is the hackathon before every release. It is not only a way for us to dog-food new features in the CDAP platform, but also an opportunity to let our imaginations run loose and implement an integration with another system, develop an interesting application using the platform, or even build a missing product feature that would greatly improve the CDAP user experience or developer productivity at Cask.

So for the hackathon before the CDAP v3.1 release, my team implemented a project, AeroCask. I learnt quite a few things during this two-day project (with some work over the weekend as well). CDAP, in short, is a platform for developing applications for the Hadoop ecosystem (similar to WebLogic for Java EE apps).

Most aircraft in the US and around the world are equipped with an ADS-B transponder, through which they broadcast their identification, speed, altitude, latitude/longitude, heading, and so on. This data can be received and processed using a simple ADS-B receiver (for example, the NooElec ADS-B Receiver Set) connected to a Raspberry Pi running dump1090.

Data from dump1090

Since this looked like an easy source of real-time data (real data!), we started thinking about the inferences we could possibly make from it. One simple and straightforward thing to do would be geographic tracking of flights. For simplicity, we decided to track the latitude and longitude positions of aircraft over fixed time intervals: bucket the latitude/longitude and store the counts per fixed time period. I was also curious to try out some graph analysis. For example, if I know the source and destination airports of an aircraft, I can pose graph questions such as the shortest number of hops between two airports, or the airport with the most departing flights (outgoing edges).
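To make the bucketing idea concrete, here is a minimal sketch of the kind of key we had in mind: round a position to a coarse grid cell and align its timestamp to a fixed interval, so observations can be counted per (cell, time window). The grid size, interval, and key format below are illustrative, not the actual AeroCask code.

```java
import java.util.concurrent.TimeUnit;

public class GeoBucketer {

  private static final double GRID_SIZE_DEGREES = 0.1;           // roughly 11 km cells
  private static final long BUCKET_MILLIS = TimeUnit.MINUTES.toMillis(5);

  /** Builds a key of the form "<latCell>:<lonCell>:<timeBucket>" for a position and time. */
  public static String bucketKey(double latitude, double longitude, long timestampMillis) {
    double latCell = Math.floor(latitude / GRID_SIZE_DEGREES) * GRID_SIZE_DEGREES;
    double lonCell = Math.floor(longitude / GRID_SIZE_DEGREES) * GRID_SIZE_DEGREES;
    long timeBucket = (timestampMillis / BUCKET_MILLIS) * BUCKET_MILLIS;
    return String.format("%.1f:%.1f:%d", latCell, lonCell, timeBucket);
  }

  public static void main(String[] args) {
    // A flight near SFO falls into one 5-minute grid bucket.
    System.out.println(bucketKey(37.6213, -122.3790, System.currentTimeMillis()));
  }
}
```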

I ordered the hardware components and received them overnight (the component list and assembly instructions are available here). Meanwhile, we got started on the application development. We decided to write a script to run on the Raspberry Pi that would hit a stream HTTP endpoint to push the data (streams provide a way to ingest data into your application: an HTTP interface over HDFS storage). The data from the Pi was going to be in XML format, so we quickly wrote an XML parser Flowlet whose source was the stream. A Flowlet is a node in the DAG of a stream-processing Flow. The parsed XML was converted to a POJO and sent to two different Flowlets, GeoWriter and Neo4jWriter.
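As a rough illustration, here is what such a parser Flowlet could look like, assuming the CDAP 3.x flow API (co.cask.cdap.api.flow.flowlet.*); the Flight POJO, its fields, and the parsing step are placeholders rather than the actual AeroCask sources.

```java
import co.cask.cdap.api.annotation.ProcessInput;
import co.cask.cdap.api.flow.flowlet.AbstractFlowlet;
import co.cask.cdap.api.flow.flowlet.OutputEmitter;
import co.cask.cdap.api.flow.flowlet.StreamEvent;
import java.nio.charset.StandardCharsets;

public class XmlParserFlowlet extends AbstractFlowlet {

  // Parsed flight records flow downstream to the GeoWriter and Neo4jWriter Flowlets.
  private OutputEmitter<Flight> out;

  @ProcessInput
  public void process(StreamEvent event) {
    String xml = StandardCharsets.UTF_8.decode(event.getBody()).toString();
    Flight flight = parse(xml);
    if (flight != null) {
      out.emit(flight);
    }
  }

  private Flight parse(String xml) {
    // ... extract ident, latitude, longitude, altitude, speed, heading
    //     from the XML record pushed by the Pi ...
    return null; // parsing left out of this sketch
  }

  /** Hypothetical POJO for a single position report. */
  public static class Flight {
    String ident;
    double latitude;
    double longitude;
    int altitude;
  }
}
```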

The GeoWriter Flowlet wrote the data to a GeoCounts dataset. GeoCounts, a custom CDAP Dataset (an abstraction over data storage in HBase), kept track of the latitude/longitude of flights over fixed time intervals (5-minute buckets). The Neo4jWriter Flowlet used Neo4j-OGM (similar to an RDBMS ORM) to connect to a Neo4j server and added relationships between new flights and their source/destination airports. We also used a CounterTimeseriesTable, a built-in CDAP dataset, to keep track of flight counts per minute. We were able to develop and unit test the end-to-end application on our laptops even before the actual hardware arrived.
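The writer side could then look roughly like the sketch below, assuming the Flight POJO from the previous sketch, @UseDataSet injection, and the CounterTimeseriesTable increment API; GeoCounts and its recordPosition() method are hypothetical stand-ins for whatever the custom dataset actually exposed.

```java
import co.cask.cdap.api.annotation.ProcessInput;
import co.cask.cdap.api.annotation.UseDataSet;
import co.cask.cdap.api.dataset.lib.CounterTimeseriesTable;
import co.cask.cdap.api.flow.flowlet.AbstractFlowlet;
import java.nio.charset.StandardCharsets;

public class GeoWriterFlowlet extends AbstractFlowlet {

  @UseDataSet("geoCounts")
  private GeoCounts geoCounts;                  // custom dataset backed by HBase

  @UseDataSet("flightCounts")
  private CounterTimeseriesTable flightCounts;  // built-in time series counters

  @ProcessInput
  public void process(Flight flight) {
    long now = System.currentTimeMillis();
    // Count position reports per aircraft so they can be read back per minute.
    flightCounts.increment(flight.ident.getBytes(StandardCharsets.UTF_8), 1L, now);
    // Record the bucketed latitude/longitude for the current 5-minute window.
    geoCounts.recordPosition(flight.latitude, flight.longitude, now);
  }
}
```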

AeroFlow – Flowlet connectivity DAG

When the hardware (Raspberry Pi, ADS-B receiver) arrived, I assembled it and voilà, it worked seamlessly. That is a first for any hardware project I have worked on! Well, to be honest, all the hardware work I did was connecting the receiver to the Pi's USB port and plugging in the Ethernet cable, power, and SD card (with the PiAware image). But hey, that is some hardware for a big data engineer! When we started looking at the data received on the Pi, I was surprised at the quality of the data and the number of flights it was able to track, given that the antenna was kept inside a conference room with no direct view of the sky.

Raspberry Pi setup to receive ADS-B data

Looking at the data, I realized that the information we received from the flights didn't contain source/destination airport codes. We would have to purchase access to a third-party API to get that information. To move ahead with the hackathon project, we decided to fabricate that info (randomly choose from a list), though it should be fairly straightforward to plug in lookup logic to enrich the data. We also had to write a script that runs on the Pi to filter out duplicate data without holding too much data in memory for the comparison. The script hit the stream endpoint every second with new flight data.
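The dedup-and-push idea, sketched in Java for illustration (the actual script on the Pi was not Java, and the CDAP host, port, stream name, and message-key format below are assumptions): a bounded, access-ordered LinkedHashMap acts as an LRU "recently seen" set so duplicates are dropped without unbounded memory growth, and each new report is POSTed to the stream's REST endpoint (POST /v3/namespaces/&lt;ns&gt;/streams/&lt;stream&gt; in CDAP 3.x).

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FlightPusher {

  // Assumed host/port and stream name for this sketch.
  private static final String STREAM_URL =
      "http://cdap-host:10000/v3/namespaces/default/streams/flights";
  private static final int MAX_RECENT = 10_000;

  // Keeps only the most recently seen message keys; the eldest entries are evicted.
  private final Map<String, Boolean> recent =
      new LinkedHashMap<String, Boolean>(MAX_RECENT, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
          return size() > MAX_RECENT;
        }
      };

  /** Sends the record to the stream unless an identical one was seen recently. */
  public void maybeSend(String messageKey, String xmlRecord) throws Exception {
    if (recent.put(messageKey, Boolean.TRUE) != null) {
      return; // duplicate of a recently seen report, skip it
    }
    HttpURLConnection conn = (HttpURLConnection) new URL(STREAM_URL).openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(xmlRecord.getBytes(StandardCharsets.UTF_8));
    }
    conn.getResponseCode(); // drain the response before moving on
    conn.disconnect();
  }
}
```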

We also put together a simple UI using Google UI devtools (I was not the frontend guy) to display the location of flights, the number of flights over time (per minute), and the list of destinations from a given airport with the number of flights to each (a Sankey diagram). One thing I have learnt from past hackathons is that, no matter what backend magic you conjure, if you want to win the prize, your project had better have a fancy UI! Or have something that moves (maybe next time 🙂 ). And this is among a group of backend engineers!

UI that shows simple flight analytics

We got the application running on a 5-node CDH Hadoop cluster on Google Compute Engine in no time (thanks to Coopr!). We deployed the CDAP application, which contained the AeroCask flow and a service that implemented the REST endpoints to serve the data (a few of the endpoints used the Cypher query language against Neo4j). We got some good flight activity around the time we were demoing (for real-time geo tracking), and we were able to show the analytics we could perform over the data (including the historical data we had collected! #BigData).
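For a flavor of what one of those endpoints could look like, here is a sketch against the CDAP 3.x HTTP service API. The path, the Airport/Flight graph model, and the Cypher string are illustrative assumptions, and the query itself would run through a Neo4j driver or OGM session that is elided here.

```java
import co.cask.cdap.api.service.http.AbstractHttpServiceHandler;
import co.cask.cdap.api.service.http.HttpServiceRequest;
import co.cask.cdap.api.service.http.HttpServiceResponder;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;

public class FlightServiceHandler extends AbstractHttpServiceHandler {

  // Destinations reachable from a given airport, with flight counts,
  // answered by a Cypher query against the Neo4j server.
  @GET
  @Path("airports/{code}/destinations")
  public void destinations(HttpServiceRequest request, HttpServiceResponder responder,
                           @PathParam("code") String code) {
    String cypher =
        "MATCH (src:Airport {code: {code}})<-[:DEPARTS_FROM]-(f:Flight)-[:ARRIVES_AT]->(dst:Airport) "
      + "RETURN dst.code AS destination, count(f) AS flights ORDER BY flights DESC";
    responder.sendJson(runQuery(cypher, code));
  }

  private Object runQuery(String cypher, String code) {
    // placeholder in this sketch; the real code went through Neo4j-OGM
    return null;
  }
}
```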

Neo4j dashboard that shows connections established between nodes

And finally it was time for the vote! CaskDrive, another hackathon project, was a close contender, but we eventually managed to win by one vote (even though the CaskDrive team had a much slicker UI)! It was good to work with some real real-time data (other than Twitter, which has been beaten to death by tons of examples). Now I have to go and dig deeper into the ADS-B receiver to figure out how it works and what else I can do with that nifty little thing. If you too would like to build cool big data applications, try out CDAP today.
