Over the past few months I started hearing the term “data pipeline” more and more at the local data meetups. Curious as to just what that meant, I looked it up. In this post I'm going to tell you what I found, and more importantly provide real-world examples of data pipelines you can use for your data projects.
Data Pipeline, The “For Dummies” Version
In brief, a data pipeline is a series of steps that your data moves through. The output of one step in the process becomes the input of the next. Data, typically raw data, goes in one side, goes through a series of steps, and then pops out the other end ready for use or already analyzed.
The steps of a data pipeline can include cleaning, transforming, merging, modeling and more, in any combination.
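To make that concrete, here is a minimal sketch in Python of a pipeline as a chain of steps. The step names (clean, transform, summarize), the field names and the sample rows are all made up for illustration; the only point is that each step's output becomes the next step's input.

```python
from functools import reduce


def clean(rows):
    """Drop rows with missing or blank values and strip stray whitespace."""
    return [
        {key: value.strip() for key, value in row.items()}
        for row in rows
        if all(value is not None and value.strip() for value in row.values())
    ]


def transform(rows):
    """Convert the amount field from text to a number."""
    return [{**row, "amount": float(row["amount"])} for row in rows]


def summarize(rows):
    """Total the amounts per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


def run_pipeline(data, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), steps, data)


# Hypothetical raw input: messy strings and one incomplete row.
raw = [
    {"customer": " alice ", "amount": "10.5"},
    {"customer": "bob", "amount": None},
    {"customer": "alice", "amount": "4.5"},
    {"customer": "bob", "amount": " 3 "},
]

result = run_pipeline(raw, [clean, transform, summarize])
print(result)
```

Because the steps are just a list, you can reorder them, drop one, or add a merging or modeling step without touching the others.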
Easy to set up? As always, that depends. As promised, let's look at a few examples.
Example 1: Amazon Web Services (AWS)
The AWS Data Pipeline is “a web service that you can use to automate the movement and transformation of data.” In essence you do three things:
- Define data nodes
- Schedule compute activities
- Activate & monitor your setup
Let's look at a bit more detail for each of our steps.
AWS Data Pipeline (AWS DP) works with data stored in S3, DynamoDB, Redshift, RDS, and JDBC sources (relational databases sitting on EC2 instances). It also works with files stored on EC2 instances, as in the web log example later in this section.
Compute activities are the work that's performed on the data. You can use one or more of the ones Amazon has built-in, or create your own custom ones. A few of the built-in ones are:
- CopyActivity: as it sounds, it copies data from one location to another
- HiveActivity: runs a Hive query on an Amazon EMR cluster
- ShellCommandActivity: runs a custom UNIX/Linux shell command
Activation & Monitoring
Once you activate your pipeline it's pretty hands-off until something goes wrong. When that happens, whether it's because of a dependency failure or user cancellation, you'll get an alert so you can handle it.
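The general idea of "run the steps, and alert when one fails or when something it depends on failed" can be sketched in a few lines of plain Python. To be clear, this is not the AWS DP API or how Amazon implements it; the function `run_with_alerts`, the step names and the alert callback are all hypothetical.

```python
def run_with_alerts(steps, alert):
    """Run steps in order and report every problem through alert().

    steps is a list of (name, function, names of steps it depends on).
    A step is skipped if any of its dependencies did not finish cleanly.
    """
    status = {}
    for name, func, depends_on in steps:
        failed_deps = [dep for dep in depends_on if status.get(dep) != "ok"]
        if failed_deps:
            status[name] = "dependency_failed"
            alert(f"{name} skipped: dependency failed ({', '.join(failed_deps)})")
            continue
        try:
            func()
            status[name] = "ok"
        except Exception as error:
            status[name] = "failed"
            alert(f"{name} failed: {error}")
    return status


def copy_logs():
    # Simulated failure for the sake of the example.
    raise IOError("web server unreachable")


def analyze_logs():
    pass


alerts = []  # A real setup would send an email or a chat message instead.
status = run_with_alerts(
    [
        ("copy_logs", copy_logs, []),
        ("analyze_logs", analyze_logs, ["copy_logs"]),
    ],
    alert=alerts.append,
)
print(status)
print(alerts)
```

Here the copy step fails, so the analysis step is never attempted, and both events end up in the alert list.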
A simple example for archiving and analyzing web logs looks like this:
Every day at a scheduled time, an activity copies log files from a web server running on an Amazon EC2 instance to S3. On a weekly basis, another activity kicks off and performs an analysis of the log files using Amazon EMR (Elastic MapReduce).
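The daily-plus-weekly schedule in that example boils down to a simple rule, sketched here in plain Python. The activity names, the `SCHEDULE` table and the start date are my own placeholders, not anything AWS DP defines; in the real service you would declare the schedule in the pipeline definition rather than compute it yourself.

```python
from datetime import date, timedelta

# Hypothetical schedule: activity name -> number of days between runs.
SCHEDULE = {"copy_logs_to_s3": 1, "analyze_logs_on_emr": 7}


def due_activities(start, today, schedule=SCHEDULE):
    """Return the activities whose period evenly divides the days since start."""
    elapsed = (today - start).days
    if elapsed < 0:
        return []  # The pipeline has not been activated yet.
    return [name for name, period in schedule.items() if elapsed % period == 0]


start = date(2024, 1, 1)  # Arbitrary activation date for the example.
for offset in (0, 1, 7):
    day = start + timedelta(days=offset)
    print(day.isoformat(), due_activities(start, day))
```

The copy runs every day, while the analysis only shows up on the activation day and every seventh day after it.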
Will It Work For You?
Whether or not you're using Amazon Web Services exclusively, AWS DP is a compelling solution if it fits with your technology stack.
Example 2: DIY
This is my domain – setting up a custom data pipeline for a specific use case. At work, we have something that looks like this:
First we import data from files, mainly spreadsheets, into the database while simultaneously cleaning and transforming it. Next we run our custom analysis algorithms, which perform the bulk of the work. Because some of the analysis requires manual verification, we produce a report with everything we need a person to look at. Finally we import the reviewed files into the database, re-run our analysis, and produce our report.
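Stripped of the databases and the R code, the shape of that workflow looks roughly like the Python sketch below. Everything in it is invented for illustration: the field names, the "flag values over 100 for review" rule, and the corrections dictionary standing in for the reviewed files. Our actual analysis is far more involved.

```python
def import_and_clean(raw_rows):
    """Import step: tidy names and parse values while loading."""
    return [
        {"name": row["name"].strip().title(), "value": float(row["value"])}
        for row in raw_rows
    ]


def analyze(rows, threshold=100.0):
    """Analysis step: flag unreviewed rows that a person should verify.

    The threshold rule is a hypothetical stand-in for real analysis logic.
    """
    return [
        {**row, "needs_review": row["value"] > threshold and not row.get("reviewed", False)}
        for row in rows
    ]


def review_report(rows):
    """Report step: everything we need a person to look at."""
    return [row for row in rows if row["needs_review"]]


def apply_review(rows, corrections):
    """Re-import step: merge the reviewer's corrections back in."""
    merged = []
    for row in rows:
        updated = {**row, "reviewed": row["name"] in corrections}
        if row["name"] in corrections:
            updated["value"] = corrections[row["name"]]
        merged.append(updated)
    return merged


raw = [{"name": " acme ", "value": "42"}, {"name": "globex", "value": "1250"}]

first_pass = analyze(import_and_clean(raw))
to_review = review_report(first_pass)

# Pretend a person checked the report and fixed a data-entry error.
corrections = {"Globex": 125.0}

final = analyze(apply_review(first_pass, corrections))
print([row["name"] for row in to_review])
print(review_report(final))
```

The manual review sits in the middle as an ordinary step: the pipeline pauses, a person edits the flagged rows, and the same analysis runs again on the corrected data.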
There are a number of moving pieces in this setup:
- A few databases
- R for importing, cleaning and transforming
- SQL and Python for analysis and some additional non-automated cleaning
- R for producing the report
Everything currently runs in house.
This setup is what I call semi-automated. We can't fully automate it because we don't know what format the data will arrive in each week, and that format changes constantly. Having said that, we're working with our clients to create standards. Once that happens, except for the manual reviews we need to perform (and there will always be some manual review), the rest will be fully automated.
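Once a standard exists, deciding whether a file can go down the automated path could be as simple as checking its header row. This is only a sketch of that idea: the column names in `STANDARD_COLUMNS` and the `route_file` function are hypothetical, not the standard we are actually negotiating.

```python
import csv
import io

# Hypothetical agreed standard: columns every client file must contain.
STANDARD_COLUMNS = {"client_id", "date", "amount"}


def route_file(text):
    """Return 'automated' if the header row covers the standard, else 'manual'."""
    header = next(csv.reader(io.StringIO(text)), [])
    columns = {name.strip().lower() for name in header}
    return "automated" if STANDARD_COLUMNS <= columns else "manual"


conforming = "client_id,date,amount\n7,2024-01-01,10\n"
nonconforming = "Customer,When,Total\n7,1/1/24,10\n"

print(route_file(conforming))
print(route_file(nonconforming))
```

Files that fail the check fall back to the hand-cleaning we do today, so nothing breaks while clients move over to the standard.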
New Terminology For An Established Concept
The simplest way to look at a data pipeline is to understand that it's a new term for something all of us have been doing for a long time – implementing custom workflows. In this instance, however, it's a workflow for working with data.
Are you using a data pipeline in your projects, and if so, what does it look like?
Photo courtesy of Maureen