
Data Science in Five Steps

When my daughter Palamee was younger, she watched a cartoon with a character named Special Agent Oso. Oso would complete his missions using three simple steps. In a recent conversation I was asked to provide my definition of data science. Today I'm going to provide that definition in not three easy steps, but five, and show a real-world implementation.

Data Science is a Process

It's easy to get caught up in specific aspects of data science such as data engineering or analysis. These are parts of data science, but they aren't the entire thing.

Data science is really a five-step process for creating solutions to problems. Those steps are:

  1. Define the Problem
  2. Gather and Review Available Data
  3. Form a Hypothesis
  4. Test Your Hypothesis
  5. Visualize and Report

That looks a lot like a compact version of the six steps of the scientific method:

  1. Ask a Question
  2. Do Background Research
  3. Construct a Hypothesis
  4. Test Your Hypothesis by Doing an Experiment
  5. Analyze Your Data and Draw a Conclusion
  6. Communicate Your Results

So really, data science is an application of the scientific method.

Let's look at each of the steps.

Step 1: Define the Problem

Ask the Right Question

Everything starts with the problem you're trying to solve. I will admit I've been guilty of starting with data, performing exploratory analysis, and then attempting to figure out what questions I can answer.

That was always a non-starter. The process must be driven by an existing problem, and what is a problem if not a question waiting to be answered?

An example from my work is this:

Problem: we need to ensure the claimant information people enter into our system is correct.

Question: how do we accurately verify that the claimant information entered into our system matches the claimant's actual identity?

With a well-defined question, we can begin to see a way of solving our problem.

Step 2: Gather and Review Available Data

Once you have restated your problem as a question, the next step is to gather data you believe will help lead to an answer. You may have this data on hand or it may need to be acquired using a method like web scraping. Regardless, gather your data.

In my case, I had data from three sources:

  • Our claim processing system – contained the claimant records I needed to verify.
  • A background check service – contained information I could use to compare against the information in our claim processing system.
  • Previously labeled data – a set of 55,000 records which had been run through the background check service and marked, by a person, as verified or not.

Knowing which data is relevant to your question is part art and part science; sometimes trial, sometimes error.

With data in hand you can further define your potential solution, and move to step three.

Step 3: Form a Hypothesis

A hypothesis is your best guess as to what the outcome will be. This can be written as a simple “if this then this” statement, or something more complex. My hypothesis was a three-parter:

  1. I can use a combination of string comparison algorithms and a predictive model to accurately determine whether or not a given claimant's information matches what the background check service provides for that claimant.
  2. A combination of name and social security number will be the determining factor for whether or not a record is considered verified.
  3. I can use a predictive model to significantly reduce the amount of time spent manually verifying what the background check service returns.
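The string-comparison half of that first point can be sketched in a few lines. This is a minimal illustration using Python's standard-library difflib, not my production code; the function name and the example names are assumptions for demonstration only:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare a claimant's entered name against what the
# background check service returned for that claimant.
entered = "Jonathan Smith"
returned = "Jon Smith"
score = name_similarity(entered, returned)
print(f"similarity: {score:.2f}")
```

In practice you would compute a score like this for each field pair (names, addresses, and so on) and feed those scores into the predictive model rather than eyeballing a single threshold.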


With our hypothesis in hand we move to the step everyone loves to get to first – testing.

Step 4: Test Your Hypothesis

This is where the rubber meets the road, my friend. If you've properly defined your problem as a question, gathered the correct data, and created a well-formed hypothesis, you can build a solution that produces an answer. You could produce an answer regardless, but if you want it to be right, or close to right, you need your ducks in a row first.

Your solution may include data exploration, data engineering tasks such as setting up infrastructure and importing data into databases, data analysis tasks such as creating a predictive model, and more. Essentially you need to do all the things in order to get your answer.

My solution consisted of:

  • Measurements of the time required to perform the process before the model implementation and after
  • Python scripts to create and train a predictive model
  • A data science tool (Data Science Studio) to create a predictive model
  • SQL queries to pull data from our system and upload it to the background check service
  • Python scripts to analyze the data returned
  • An ETL tool (Pentaho Data Integration) to save the results of the analysis back to our production database
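To give a feel for the model-training bullet, here is a rough sketch of training a classifier on field-level comparison features derived from labeled records. The feature layout, toy data, and choice of a decision tree are all illustrative assumptions, not my actual production pipeline:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy feature vectors: [name_similarity, ssn_match, dob_match],
# derived from previously labeled records. Labels: 1 = verified.
X = [
    [0.95, 1, 1],
    [0.90, 1, 0],
    [0.40, 0, 1],
    [0.20, 0, 0],
    [0.85, 1, 1],
    [0.30, 0, 0],
]
y = [1, 1, 0, 0, 1, 0]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Classify a new record with a strong name match and a matching SSN.
prediction = model.predict([[0.92, 1, 1]])
print(prediction[0])
```

The real work, as noted below, was in tweaking the scoring algorithm and trying different models against the 55,000 labeled records until the results were production-worthy.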

Those six bullet points hide the complexity. There was a lot of trial and error as I created and tweaked my scoring algorithm and the predictive models. It took a few weeks to arrive at a solution that worked well enough to put into production.

Once I had my answer, I had to show it to management to get their sign off. That's where step five comes in.

Step 5: Visualize and Report

This step can take many forms: charty goodness, a multi-page PDF, a presentation, or any other form you choose to communicate your results. Choose your reporting method based on your audience. Simple charts in presentations are excellent when reporting to management. Sharing code with more technical audiences can be highly effective. A downloadable PDF is a great marketing tool and can contain all of the above.

I wanted to show the benefits gained using a predictive model:

  1. Improved record comparison accuracy
  2. Reduced time spent manually verifying records
  3. Ability to scale the process

To do so I created some charty goodness showing an overview of how the process was implemented in production, along with the before-and-after measurements.
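A before-and-after chart of that sort takes only a few lines of matplotlib. The numbers below are placeholders, not my actual measurements:

```python
import os

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Placeholder measurements: manual verification hours per week.
labels = ["Before model", "After model"]
hours = [40, 4]

fig, ax = plt.subplots()
ax.bar(labels, hours, color=["#d9534f", "#5cb85c"])
ax.set_ylabel("Manual verification hours per week")
ax.set_title("Time spent verifying records")
fig.savefig("verification_before_after.png")
saved = os.path.exists("verification_before_after.png")
```

A single bar chart like this often lands better with management than pages of metrics.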

Was My Hypothesis Correct?

My hypothesis turned out to be (almost completely) correct, and we gained significant improvements in all important areas. To review, here's my initial hypothesis:

  1. I can use a combination of string comparison algorithms and a predictive model to accurately determine whether or not a given claimant's information matches what the background check service provides for that claimant.
  2. A combination of name and social security number will be the determining factor for whether or not a record is considered verified.
  3. I can use a predictive model to significantly reduce the amount of time spent manually verifying what the background check service returns.

Here are the results:

  1. I was able to successfully use string comparison algorithms and a predictive model to accurately verify a claimant's information.
  2. The model considered date of birth and one of the name combinations I produced as the most relevant factors for verifying a record, so I was wrong on this point.
  3. By using the predictive model we increased automatic record verification by ~80% and reduced visual inspections required by ~90%.

Life is good.

Your Turn

What is your definition of data science? Does it match up with mine?
