There are a few age-old questions that businesses have been trying to answer for centuries:

“How can we reach more customers?”

“How do we keep the customers we already have?”

“How do we convince our customers to do more business with us?”

Lucky you, living in a time where “Big Data” promises big answers to these big questions! You think to yourself, “I have all this data lying around, I bet I can just send these tables to a data scientist and they can make some models; then we’ll have all of our answers!” Unfortunately, it’s not quite that simple. The key feature that makes predictive analytics possible is the large volume of data that can be used to create a predictive model. But with large amounts of data, there is an essential step on the front-end, before any modeling can even take place; it’s what we like to call “Cleaning & Linking.”



Before we begin any form of data science work, we go through a process of data cleaning. During this process, the primary goal is to identify data values that are suspect, out of the ordinary, and potentially incorrect. We refer to these types of data values as “flagged” data values. Here is an example of what we might see in an un-cleaned data set:

Table 1

Unique ID Name City State Birth Date
123 Les Brown Atlanta Georgia 05-03-1962
456 Mary Smith Pittsburg PA 06-15-198×2
789 John Peters Pittsburgh Penn 06-15-1982

If you notice, there are a few inconsistencies or abnormalities in Table 1. Under the City field, the city ‘Pittsburgh’ is misspelled on one occasion. If you were to sort every person based on their city, Mary would not be counted among the other people who live in Pittsburgh because, to a computer, ‘Pittsburgh’ and ‘Pittsburg’ are not the same place.

Likewise, in the State column, there are three different ways to indicate a state’s name. There is the full name spelled out, the postal abbreviation, and an unofficial state abbreviation. The Birth Date column also seems to have a stray character ‘x’ in Mary Smith’s birth date. In order to compare values, there must be a consistent format for all data. The solution to this problem is Data Cleaning.


How exactly do we do this?

We have a series of summarization tools that look deep into each field of data. These tools predict the type of data in a given field, display categorical information and basic statistics, and even identify some data values that don’t seem to fit in with the rest of the data in that field. With this information, we define a cleaning schema for each data source. This process is very streamlined, because we develop and reuse cleaners for common data types that we encounter. However, very customized and specific cleaning rules can also be defined as the data requires.

After establishing a cleaning schema for a data source, we can run each record of that data source against the schema to identify the existence of any and all flagged data values. After discussing specific types of flagged data values with a client, we can ensure that only the most reliable and accurate data is passed through to future data science initiatives. By doing so, we also ensure, with great confidence, that our data science insights are accurate and reliable.

To use the example above, for the State field, our tool will go through each cell in the column. If the cleaning schema for this field is to use the official postal abbreviation, the script will cycle through each cell and check to see if it is a valid US postal abbreviation. If it is not, our cleaner utilities will change the data in the cell to the appropriate postal abbreviation by replacing ‘Penn’ with ‘PA’ and ‘Georgia’ with ‘GA’. Additionally, all altered data values will be flagged to acknowledge that a change took place. The same process can be scaled to incorporate a large number of data sources for any sort of analytical investigation.


The Goal: An ‘Atom-Centric’ View

Let’s think about the three most common questions a company wants to answer using predictive analytics:

  1. How do we retain existing customers who are most likely to churn? (Retention)
  2. How do we identify the customers that would be interested in purchasing additional products/services that our company offers? (Upsell)
  3. How do we identify people in the general population who would be our ideal customers? (New Acquisition)

In answering these questions, the focus of the analysis is an individual—an existing customer or a potential customer. However, in some cases, like a cable company or a cell-phone service provider, an individual account or entire household is the focal point of an analysis. The common thread is the interest in an individual unit—the smallest, indivisible entity. We call this entity the ‘atom,’ because, in abstract terms, it is the smallest complete unit of data, and the unit of interest. To answer the three questions above, we want to look at each piece of data in relation to the ‘atom’ to which it belongs. This leads to the next important step in the process of getting the data ready for predictive modeling—linking.



In any sort of data science analysis, multiple data sets exist. After the data sets are aggregated from disparate sources and cleaned, it’s time to concentrate on obtaining an atom-centric view. With an understanding of the relationship between data fields across multiple data sets, linking rules can be defined to aggregate and structure all of the atom’s data from every data source within a single row of data. The ultimate goal of the linking process is to develop a singular, ‘atom-centric,’ cleaned & linked dataset for further analysis.

However, this reduction step is far from a trivial task. Relational databases are typically designed to be as modular as possible. In databases, modularity allows for optimal efficiency, maintainability, and scalability. While this modular approach is optimal for architectural design and performance, it does not easily lend itself to data science initiatives where you want to consider all data from multiple disparate data sources simultaneously for analysis. Additionally, databases could contain different types of data. Some could contain descriptive data like name or phone number while others could contain transactional data like billing records or website activity logs. Creating this aggregated representation of all types of data around an entity is necessary to fully understand the entity of interest.

Once this process is complete, we can pass a single, cleaned, ‘atom-centric’ dataset to the Data Scientists and let them do their thing. At the end of the day, it is about having access to all of the data necessary to understand the actions, behaviors, habits, and trends of the ‘atom’ so we can better understand the individual as a whole. With that knowledge, we can create insights to drive changes in those actions, behaviors, habits, and trends, and ultimately, provide answers to a company’s most challenging questions.