For those of you following our blog, you should by now have an idea of the kinds of questions that can be answered through data analytics. In last month’s post, we outlined some of the basic forms of analysis we perform after we’ve Featurized a Cleaned and Linked dataset into a set of indicators. This month, we’ll take a deeper look at the concept of Predictive Modeling.
Predictive Modeling in General
Predictive modeling is the process of developing a function that can accurately generate an outcome probability estimate based on some input data. When we’re working with large quantities of data, commonly known as “Big Data,” this is very difficult for a human to do. Instead, we use advanced techniques developed by statisticians, mathematicians, and Artificial Intelligence (AI) specialists to create computer models that search for patterns in large batches of information. These methods are called Machine Learning algorithms. Tom Mitchell, one of the titans of this field, eloquently sums up Machine Learning: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” In essence, we write programs that improve a computer’s decision-making as it sees more data.
Machine Learning specialists have spent decades working on this and many related problems. Machine Learning is broadly separated into two classes of problems – supervised and unsupervised. In supervised learning, we train machines to predict a specific indicator (e.g., “Will I play tennis today?”), whereas unsupervised learning focuses on identifying previously unknown groupings within data. Within this post, we’ll focus on supervised learning, which is, in turn, broken into two large categories – classification and regression.
Supervised Machine Learning & Playing Tennis
The goal of both classification and regression is to take an indicator vector X and use that data to predict a target indicator Y. Mathematically stated, we’d like to find a function f: X → Y that optimizes some performance metric. It may help to consider some example data. Take the canonical PlayTennis dataset:
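For readers following along in code, the dataset referenced here is the standard 14-example PlayTennis table from Mitchell (1997). As a sketch, it can be written out in Python, with each row pairing an indicator vector X (Outlook, Temperature, Humidity, Wind) with the target indicator Y (PlayTennis):

```python
# The 14 examples of the canonical PlayTennis dataset (Mitchell, 1997).
# Columns: Outlook, Temperature, Humidity, Wind, PlayTennis (the target).
DATA = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]

# Nine positive and five negative examples in total.
yes_count = sum(1 for row in DATA if row[-1] == "Yes")
```
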
As you can see, PlayTennis is a categorical indicator – all values are ‘yes’ or ‘no’. The nature of the target indicator determines whether we use Classification or Regression. Were the target continuous, perhaps on the range from 0 to 100, we would instead turn to Regression to build our model. Within the context of Classification, we’d like to use our measurements for Outlook, Temperature, Humidity, and Wind to predict the indicator PlayTennis. A popular method for achieving this is a Decision Tree. This algorithm recursively divides and conquers the available data to produce an intuitive model. If we were to plug the above data into a Decision Tree learning algorithm, we would get an output that looks something like this:
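To sketch what “divide and conquer” means here: an ID3-style decision tree learner chooses its root split by computing the information gain of each indicator and picking the largest. A minimal pure-Python illustration (the function names are ours, not from any particular library), using the same 14 PlayTennis examples:

```python
import math
from collections import Counter

# The 14 PlayTennis examples (Mitchell, 1997): Outlook, Temperature,
# Humidity, Wind, and the target PlayTennis.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, col):
    """Entropy reduction from splitting the rows on column `col`."""
    base = entropy([r[-1] for r in rows])
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

gains = {FEATURES[i]: information_gain(DATA, i) for i in range(len(FEATURES))}
best_split = max(gains, key=gains.get)  # Outlook wins, with gain ~0.247 bits
```

Outlook’s gain (~0.247 bits) beats Humidity (~0.151), Wind (~0.048), and Temperature (~0.029), which is why Outlook ends up at the root of the tree shown above; the learner then repeats the same calculation within each Outlook branch.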
Predicting & Playing Tennis
Based on our observations, we can predict that whenever it is Sunny with High humidity, I am unlikely to play tennis. Likewise, if the day is Overcast, I will likely play a match. You may notice that temperature, which was listed in the dataset, is not included in this Decision Tree model. This illustrates the idea that not every piece of data is necessary to make an accurate prediction. In this case, the temperature indicator adds no predictive power.
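The learned tree can be written directly as a prediction function – our function f(X) → Y. Note that temperature never appears as an input (the function name is illustrative):

```python
def predict_play_tennis(outlook, humidity, wind):
    """Apply the decision tree learned from the PlayTennis data.

    The tree splits on Outlook first, then on Humidity (Sunny branch)
    or Wind (Rain branch). Temperature is never consulted.
    """
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    # Remaining branch: outlook == "Rain"
    return "Yes" if wind == "Weak" else "No"
```

For example, `predict_play_tennis("Sunny", "High", "Weak")` returns "No", matching the Sunny/High-humidity rows of the dataset, while any Overcast day returns "Yes".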
While this is a very simplistic example, it gives a basic overview of how a predictive model works. With larger datasets and more complex problems, there are many challenges with a myriad of possible solutions. However, this example illustrates the power of Machine Learning to produce insightful predictive models.
Mitchell, T. (1997). Machine Learning. McGraw-Hill.