Understanding machine learning and predictive analytics

By Vedant Misra | April 10, 2016

Welcome to another edition of Demystifying Overused Marketing Terms. The last edition was about Big Data, which you can find here.

This time, we’re talking about “predictive analytics” and “machine learning.” The reason we’re doing this is that just the other day I was walking around the expo floor of a big sales conference in town and overheard the following:

  • Guy A: We don’t do predictive anymore, now it’s all about machine learning.
  • Guy B: Yeah same here. Predictive is fine, but all our data science guys are switching to machine learning now.

If that doesn’t make you facepalm, read on to understand why it should.

Predictive analytics versus machine learning

“Predictive analytics” is the name used in business environments to refer to a set of techniques that are used to find patterns in a dataset and use those patterns to make predictions.

If you stare at a dataset long enough, make some charts in Excel, and come up with the hypothesis that your company could hit next year’s revenue goals more easily if you focus on the biotech industry, you just did predictive analytics. It doesn’t really matter what technique you used; you took quantitative information and made a prediction with it.

But maybe you have a lot of data about your customers, and while in theory you could just make a bunch of graphs in Excel to come up with a plan of action, you’re certain that it would take too long.

That’s when you use machine learning instead. “Machine learning” is the name originally used in computer science departments to refer to a set of techniques used to find patterns in a dataset. Those patterns are then applied to let a computer program do things it wasn’t explicitly programmed to do. In general, for a program to do things it wasn’t explicitly programmed to do, it has to make predictions, which is why all machine learning is predictive. Read on to see how.

Machine Learning

There’s no better way to understand why prediction is at the center of machine learning than to work through an example.

Suppose you have a CSV file where each row is a customer, and each column is some information about that customer. Maybe the columns are “Company Name”, “Headcount”, “Annual Revenue”, and “Industry”. Suppose there are 10,000 rows.

Let’s throw on two extra columns with the headers “Deal was Closed-Won” and “Deal Size”. Suppose about 2,000 rows have those columns filled in—the first of the two has the values “Yes” or “No”, and the second contains a number that tells you how much the deal was worth.

Now, you want to predict which of the other 8,000 are good targets for you to sell to. This is a great problem to throw at a machine learning algorithm. Yes, with only six columns in total, one of which is just a label (company name), you could probably derive some insights about your customers without any machine learning. But for simplicity we’ll stick with this relatively small dataset.

The job of a machine learning algorithm is to find some relationship among the four descriptive columns that determines whether a company is a good buyer. Ultimately, the problem is one of filling in missing values, which is the same thing as prediction. You want the algorithm to predict what values would be in the missing cells if they were filled in.
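To make that concrete, here’s a minimal sketch in Python with pandas, assuming a hypothetical customers.csv with the columns described above (the file name and column names are just illustrations from this example, not anything standard):

```python
import pandas as pd

# Load the hypothetical customer file described above.
df = pd.read_csv("customers.csv")

# The ~2,000 rows where the outcome columns are filled in are your labeled examples.
labeled = df[df["Deal was Closed-Won"].notna()]

# The other ~8,000 rows are the "missing cells" you want the algorithm to fill in.
unlabeled = df[df["Deal was Closed-Won"].isna()]

print(len(labeled), "labeled rows,", len(unlabeled), "rows to predict")
```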

Classification

If you looked at the dataset long enough maybe you’d notice some pattern, like “companies don’t buy from us when they have over $10M in annual revenue”. Or “Very few companies in Pharma have bought our product.” Identifying specific rules like this is the output of a model category called a decision tree. If you build a bunch of decision trees on random subsets of your data and combine their answers, you have a random forest.

Either way, this is a classification problem, because you want to classify each company into one of two classes: “Yes” or “No” in the “Deal was Closed-Won” column.
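As a rough sketch of what that looks like in code, here’s one way to fit a random forest with scikit-learn, reusing the labeled and unlabeled frames from the earlier snippet; the feature choices and encoding here are assumptions for this example, not a prescription:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Use the descriptive columns as features (skip Company Name, which is just a label),
# and one-hot encode the categorical Industry column.
feature_cols = ["Headcount", "Annual Revenue", "Industry"]
X_labeled = pd.get_dummies(labeled[feature_cols], columns=["Industry"])
y_labeled = labeled["Deal was Closed-Won"]  # "Yes" / "No"

X_unlabeled = pd.get_dummies(unlabeled[feature_cols], columns=["Industry"])
# Align the one-hot columns so both frames have the same layout.
X_unlabeled = X_unlabeled.reindex(columns=X_labeled.columns, fill_value=0)

# A random forest: many decision trees whose answers get combined.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_labeled, y_labeled)

# Classify the 8,000 companies you haven't sold to yet as "Yes" or "No".
predictions = clf.predict(X_unlabeled)
```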

Regression

But it’s more powerful to try to predict how much a deal would be worth with each of these companies. When you’re trying to predict a number instead, you’re dealing with a regression problem.

Say you look at the data long enough to realize, “hey, I can guess the rough deal size for each of these companies by just looking at the headcount.” That would likely be the case if you sell your product on a per-seat basis, for example.

Suppose you discover that deal size is roughly within 30% of the following:

$$ \text{Headcount} \times \underbrace{0.4}_{\text{Fraction of the company that uses your product}} \times \underbrace{600}_{\text{Annual price per seat}} + \underbrace{100}_{\text{Setup fee}} $$

Congratulations: your brain just built a linear regression model. It’s called a linear model because the only thing you’re doing with your inputs is multiplying them by stuff and adding stuff to them. You multiplied headcount by 0.4, multiplied that by the per-seat cost, and added the setup fee. And the only input you used was headcount.
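In code, that eyeballed rule is a one-line function, and a regression algorithm would learn roughly the same coefficients for you from the labeled rows. A minimal sketch, reusing the labeled frame from the earlier snippet (the 0.4, 600, and 100 are the made-up numbers from the formula above):

```python
from sklearn.linear_model import LinearRegression

def predict_deal_size(headcount):
    """The eyeballed rule: headcount * 0.4 (seats) * 600 (price per seat) + 100 (setup fee)."""
    return headcount * 0.4 * 600 + 100

print(predict_deal_size(50))  # 12100.0

# A linear regression algorithm learns the slope and intercept from the
# ~2,000 labeled rows instead of you eyeballing them.
X = labeled[["Headcount"]]
y = labeled["Deal Size"]
model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # roughly 240 and 100 if the rule above holds
```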

Suppose instead you discovered that Pharma companies with a low headcount don’t generate as much revenue as Software companies with a low headcount. A linear model as simple as the form above wouldn’t be able to capture that subtlety. You’d need a more complex linear model—one that looks at pairs of inputs together.

Or maybe it’s the case that your deal size goes up with revenue for companies up to a certain size, but then dips down a bit. A linear model wouldn’t be able to capture that complexity at all. You would need a nonlinear model, which is a model that simply does more with your inputs than multiplying them by stuff and adding stuff to them.
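To sketch what “doing more with your inputs” can look like, here are two options with scikit-learn, reusing X_labeled and labeled from the earlier snippets: interaction terms for the pairs-of-inputs case, and a tree-based regressor that can learn the rise-then-dip shape. Neither is the one right choice; they’re just illustrations.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = X_labeled            # encoded features from the classification snippet
y = labeled["Deal Size"]

# Option 1: still linear, but with pairwise interaction terms, so a combination
# like Headcount x Industry=Pharma gets its own coefficient.
interaction_model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
interaction_model.fit(X, y)

# Option 2: a genuinely nonlinear model, which can learn shapes like
# "deal size rises with revenue up to a point, then dips".
nonlinear_model = RandomForestRegressor(n_estimators=100, random_state=0)
nonlinear_model.fit(X, y)
```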

Unsupervised Learning

All of this has been an example of supervised learning, merely by virtue of the fact that you have those two extra columns to work with: the numerical variable for deal size, and the categorical variable for whether you closed a deal with each company.

If you didn’t have those values, you could still use a machine learning algorithm to do something useful with your data. You would need to use an unsupervised learning algorithm, which is an algorithm that finds structure in your data without trying to predict any particular value.

This is still a form of prediction, however—the algorithm is trying to predict something about the world that generated this data.

For example, an unsupervised learning algorithm might find that if you draw a graph with annual revenue on one axis and headcount on the other, the two things tend to fall on a line. But there are some really efficient companies that have outsized revenue relative to headcount, which is often the case for Tech companies, and other companies that need way more headcount to achieve the same revenue—often the case for low-tech companies.

The algorithm would then be able to “predict” for each new company whether it’s highly efficient, highly inefficient, or normal. You could then use that prediction to decide who to sell to. Presumably, highly efficient companies have bigger margins and more budget to spend on stuff.
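Here’s a minimal unsupervised sketch of that idea, assuming the same dataframe as before: cluster companies on headcount and revenue with k-means (one simple choice among many), then see which cluster a new company lands in.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# No labels involved -- just two descriptive columns from every company.
features = df[["Headcount", "Annual Revenue"]].to_numpy()

# Put both columns on a comparable scale so neither one dominates the distances.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# Ask for three clusters: roughly "efficient", "normal", and "inefficient".
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
df["cluster"] = kmeans.fit_predict(scaled)

# A new company gets "predicted" into whichever cluster it sits closest to.
new_company = scaler.transform([[120, 8_000_000]])  # hypothetical headcount and revenue
print(kmeans.predict(new_company))
```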

Wrapping up

We hope these examples help shed some light on what people are saying when they talk about machine learning and predictive analytics, as well as on the meanings of common terminology people use when talking about machine learning.

Next time you’re at a conference and you hear someone say something poorly supported or otherwise nonsensical, give them a piece of your mind for us.