In this blog post, I will explain the machine learning algorithms underlying classification in SAC Smart Predict. The focus will be on building the theoretical framework necessary to truly understand the output produced within SAP Analytics Cloud.
Martin Bregninge's previous post, ‘Understanding Regression with Smart Predict’, shares many similarities with this one, but we have intentionally written both posts to be read and understood independently of the other.
If you have already read the regression-post, many of the concepts discussed here will be familiar to you already. I have included small boxes like this one, describing sections that can be skipped, and providing additional insight by comparing classification to regression.
This section should not be skipped. Note that the setup for classification is the same as regression, except that our target value is a binary dimension instead of a measure.
Classification is the task of sorting objects into categories (or classes), based on the known values of some influencer variables that describe the objects. For example, classification can help us answer such questions as:
Influencer variables can be both dimensions (nominal variables such as the gender of a potential customer) and measures (numerical variables such as the temperature at which an item was manufactured). The target value must be a dimension, and each possible member of the dimension corresponds to a different category that the object can belong to.
You might notice that all the examples I’ve given are ‘yes/no’-questions – in other words, the target value can only take on two distinct values, such as ‘faulty’ or ‘not faulty’.
I have limited myself to these examples because the algorithm used by SAC Smart Predict performs binary classification; we cannot sort objects into more than two categories. Serious as this limitation may seem, it has its benefits. For one, it allows us to easily estimate the probability of belonging to each category (we’ll see how in a minute!) rather than just giving a deterministic classification. If we wish to classify many objects, we can thus sort them in order of probability, allowing us to reformulate our questions slightly:
This section can be skipped. In short: We will try to find a function which gives the probability that a certain input belongs to class 1 rather than class 0. Our final classification is based on this probability. Figure 2 illustrates how the classification problem can be reframed as a regression problem.
Throughout this blog post, I will return to the example of using influencer variables age and income to predict which potential customers are most likely to be receptive to our marketing.
To make our predictions, we will need a large set of labeled training data: people for which we know the influencer variables, and whether they are marketing-receptive or not. We might plot our training data as shown in figure 1(a).
Our goal will be to determine a function, f, which – given the age and income of a person – outputs an estimate of the probability that this person is marketing-receptive. For our toy dataset, one possible function is shown by the background color in figure 1(b). It should be clear for us as humans that this is a reasonable choice of the probability function. In the section on Maximum Likelihood, we will define formally what constitutes the ‘best’ choice, so that a computer can understand it.
For now, just note that (almost) all the marketing-receptive customers are predicted to be likely to be marketing-receptive, and similarly for the unreceptive customers. The model lines up well with reality, so it seems reasonable to expect that it will also make good predictions in new cases.
From the probability function, we can create a decision boundary (the black lines in figure 1(b)) which is used to perform the final classification. The decision is made by asking if the probability of being class 1 is above some threshold (in this case 50%).
Figure 1(a): Our training dataset contains 150 people. We know their age, their monthly income, and whether they are marketing-receptive customers or not.
Figure 1(b): A reasonable probability function and decision boundary. We will predict customers to be receptive if they are within one of the red areas. Note that not every point is correctly classified – that won’t always be possible.
Instead of using the background color to represent the probability f(x) (as in figure 1(b)), we could use a third axis, as shown in figure 2. We also group the training examples along the new axis according to their target value (encoded either as a 0 or 1, since there are only two possible target values).
For those familiar with regression, this visualization should make it clear that the regression framework is directly applicable. The only difference to simple regression, as we will see shortly, is that we can improve our loss function somewhat, because we want to interpret f as a probability function.
Figure 2: Each training data point consists of two influencer variable values and a target category. By assigning a number to each category (either 0 or 1) we can represent the points in three dimensions, and our desired probability function should look similar to the one in the figure.
This section should not be skipped.
Let’s try to formalize and generalize our setup: We will store the influencer variables of the i’th training example in a vector, xi, and we will encode the corresponding target value in a binary variable, yi. For example, we could have yi = 1 corresponding to a marketing-receptive customer and yi = 0 corresponding to the opposite.
There are two important questions to answer to find a good probability function f:
Since Smart Predict is meant to work for almost any dataset, we want the function type to have a lot of flexibility, so that it can describe very different relationships between input and output. This flexibility is provided by using a decision tree (explained below). We also want to ensure that the final output is a number between 0 and 1, so we run the decision tree’s output through a logistic sigmoid function (see figure 3).
Figure 3: The graph of the logistic sigmoid, σ(x) = 1/(1+exp(x)). It squeezes all the real numbers into the range between 0 and 1, while still retaining all the information, because it is an invertible function.
If you’ve ever heard of logistic regression, this idea of running a flexible function – in that case, a standard linear model – through a logistic sigmoid should be familiar.
The rest of this section is skippable.
But in our case, we use a decision tree instead of a standard linear model. A decision tree essentially asks a series of questions of its input and gives an output depending on the answers. The decision tree makes no prior assumptions about the relationships we want to describe, and it can handle both dimensions and measures as input. A very simple decision tree (followed by the logistic sigmoid) that tries to determine if a customer is marketing-receptive could look like this:
Figure 3: The graph of the logistic sigmoid, σ(x) = 1/(1+exp(x)). It squeezes all the real numbers into the range between 0 and 1, while still retaining all the information, because it is an invertible function.
In each ‘leaf’, the decision tree guesses on a constant value – the same for every input that ends in that leaf – which is then run through the logistic sigmoid
This section should not be skipped.
We need to define a measure of how ‘good’ or ‘bad’ a given model is, so that we can find the best one.
Recall that we used the Residual Sum of Squares as our Loss function for regression, and we wanted to minimize it, because it measured how ‘bad’ our model was; in the case of classification, we have a slightly more complex expression, and we want to maximize it, but the approach is the same.
We stated in the beginning that we want the predicted probabilities on all the training examples to match up well with the true category. In other words, the probability of the actual observations according to the model should be as large as possible. Let’s unpack what this means:
If f(xi) = 0.9, then the model predicts that there’s a 90% chance that example i is category 1, and a 10% chance it is category 0.
Therefore, if the actual value of yi is 1, the probability of that is 90% according to the model. On the other hand, if yi = 0, the probability of the actual observation according to the model is just 10%.
We can write this as
in which P(yi | f) should be read ‘the probability of the actual observation, yi, given that the model, f, is correct’. The second equality is true since lifting to the 0’th power makes any number into 1 (you can easily check that the two sides of the equality sign are the same in each of the cases yi = 0 and yi = 1).
By multiplying these probabilities from every point, we get the probability of all the observations given the model. Our final goal is thus to maximize
where n is the number of training examples.
Figure 5: The probability of the observations given the model is calculated by evaluating the function in each point and multiplying the resulting probabilities.
The above explains everything important about the idea of Maximum Likelihood estimation: We want to find the function that maximizes the probability of the observed data, which is given by equation (1). There’s still quite a bit of math to work through before we can find this function – it will be described in the appendix, but it requires that you have read the sections on finding the best decision tree and Gradient Boosting first.
This section is the same as in the regression blog post, except that we use the new loss-function.
We need to determine which questions to ask in the decision tree, and what probabilities to assign to each leaf. The possible questions that can be asked are:
The algorithm we will use to (hopefully) maximize formula (1) is a greedy algorithm which finds one new question to ask at a time. It starts off with an empty tree, chooses the ‘best’ question to ask, and then repeats the process on each of the created subbranches.
More formally:
This section is almost exactly the same as in the regression blog post and can safely be skipped. You might want to consider figure 6.
In the previous section, I elegantly avoided the question of how to select the maximal depth of a decision tree.
If we constrain the maximal depth a lot, our tree will only be able to divide the input space into a few subspaces and might not be able to describe the output very precisely (in figure 4, for example, we can only split the population into 6 different groups).
On the other hand, if we allow our tree to grow very deep – potentially splitting the input space into as many subspaces as there are training data points – we can obtain 100% accuracy on the training data. But if we do this, our model will be very prone to what is known as overfitting. Simply put, a model is overfitting if it performs very well on the training data, but its predictions don’t generalize well to new data (which is what we’re really interested in). Overfitting usually occurs when we train a very complex model (i.e., one with many parameters, such as a deep decision tree) on a limited amount of training data; the model can simply ‘memorize’ the training examples rather than find the true relationship between the influencer variables and the target value, as illustrated in figure 6.
Figure 6. Attempting to fit decision trees of varying complexity to a simple classification problem with just one influencer variable, x. The blue curve describes the true probability that a certain object is category 1 given a known x-value. As such, the graph of our decision tree should match it as closely as possible. The red and blue dots are the training examples. It is clear that the red curve – made by allowing the decision tree to split the input range 20 times – fits the training data very well but won’t generalize as well to new data as the simpler models.
To avoid overfitting, while still being able to describe the potentially complex relationships in the data, SAC Smart Predict makes use of an algorithm known as Gradient Boosting. The idea is to use the combined outputs from an ensemble of shallow decision trees to make our predictions –individually, they are so simple that they couldn’t possibly overfit, yet when their efforts combine, they are able to describe complex relationships.
Figure 7. Combining decision trees to avoid overfitting
It starts with the simplest imaginable decision tree – which simply outputs a constant value regardless of its input. We then calculate the error committed by this first tree in every training example, and train a new, small decision tree to minimize the error when it is added to the first. This process is repeated until the error becomes sufficiently small or a maximum number of trees is reached.
As nice as it is to know that a certain customer is 75% likely to be marketing-receptive, there are many applications in which we will need to perform a final clear-cut classification of our objects.
This is a relatively simple task once the probability function, f, has been determined. We simply choose some threshold and classify a new object, xnew, as being category 1 if f(xnew) is above this threshold. In figure 1(b), I used 50% as the threshold to draw the decision boundary. This is of course a natural choice, but in some cases, we might prefer different thresholds. If, fx., we have 20000 potential customers and the capacity to market towards 2000 of them, we would just choose the 2000 that were most likely to be marketing-receptive (which could just as well correspond to a threshold at 25% or 75%). This form of thresholding is quite common, and in the Smart Predict interface, it is controlled by the ‘Contacted population’ slider.
As I have attempted to illustrate, the key ideas needed to understand classification are almost exactly the same as the ones needed to understand regression. In a sense, for a reader who understands regression, this entire blog post can be summarized by figure 5. Everything else is just technicalities!
If you learned something new from this blog post, we would appreciate it if you would give it a like. You should always feel free to ask me and my SAC Team colleagues any questions that you may have in the comments.
See also:
Written by the Innologic SAC Team
Martin Bregninge, Andreas Kroon and Rasmus Ebbesen.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
7 | |
5 | |
5 | |
5 | |
5 | |
5 | |
4 | |
4 | |
4 | |
3 |