You’ve got to love collective nouns. A ballet of swans, a barrel of monkeys, a charm of magpies. What is a group of machine learning models? That would be (perhaps arguably) an ensemble of models: a group of algorithms that go about their business by consensus. So what’s SAP Analytics Cloud got to do with ensembles? Well, this month (Mar 2021) we replaced the machinery under the hood that powers its predictive modelling, and we’re now all about ensembles. An ensemble of trees, actually. Confused much?
Allow me to explain.
Smart Predict has machine learning models that business users can leverage and adapt to their own scenarios. Little knowledge of statistics and machine learning is required. With a good understanding of the data and the problem statement, models can be configured to predict things like employee attrition, customer churn, a patient’s medical bill amount, or the weight of input to a blast furnace. See the references for more information on how to build these models.
The model that powered these predictions came to SAP Analytics Cloud from SAP’s acquisition of KXEN, an American software company that primarily marketed predictive analytics software. KXEN’s models strengthened SAP Analytics Cloud’s augmented analytics capability and helped earn its spot in the “Visionary” quadrant of Gartner’s Magic Quadrant for BI solutions.
KXEN’s classification and regression models were based on an algorithm called Ridge Regression. This falls under a family of machine learning models called “regularisation models” that stem from ordinary least squares regression (OLS). What I describe below for both the old and new models is the broad intuition, glossing over the details of SAP’s exact implementation under the hood, which are proprietary.
In OLS, you fit your data points onto a linear equation (say y = mx + c) that best describes them. In my toy training example of 10 data points below, I have a “y” that I want predicted for every “x” I provide as input. The blue points are the true positions of these data points, while the green values would be my predictions if I model my curve as the red line. I can see that at my 8th data point the prediction is 9 units away from the truth. This is my error at the 8th data point. One way of calculating total error is to add up all my errors, but this would mean the positive and negative errors cancel each other out a little bit. To avoid this, we square the errors and add them up. Your best model is the line that gives you the least squared error. This is the intuition behind ordinary least squares regression, and you can perhaps now see how OLS gets its name. The slope m (aka the coefficient of the x variable) and the bias c are the 2 parameters that the OLS model learns from the data it was trained on.
Error or Loss Function = minimise Σ (y_truth − y_predicted)²
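To make that concrete, here is a minimal sketch of OLS on an invented toy dataset of 10 points (the numbers are made up for illustration and are not the ones in my figure), using plain NumPy to find the m and c that minimise the squared error:

```python
# Minimal OLS sketch on an invented toy dataset of 10 points (not the figure's data).
import numpy as np

x = np.arange(1, 11)                                          # 10 input values
y = 3 * x + 5 + np.random.default_rng(0).normal(0, 4, 10)     # noisy "truth"

# Fit y = m*x + c by minimising the sum of squared errors
m, c = np.polyfit(x, y, deg=1)

y_pred = m * x + c
total_squared_error = np.sum((y - y_pred) ** 2)               # the quantity OLS minimises
print(f"m = {m:.2f}, c = {c:.2f}, total squared error = {total_squared_error:.2f}")
```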
However, we find that OLS often overfits the data, which means it tries too hard to fit every data point onto its line. As a result, when it predicts on data points it hasn’t seen during model fitting, it falters. It doesn’t “generalise” enough. The concept of generalisation finds application across machine learning, all the way to the most cutting-edge models in deep learning today. Regularisation was a baby step taken to teach models to “generalise”.
Ridge Regression adds a penalty term to OLS that forces the model to keep its coefficients as small as possible. To our earlier error function we now add the square of the slope (m², i.e. the coefficient squared). Notice the λ next to this penalty term, which serves an interesting purpose. If λ = 0, the model reduces to OLS. As λ increases, the coefficients shrink towards 0 (since you are minimising this function). As λ approaches infinity, the coefficient becomes 0, and if m = 0 in y = mx + c your predictor effectively disappears from the model: you would have a model with no predictors. You are looking, then, for a middle ground. Ridge regression uses a technique called cross validation to find an optimal λ, which is that middle ground.
Error or Loss Function = minimise Σ (y_truth − y_predicted)² + λ * coefficient²
Once I know my new coefficient m and bias c, for any new data point that comes my way I will plug the value of x into my equation (mx + c) and the result y will be my prediction. Of course, in practice you will have more than one predictor. You can extend the formulation by replacing x with x1, x2, x3 and so on, and similarly replacing coefficient with coefficient1, coefficient2, coefficient3 and so on.
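As a sketch of how this looks in practice (the data and the grid of λ values below are invented for illustration, scikit-learn calls λ “alpha”, and none of this reflects Smart Predict’s internals):

```python
# Ridge regression with cross-validated λ (alpha in scikit-learn) on invented data.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                  # three predictors x1, x2, x3
y = 2 * X[:, 0] - 1 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 50)

# Cross validation searches this grid for the λ that generalises best
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X, y)
print("chosen λ:", model.alpha_)
print("coefficients:", model.coef_, "bias:", model.intercept_)

# Predicting a new data point: plug its predictors into the learned equation
x_new = np.array([[0.2, -1.0, 0.5]])
print("prediction:", model.predict(x_new))
```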
Ridge regression has been found to work very well when you have far too many predictors and too few data points (aka p > n problems). Unlike OLS, it also copes well when it is given predictors that are correlated with each other. However, the quantity you are predicting won’t always have a linear relationship with your predictors. Ridge regression is an idea that has been around in statistics since the 70s, as you can see from the research paper referenced below that first proposed it, and it has grown into a powerful model with modern advances in data crunching. Nevertheless, more powerful models have come since then and taken centre stage in machine learning as we know it today.
In 1999, Friedman proposed a minor modification to an earlier algorithm that provided a substantial improvement in model performance. The idea has since been worked on and revised by Friedman himself and many others, and led to the birth of Gradient Boosting Models. The model was found to be wildly successful in competitions hosted by the ML community on Kaggle. The idea has gone on to find applications in search engines and high energy physics, and even had a role to play in the discovery of the Higgs Boson. How the butterfly flaps its wings indeed.
At the heart of gradient boosting lies a very fundamental unit called the tree. It’s called a tree because it has branches and leaves; clearly someone got very clever and took the creative liberty of calling what you see in my image below a tree. In my toy example below, I again have 10 data points. I will rephrase my problem slightly to make it easier to understand. Let’s say, for any 2 predictors along the x and y axes, I need to predict whether the data point is of type “Blue” or type “Green”. How then can I carve up this two-dimensional space to figure out the best rule that describes this data? I build a flow chart that says: if the value of x is less than 2, call it type “Blue”, but if x >= 2, call it type “Green”. At my leaves, you notice all green and all blue. Practically, the leaves will have one class in the majority. If a new data point is presented to me, I will flow it down my tree, see whether it lands in the majority “Blue” leaf or the majority “Green” leaf, and label it accordingly. It is straightforward to extend the intuition to predictions that are numbers: the numerical prediction for each leaf will be the mean of the observations that fall there.
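Here is a minimal sketch of that flow chart as an actual tree, on invented points that mimic the split at x < 2 (just an illustration, not how Smart Predict builds its trees):

```python
# A single, one-split decision tree on an invented "Blue"/"Green" toy dataset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[1.0, 1], [1.5, 2], [0.5, 3], [1.2, 4], [1.8, 2.5],   # Blue  (x < 2)
              [3.0, 1], [2.5, 2], [4.0, 3], [3.5, 4], [2.2, 2.5]])  # Green (x >= 2)
labels = ["Blue"] * 5 + ["Green"] * 5

tree = DecisionTreeClassifier(max_depth=1).fit(X, labels)     # one split = one rule
print(export_text(tree, feature_names=["x", "y"]))

# A new point flows down the tree and takes the label of the leaf it lands in
print(tree.predict([[1.7, 3.0]]))                             # -> ['Blue']
```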
More complex trees with more complex rules can be created. However, like OLS, a single tree grown on the whole data set will overfit and try too hard to fit all data points into a rule. For each tree there are also a couple of decisions to be made, such as how deep to grow it and which predictor and threshold to split on at each branch.
This brings us to the question of how to “generalise” better, and from this a whole class of machine learning models called tree-based models burst forth. Instead of a single tree, we build a collection of trees, an ensemble of trees, that we consult for consensus. Think of this as a council of elders who have their own experiences with the information and are suggesting an outcome. The biblical proverb comes to mind: “Where there is no counsel, the people fall; But in the multitude of counsellors there is safety.” Proverbs 11:14
In the gradient boosting method, our multitude of counsellors is intentionally weakened with a parameter similar to the λ of Ridge Regression. The core intuition is that you iteratively build weak learners and then aggregate their findings to build one very strong learner. In the formulas below, T1(x), T2(x) and so on are the predictions of successive weak trees, R1, R2 and so on are the residual errors each stage leaves behind, λ is the weakening (shrinkage) factor, and ŷ is the final prediction.
ŷ = λ * T1(x) + R1
R1 = λ * T2(x) + R2
ŷ = λ * T1(x) + λ * T2(x) + R2
R2 = λ * T3(x) + R3
ŷ = λ * T1(x) + λ * T2(x) + λ * T3(x) + R3
You keep repeating this, fitting each new tree to the residual left behind by the trees before it. The process stops either when the residual has been absorbed almost entirely:

Rn-1 = λ * Tn(x)

ŷ = λ * T1(x) + λ * T2(x) + λ * T3(x) + ….. + λ * Tn(x)

or when you have busted the limit of trees you planned to build (B), but by then your RB is minuscule enough to be ignored:
RB-1 = λ * TB(x) + RB
ŷ = λ * T1(x) + λ * T2(x) + λ * T3(x) + ….. + λ * TB(x) + RB
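To see that recursion in code, here is a hand-rolled sketch that fits each shallow tree to the residual left by the ones before it and shrinks its contribution by λ (purely illustrative, on invented data; not SAP’s implementation):

```python
# Hand-rolled boosting for squared loss: ŷ = λ*T1(x) + λ*T2(x) + ... + λ*TB(x)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)                 # invented non-linear data

lam, B = 0.1, 100                                             # shrinkage λ and tree budget B
prediction = np.zeros_like(y)
residual = y.copy()

for _ in range(B):
    weak_tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # weak learner T_k
    prediction += lam * weak_tree.predict(X)                         # ŷ += λ * T_k(x)
    residual = y - prediction                                        # R_k shrinks each round

print("mean squared residual after boosting:", np.mean(residual ** 2))
```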
All the methods and variables described above are parameters of the Gradient Boosting algorithm: knobs that can be turned to different values so you may “tune” the output signal. This leads to an army of weak learners that end up predicting the final outcome with astonishing accuracy. This has been proven time and again in machine learning competitions, where Gradient Boosting models often outperform others. Today, for structured data sets, Gradient Boosting is a class of algorithms considered state of the art, the best of the best.
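In a library implementation the same knobs appear by name: n_estimators plays the role of B, learning_rate the role of λ, and max_depth controls how weak each tree is (the values below are illustrative choices of my own, not Smart Predict’s settings):

```python
# The same idea via scikit-learn, with the tuning knobs spelled out.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

gbm = GradientBoostingRegressor(n_estimators=100,             # B: how many trees to build
                                learning_rate=0.1,            # λ: how much each tree is weakened
                                max_depth=2).fit(X, y)        # how simple each tree stays
print(gbm.predict(X[:3]))
```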
And this is why SAP Analytics Cloud gave its innards a revamp. The old has been cast away, and a new guard has taken its place. This guard is an ensemble of trees, the biblical multitude of counsellors, whose counsel we believe businesses will benefit from.
So… have you counselled with our trees yet? 😉