I have lately been working a lot with SAP HANA's Predictive Analysis Library (PAL) for different requirements, and I can say that it's a very good package, but only if you know what to use and what not to use.

To introduce myself: I am an ABAP / CRM techie with additional experience in practical machine learning, which I have applied to problems such as anomaly detection, prediction and classification on platforms like Octave, Python and, recently, SAP.


The problem here is a very common one. Suppose you have to forecast a company's sales with respect to the number of leads it has. This is a regression problem, and if you know a little PAL, then polynomial regression, logarithmic regression and so on will start hovering in your mind. You might even jump straight to a PAL function (e.g. POLYREG), train a few models and pick the one with the R2 score closest to 1. But wait, and read on first!

 Problem statement:

Without a solid grounding in data science and machine learning concepts, practitioners like us tend to fall into the trap of the R2 score, because it appears to be a simple statistic for checking whether a model is good or not.

 R2 score:

The R2 score shows how well the model fits your training data. However, there is a difference between fitting and optimal fitting. When it comes to a model's predictive power, an R2 score computed on the training data is not a reliable guide: it measures how well the model fits the data it was trained on and says nothing about how well it predicts data it has not seen.

A very high R2 score often signals a high chance of "high variance" (overfitting). In my experience there have been cases where a model with an R2 score of, say, 0.983 fits far more optimally than models with R2 scores of 0.99 or 0.992. I have seen many people talk about achieving a high R2 score, as close to R2 = 1 as possible. It is not surprising, though, that many of us are not aware of the bias-variance problem that a very high or a very low R2 score can bring.
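To make the point about training R2 concrete, here is a minimal sketch in Python with NumPy. It is not PAL code: the data is synthetic (a made-up linear leads-vs-sales relation with noise), and the names `r2_score` and `train_and_holdout_r2` are hypothetical helpers written for this illustration. It fits polynomials of increasing degree and reports R2 on the points used for training and on points that were held back.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def train_and_holdout_r2(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial on the training points only, then score both sets."""
    coeffs = np.polyfit(x_train, y_train, degree)
    r2_train = r2_score(y_train, np.polyval(coeffs, x_train))
    r2_test = r2_score(y_test, np.polyval(coeffs, x_test))
    return r2_train, r2_test


# Synthetic data (an assumption for illustration): sales grow linearly
# with leads, plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 3.0 * x + 2.0 + rng.normal(scale=0.3, size=x.size)

# Every second point is held back and never seen during training.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

results = {}
for degree in (1, 3, 6):
    results[degree] = train_and_holdout_r2(x_train, y_train, x_test, y_test, degree)
    print(f"degree {degree}: train R2 = {results[degree][0]:.4f}, "
          f"held-out R2 = {results[degree][1]:.4f}")
```

For least-squares polynomial fits the training R2 cannot go down as the degree goes up, because each higher-degree model contains the lower-degree ones. The held-out R2 carries no such guarantee, which is why the training score alone cannot tell you which model predicts best.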


Bias Vs Variance trade-off:

So what do the terms high variance and high bias mean?


High bias: The model fits so badly that it does not follow the input data, and the curve is more like an ill-fitting line (underfitting). The curve is too blunt, and predictions made with it will be consistently off, on the training data as well as on new data.

High variance: The model fits the input data so closely that the curve looks unrealistic: a good fit, but not an optimal one (overfitting). The curve is too sharp, and predictions made with it will have unrealistic variance on new data.

Just optimal: A model that fits the input data optimally while the curve still looks realistic. The curve is smooth, and predictions made with it will be close to the real values.

So how do I choose the most "optimal" model?

Now the question is how to decide on the best model. To select it we will follow the k-fold cross-validation procedure. The steps are simple:

1. Divide your training set into n bins of data. n can be a value from 2 to 10 and defines how many times we run training and prediction on the same set. I usually prefer 4.

2. For each m from 1 to n, mark the m-th bin as the cross-validation bin and all other n-1 bins as training bins.

3. Train your regression model only on the training bins.

4. Predict the output on the cross-validation bin.

5. Calculate the error, since this is how you choose the best model:

  1. Calculate the set error as the average of [{ output of step 4 - Y value from the cross-validation bin } ** 2 ], i.e. the mean squared error on the cross-validation bin.

6. Repeat steps 2 to 5 for each of the n bins.

7. Repeat steps 1 to 6 for each candidate model, average the errors calculated in step 5.1 over that model's runs, and pick the model with the lowest average error.
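The steps above can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions, not PAL code. The candidate models are plain polynomial fits of different degrees, the data is synthetic, and `k_fold_mse` and `select_degree` are hypothetical names. Shuffling the rows before binning is my own addition so that the bins are not ordered by x. The same seed is reused for every candidate, so all models are compared on identical bins.

```python
import numpy as np


def k_fold_mse(x, y, degree, n_bins=4, seed=0):
    """Average cross-validation error of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))       # shuffle once (assumption, see lead-in)
    bins = np.array_split(indices, n_bins)  # step 1: n roughly equal bins
    errors = []
    for m in range(n_bins):                 # steps 2 and 6: each bin takes a turn
        cv_idx = bins[m]
        train_idx = np.concatenate([bins[j] for j in range(n_bins) if j != m])
        # Step 3: train only on the training bins.
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        # Step 4: predict on the cross-validation bin.
        predicted = np.polyval(coeffs, x[cv_idx])
        # Step 5.1: mean squared error on the cross-validation bin.
        errors.append(np.mean((predicted - y[cv_idx]) ** 2))
    return float(np.mean(errors))


def select_degree(x, y, degrees, n_bins=4):
    """Step 7: score every candidate and pick the lowest average error."""
    scores = {d: k_fold_mse(x, y, d, n_bins) for d in degrees}
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic leads-vs-sales data (an assumption for illustration).
rng = np.random.default_rng(1)
leads = np.linspace(0.0, 1.0, 40)
sales = 2.0 * leads ** 2 + 1.0 + rng.normal(scale=0.1, size=leads.size)

best_degree, scores = select_degree(leads, sales, degrees=(1, 2, 3, 5))
for degree, error in scores.items():
    print(f"degree {degree}: average CV error = {error:.5f}")
print("selected degree:", best_degree)
```

With a real PAL model you would replace the `np.polyfit` / `np.polyval` pair with the corresponding training and prediction calls. The binning and averaging logic stays the same.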

This is standard machine learning practice, and I always follow it to pick the best model rather than relying on R2 scores.