How is it possible that a tool can create predictive models automatically? How can steps be automated that would take a highly skilled expert a lot of time and effort to carry out manually?
The Automated Analytics module in SAP Predictive Analytics enables a non-statistician to produce powerful predictive models in a short period of time. Focus has been placed on automating all steps of the predictive modeling workflow, shielding the user from statistical complexities without compromising the predictive performance. Use cases for this engine are virtually endless, from product recommendations to preventive maintenance, to name just two.
This document explains the fundamental concept behind the automated approach to the interested reader. However, the end user of this functionality does not have to read this document to use the functionality. Please also bear in mind that this document is just giving an outline of the fundamental concept behind Automated Analytics. There are many more detailed aspects that are taken care of, some of which are patented.
A big thank you goes to Lemine Abdallahi from our Product Group for his expert advice on many detailed aspects. In case you are keen to read further, I suggest the "Automated Analytics User Guides and Scenarios".
Introduction
SAP Predictive Analytics 2.x includes two different approaches to predictive modeling:
- Automated Analytics, which focusses on simplifying the creation of strong predictive models through automating all individual steps of the creation process. Due to the high degree of automation, Automated Analytics enables analysts without deep statistical education to create powerful predictions. This white paper explains the concept behind the automation in more detail. This engine came to the SAP portfolio with the acquisition of KXEN in 2013.
- Expert Analytics, which provides a graphical workbench to an expert user who wants to implement specific statistical algorithms and workflows. Expert Analytics is targeted towards Data Scientists familiar with individual statistical algorithms, their assumptions and implementations.
The methodology of Automated Analytics is heavily based on discoveries made by Vladimir Vapnik in the area of Structural Risk Minimization. Our framework, which is patented in parts, builds on this concept to produce high quality models with little effort. Many users appreciate this concept, as it can quickly provide business value. However, expert users are often interested in understanding in more detail how these models are created. This document aims to explain how Automated Analytics produces these predictive models.
Within the Automated Analytics option, a number of different predictive capabilities are available, such as:
- Classification
- Regression
- Time Series Forecasting
- Clustering
This document focusses primarily on the Classification aspect.
Automated Analytics is not a black box. It is a comprehensive concept that can be explained in detail.
Variable Creation
A predictive model is built on the concept that it understands and describes what happened in the past, so that it can use this knowledge to predict what is likely to happen in the future.
Think of a retail bank that wants to optimize its Marketing. One good option to improve the efficiency of a Marketing campaign is to understand the likelihood of a customer being interested in a certain product. Such an analysis is done with a classification model on the historic information of the existing customer base. All customer data that is available can be used, for instance demographical (age, location, marital status, ...) or behavioural (loan status, credit card usage, ...).
The so-called target variable indicates whether an individual customer did or did not purchase this product. The classification model will look for patterns that describe the customers' behaviour before that purchase. This model is then applied on the most recent data to predict other customers' interest in the same product, resulting in an individual probability by customer. Now you know which customers are most interested and you can incorporate this in your Marketing campaigns to achieve a more tailored customer communication, resulting in increased response rates and/or reduced Marketing costs.
The better the historic data describes the customers' behaviour, the better any resulting model will be. In order to describe the customer behaviour in great detail, it is helpful to create additional variables that give additional insight into the customer's activities. The Data Manager within SAP Predictive Analytics helps the user create such additional variables on the fly. These variables are created semantically, without the need to persist the results in additional database tables.
Here are some examples:
- Aggregation
Your data might hold detailed transactional data about activities in your customers' accounts. Out of this history, the Data Manager within SAP Predictive Analytics can create aggregates such as the count of the transactions or the average amount per transaction. These aggregates can be fed directly into the model, which will lead to better predictions.
- Pivoting
Pivoting builds on the above aggregation and makes it easy to graphically create a large number of more detailed aggregates. Sticking to the same example of activities in a customer account, pivoting can create individual counts of transaction types. So you can have transaction counts by cash withdrawals, credit card payments, standing orders, and so on.
- Understanding of Time
Typically it is crucial to put the historic data into a context of time. Therefore the Data Manager has a built-in concept to relate the various measures to moments or ranges in time. Again, without the need for any coding, the Modeler can create detailed variables such as:
- Count of cash withdrawals in the previous quarter.
- Count of cash withdrawals in the same quarter the year before.
- Change in cash withdrawal counts in absolute values.
- Change in cash withdrawal counts in percent.
Very quickly you can have dozens, hundreds or even thousands of columns that describe the customers' behaviour in great detail over time. With these additional columns we now have a much clearer picture of the customers' profiles, which generally results in much better predictions. We will explain later in this document (Chapter "Variable Selection") how SAP Predictive Analytics can handle such a large number of columns. The end user does not have to eliminate any columns, which could inadvertently remove valuable details. You can keep all information and let Automated Analytics remove any columns that are not relevant for the model.
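To make the three kinds of derived variables more tangible, here is a minimal sketch in plain Python. It is not how the Data Manager is implemented. The transaction log, the column names and the function names are assumptions made up for this illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: (customer_id, transaction_type, amount, booking_date)
transactions = [
    ("C1", "cash_withdrawal", 50.0, date(2015, 2, 10)),
    ("C1", "cash_withdrawal", 80.0, date(2015, 5, 3)),
    ("C1", "card_payment", 20.0, date(2015, 5, 20)),
    ("C2", "standing_order", 300.0, date(2015, 4, 1)),
    ("C2", "cash_withdrawal", 100.0, date(2015, 6, 15)),
]


def aggregate(rows):
    """Aggregation: transaction count and average amount per customer."""
    totals = defaultdict(lambda: [0, 0.0])  # customer -> [count, sum of amounts]
    for customer, _tx_type, amount, _day in rows:
        totals[customer][0] += 1
        totals[customer][1] += amount
    return {c: {"tx_count": n, "tx_avg_amount": s / n} for c, (n, s) in totals.items()}


def pivot_counts(rows):
    """Pivoting: one count column per transaction type for each customer."""
    counts = defaultdict(lambda: defaultdict(int))
    for customer, tx_type, _amount, _day in rows:
        counts[customer]["count_" + tx_type] += 1
    return {c: dict(cols) for c, cols in counts.items()}


def count_in_period(rows, tx_type, start, end):
    """Understanding of time: count one transaction type within [start, end)."""
    counts = defaultdict(int)
    for customer, t, _amount, day in rows:
        if t == tx_type and start <= day < end:
            counts[customer] += 1
    return dict(counts)


def change(previous, current):
    """Change between two period counts, in absolute values and in percent."""
    absolute = current - previous
    percent = None if previous == 0 else 100.0 * absolute / previous
    return absolute, percent
```

Calling count_in_period once for the previous quarter and once for the same quarter of the year before, and passing both counts to change, gives the four time-based variables listed above.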
The understanding of time is implemented with a concept that a model can relate to a certain point in time (called timestamp). Information from before the timestamp is used to train the model. The target variable is based on information from after the timestamp. Once the model is trained, it is applied with a more recent timestamp (or even today's datetime) to predict the target variable. This timestamp concept makes it easy to create models that can be used for many time periods without manual intervention. The "Model Monitoring" chapter further below outlines how the models are automatically monitored to verify whether their predictive capabilities are still adequate or whether the customers' behaviour has changed so much that a model needs to be retrained.
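The timestamp concept can be sketched as follows. This is an illustration under assumptions, not the product's implementation: the event log, the 90-day windows and the function name build_row are invented for this example.

```python
from datetime import date, timedelta

# Hypothetical event log: (customer_id, event, event_date)
events = [
    ("C1", "cash_withdrawal", date(2015, 1, 15)),
    ("C1", "product_purchase", date(2015, 4, 10)),
    ("C2", "cash_withdrawal", date(2015, 2, 1)),
    ("C2", "cash_withdrawal", date(2015, 3, 20)),
    ("C2", "product_purchase", date(2015, 8, 5)),
]


def build_row(customer, reference_date, history_days=90, target_days=90):
    """Predictors use events before the timestamp, the target uses events after it."""
    history_start = reference_date - timedelta(days=history_days)
    target_end = reference_date + timedelta(days=target_days)
    withdrawals = sum(
        1 for c, e, d in events
        if c == customer and e == "cash_withdrawal" and history_start <= d < reference_date
    )
    purchased = any(
        c == customer and e == "product_purchase" and reference_date <= d < target_end
        for c, e, d in events
    )
    return {"withdrawals_before": withdrawals, "target": int(purchased)}


# Training rows are built relative to a timestamp in the past.
training_rows = {c: build_row(c, date(2015, 4, 1)) for c in ("C1", "C2")}
```

To apply the model, the same predictor logic is evaluated with a more recent timestamp. The target column is then unknown and is replaced by the model's prediction.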
Training the model
Applying the model
Variable Encoding / Data Distribution
In order to automate the whole process of creating predictive models, it is important that no assumptions are made on how the data is distributed in the predictor or target variables. Most traditional predictive techniques are based on assumptions on the distribution of the data. Often they require balanced datasets for classification. This means that within the training dataset for a binary classification half the records should belong to one class and the other half to another class. In reality, however, such a distribution is the exception. Imagine a churn analysis. Most likely only a few percent of the population, or even far fewer, will be churners.
Automated Analytics is built to deal well with such unbalanced datasets. The methodology is “distribution-free”, meaning no assumptions are made on the distribution of the various variables. You do not have to put any manual effort into trying to achieve a certain distribution. Automated Analytics achieves this goal by encoding all predictor variables into smaller subsets which have similar impact on the target variable.
Variable encoding is specific to the different variable types:
- Nominal (textual) variables are encoded by grouping values with similar impact on the target variable into common bins. Less frequent values are grouped in bigger and therefore more robust categories. Examples for nominal variables are Country, Material or Product.
- Ordinal (sorted textual) variables are encoded similarly to the above nominal variables. However, the sequence is taken into account and the bins contain only consecutive values. Examples for ordinal variables are Delivery Status or Loyalty Status.
- Numerical variables are sorted and split in different units with equal record share. By default 20 units are created, each containing 5% of the available data. Consecutive units with similar impact on the target are grouped together. Within each group a separate linear regression transforms the input data into a more robust representation.
These transformations enable predictions without assumptions of the data distribution. At the same time, the model becomes more robust to better predict previously unseen rows of data.
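The following sketch illustrates the idea of the encoding for a numerical and a nominal variable. It is a simplification under assumptions. The merge rule (a fixed tolerance on the target rate), the minimum count and the label "OTHER" are invented for this example, and the per-group linear regression and the ordinal case are not shown.

```python
from collections import Counter


def equal_frequency_bins(values, targets, n_bins=20):
    """Sort by value and cut into n_bins units with (almost) equal record share."""
    pairs = sorted(zip(values, targets))
    n = len(pairs)
    bins = []
    for i in range(n_bins):
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        if chunk:
            bins.append({
                "low": chunk[0][0],
                "high": chunk[-1][0],
                "count": len(chunk),
                "positives": sum(t for _, t in chunk),
            })
    return bins


def merge_similar_bins(bins, tolerance=0.05):
    """Group consecutive units whose target rates differ by less than `tolerance`."""
    merged = [dict(bins[0])]
    for b in bins[1:]:
        last = merged[-1]
        last_rate = last["positives"] / last["count"]
        rate = b["positives"] / b["count"]
        if abs(rate - last_rate) < tolerance:
            # Similar impact on the target: extend the current group.
            last["high"] = b["high"]
            last["count"] += b["count"]
            last["positives"] += b["positives"]
        else:
            merged.append(dict(b))
    return merged


def group_rare_categories(values, min_count=2, other_label="OTHER"):
    """Nominal variables: put infrequent values into one common, more robust category."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]
```

With 100 rows and the default of 20 units, each unit holds 5% of the data. merge_similar_bins then reduces the 20 units to a much smaller number of groups.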
Screenshot for the encoding of a numerical variable in 20 bins:
Screenshot of how these 20 bins were combined into a much smaller number of groups:
Data Splitting
In the Automated Analytics methodology, the dataset used to train the model is split automatically into different parts:
- Estimation: Various models are created on this dataset.
- Validation: Each model that was created on the Estimation data is validated on this data part. The best model will be chosen based on the results of this validation. The chapter "Model Selection" further below explains how this model is selected.
- Test: The chosen model is run against this Test data, which has not yet been used, to provide performance statistics. These statistics show how the model performs on completely new data.
A number of different data splitting (also called data cutting) strategies are provided. The user can work with the default settings, which split the data into sets for Estimation and Validation. However, if desired the user can also modify this configuration, for example by adding the Test segment or by selecting how the individual rows are assigned to the different data parts.
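As an illustration of one possible cutting strategy, the sketch below assigns rows randomly to the three parts. The 60/20/20 shares, the seed and the function name are assumptions for this example and are not the product's default settings.

```python
import random


def split_rows(rows, shares=(60, 20, 20), seed=42):
    """Randomly assign rows to the Estimation, Validation and Test parts.

    `shares` are percentages and are expected to add up to 100.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n = len(shuffled)
    cut1 = n * shares[0] // 100
    cut2 = cut1 + n * shares[1] // 100
    return {
        "estimation": shuffled[:cut1],
        "validation": shuffled[cut1:cut2],
        "test": shuffled[cut2:],
    }
```

Every row ends up in exactly one part, so the Test part stays untouched until the final model has been chosen.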
Missing Values
Missing values, which you can also call empty cells, often present a challenge in conventional predictive analysis. Many algorithms cannot handle such missing values. Therefore the user in such an environment often needs to spend extra time dealing with the missing information. Typically, one can try to estimate the missing values or one can delete the whole row or even column. So either one has to invest extra effort or valuable information is lost.
The Automated Mode, however, handles missing values completely automatically.
Scenario 1:
If the Estimation data part already includes missing values, an additional category is created for these entries. This new category is now treated equally to the existing categories. This also applies for numerical variables. For each numerical column, all cells with missing data are placed into an additional group whose impact on the target is calculated just as it is for each numerical bin or category.
Scenario 2:
In case the Estimation data part is complete without any missing values, but missing cells are encountered when applying the model, these are handled as follows:
- For a continuous variable, the cell is filled with the variable's average.
- For a categorical variable, the cell is filled with the most frequent value.
Missing values are therefore handled without extra effort and without having to exclude valuable rows from the datasets.
An additional bucket is created for the missing values (see the first line labelled "KxMissing").
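Both scenarios can be sketched in a few lines. This is an illustration only. The function names are assumptions, and None stands in for an empty cell. The label "KxMissing" is the one mentioned above.

```python
from collections import Counter

MISSING = "KxMissing"  # label of the additional bucket, as mentioned in the text


def encode_with_missing_category(values):
    """Scenario 1: missing cells in the Estimation data become their own category."""
    return [MISSING if v is None else v for v in values]


def fit_fallbacks(train_numeric, train_categorical):
    """Scenario 2, training side: remember the average and the most frequent value."""
    average = sum(train_numeric) / len(train_numeric)
    most_frequent = Counter(train_categorical).most_common(1)[0][0]
    return average, most_frequent


def fill_on_apply(value, fallback):
    """Scenario 2, apply side: replace a missing cell with the remembered fallback."""
    return fallback if value is None else value
```

For example, fit_fallbacks([10.0, 20.0, 30.0], ["x", "y", "x"]) remembers 20.0 and "x", which fill_on_apply then uses for any empty cell met at apply time.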
Outliers
Outliers can come in two types, both of which are handled automatically:
- Unusual values in predictor variables. These can be extremely low or high values for numerical variables and rare values for nominal/ordinal variables.
- Unusual rows of data, which might warrant special attention.
Outliers in numerical variables are placed in the bin for the smallest or largest values of the encoded variable. Outliers in nominal/ordinal variables are placed in a common group with other infrequent values.
Unusual rows of data can be flagged by Automated Analytics for manual investigation. Simply put, a row is flagged as an outlier if the predicted value is very different from the actual value.
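A minimal sketch of both ideas follows. The exact criterion used by Automated Analytics is not described here, so the rule below (a residual more than a chosen number of standard deviations away from the mean residual) is purely an assumption, as are the function names and the default threshold.

```python
import math


def clip_to_training_range(value, lowest, highest):
    """Extreme numerical values end up in the lowest or highest bin of the encoding."""
    return max(lowest, min(highest, value))


def flag_unusual_rows(actual, predicted, threshold=2.0):
    """Return the indexes of rows whose prediction is very different from the actual value."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mean = sum(residuals) / len(residuals)
    std = math.sqrt(sum((r - mean) ** 2 for r in residuals) / len(residuals))
    if std == 0:
        return []  # all residuals identical: nothing stands out
    return [i for i, r in enumerate(residuals) if abs(r - mean) > threshold * std]
```

The flagged indexes are only candidates for manual investigation. Whether a row is really an error or a valuable exception remains a business decision.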
Model Selection / Variable Selection
Automated Analytics selects the best model based on a balance of
- Accuracy
- Robustness
- Simplicity (reduced number of variables)
The model's accuracy is described by an indicator called "Predictive Power", often abbreviated as KI. The indicator describes the proportion of information contained in the target variable that the explanatory variables are able to explain. The higher the "Predictive Power", the better. In order to increase "Predictive Power" you can try adding additional variables. The range of possible values for the "Predictive Power" is from 0 to 1.
Robustness is described by an indicator called "Prediction Confidence", often abbreviated as KR. It describes the ability of the model to achieve the same performance when it is applied to a new dataset. To increase "Prediction Confidence" you can try adding additional rows of data. The range of possible values for "Prediction Confidence" is from 0 to 1. Models are generally considered robust if the value is >= 0.95.
Simplicity is achieved by favouring models with a reduced set of variables.
To strike the right balance between accuracy, robustness and simplicity, Automated Analytics goes through an iterative process to find the most suitable model.
At first multiple Ridge Regressions are created on the whole dataset. Ridge Regressions are configured with a lambda parameter and many models are created with different lambda values. Out of those models, the one with the largest sum of "Predictive Power" and "Prediction Confidence" is selected. Now an iterative process starts which tries to find an even better model.
1) The variables with the smallest impact on the model get eliminated.
2) A new set of Ridge Regressions with different values for lambda is produced on the smaller dataset.
3) The best model is chosen again. If this model is better than the selected one from before, this model becomes the model that has to be beaten.
4) The process continues with step 1) until the sum of "Predictive Power" and "Prediction Confidence" of the best model in step 3) is smaller than before.
Eventually, the model with the highest sum for "Predictive Power" and "Prediction Confidence" becomes the chosen model, delivering the best compromise between accuracy, robustness and simplicity.
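The loop described above can be sketched in plain Python. This is a heavily simplified illustration, not the patented implementation. In particular, the quality score used here is the negative mean squared error on the Validation part, a stand-in for the sum of "Predictive Power" and "Prediction Confidence". The sketch also drops only one variable per round, measures impact by the absolute weight (which presumes comparable scales), and all function names and lambda values are assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_fit(X, y, lam):
    """Ridge weights from (X'X + lam*I) w = X'y (no intercept, for brevity)."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def quality(X, y, w):
    """Stand-in for KI + KR: negative mean squared error, higher is better."""
    errors = [sum(a * b for a, b in zip(r, w)) - t for r, t in zip(X, y)]
    return -sum(e * e for e in errors) / len(errors)


def best_ridge(est, val, cols, lambdas):
    """Fit one ridge model per lambda on `cols`; return (score, weights) of the best."""
    (X_est, y_est), (X_val, y_val) = est, val

    def keep(X):
        return [[r[c] for c in cols] for r in X]

    fits = []
    for lam in lambdas:
        w = ridge_fit(keep(X_est), y_est, lam)
        fits.append((quality(keep(X_val), y_val, w), w))
    return max(fits, key=lambda f: f[0])


def select_model(est, val, lambdas=(0.01, 0.1, 1.0, 10.0)):
    """Iteratively drop the weakest variable while the best model does not get worse."""
    cols = list(range(len(est[0][0])))
    score, w = best_ridge(est, val, cols, lambdas)
    while len(cols) > 1:
        weakest = min(range(len(cols)), key=lambda i: abs(w[i]))   # step 1
        candidate = cols[:weakest] + cols[weakest + 1:]
        cand_score, cand_w = best_ridge(est, val, candidate, lambdas)  # steps 2 and 3
        if cand_score < score:                                      # step 4: stop
            break
        cols, score, w = candidate, cand_score, cand_w
    return cols, w, score
```

Here est and val are (rows, targets) pairs for the Estimation and Validation parts. A call such as select_model(est, val) returns the indexes of the retained columns, their weights and the score of the chosen model.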
"Predictive Power" and "Prediction Confidence" of selected model:
Multi-Collinearity
Multi-collinearity is often a concern when training and applying models. Multi-collinearity occurs when two or more predictor variables are highly correlated. Think of a banking customer of whom we know the monthly salary and how much money the person is paying into a savings account each month. These two variables Salary and Savings will be highly correlated. Generally speaking, and there will be exceptions of course: the higher a person's salary, the more money will be saved.
A predictive model needs to take such a relation into account, to ensure that the information does not double-impact any prediction. The ridge regression used by the Automated Mode in SAP Predictive Analytics is robust against such multi-collinearity. Suitable weights are assigned to the variables, so that each variable contributes to the model according to its actual additional information gain. Out of two correlated variables, the more important one will have a larger impact. The second variable's impact is reduced accordingly and might not even be included in the final model at all.
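The stabilising effect of the ridge penalty on correlated predictors can be illustrated with two nearly identical variables. The data and the function name below are invented for this example, and the sketch shows the general behaviour of ridge regression rather than the exact weighting used by the product.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Solve the 2x2 ridge system (X'X + lam*I) w = X'y with Cramer's rule (no intercept)."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    b1 = sum(a * t for a, t in zip(x1, y))
    b2 = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((b1 * s22 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)


# Two almost identical predictors and a target that is roughly their sum.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.01, 1.99, 3.01, 3.99]
y = [2.11, 3.89, 6.11, 7.89]

almost_unpenalised = ridge_two_predictors(x1, x2, y, lam=1e-9)
penalised = ridge_two_predictors(x1, x2, y, lam=1.0)
```

Without a meaningful penalty the two weights come out as large values of opposite sign that nearly cancel each other. With the penalty both weights stay moderate and close to each other, so the shared information is not counted twice.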
Model Interpretation
So we now know the most important concepts of how predictive models are created automatically. But it is still very important to understand an individual model, either for your own confidence or to be able to communicate and discuss the model with colleagues. Various options are given to help the user understand the model.
Commonly used are, for instance:
- Various types of model charts, e.g. gain charts. The model is the blue line in between a random model in red and the perfect but unachievable model in green. Simply put, the closer the blue model gets to the perfect green model, the better.
- Overview of the selected variables and their contribution/weight in the model.
- An overview of each variable, how the content impacts the model. Positive influence on the target means an increased likelihood of being a target. Here the group above 58 years of age has the highest likelihood.
- Detailed statistical reports.
- Confusion Matrix, which calculates how many targets can be identified with a certain effort. Here for instance only 5% of the prospects need to be contacted to win 23% of the possible contracts.
- PowerPoint Output, the most important information to document or present.
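The statement "contact x% of the prospects to win y% of the contracts" is one point on a gain chart. A minimal sketch of that calculation, with made-up scores and targets and an assumed function name, could look like this:

```python
def cumulative_gain(scores, targets, share):
    """Fraction of all targets found in the top `share` of rows, ranked by score."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    top_n = int(round(len(ranked) * share))
    found = sum(t for _, t in ranked[:top_n])
    return found / sum(targets)


# Hypothetical model scores and actual outcomes (1 = target) for ten prospects.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
targets = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

gain_at_20_percent = cumulative_gain(scores, targets, 0.2)
```

In this toy data, contacting the top 20% of prospects reaches half of all targets. A random model would be expected to reach only about 20% with the same effort, and a perfect model would reach all four targets within the top 40%.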
Deployment
Models can be put into action in two different ways.
Persistence
The model is applied once on a given dataset and the resulting classification scores are permanently written to a database or file. These values can now be used by any other process or application. Only when the model is reapplied, new scores are calculated and persisted again. Such scoring can be applied in-database without the records having to leave the database.
This persistence approach is often used when a clear separation between applying the model and using the results is desired. If for instance you would like to give the scores to a third party, such as an external Marketing agency, persisting the results is needed.
Semantic
Alternatively, the model can be turned into source code of different programming languages. The models can then be embedded directly into databases or applications, and the scores are calculated on the fly whenever needed. Many different programming languages are supported, such as various SQL flavours, C, JavaScript, Java, Visual Basic or SAS. It is very common for instance to embed the model as a new column in a database view or stored procedure. Every time the column is used, the score is calculated taking the latest available information into account. The scores are real-time.
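To give an impression of what "turning a model into source code" means, the sketch below renders a simple linear scoring formula as a SQL statement. It is a toy example under assumptions. The code actually generated by the product also has to reproduce the variable encoding and will look different. The table name, the column names, the weights and the function name are all hypothetical.

```python
def linear_score_sql(intercept, weights, table, score_alias="score"):
    """Render a linear scoring formula as a SQL SELECT statement.

    `weights` maps trusted column names to their weights; names are not escaped.
    """
    terms = [repr(float(intercept))]
    for column, weight in weights.items():
        terms.append(f"{float(weight)!r} * {column}")
    formula = " + ".join(terms)
    return f"SELECT customer_id, {formula} AS {score_alias} FROM {table}"


# Hypothetical model with two predictors.
statement = linear_score_sql(0.1, {"age": 0.02, "tx_count": -0.5}, "customers")
print(statement)
```

Such a statement could be wrapped into a database view, so that the score is recalculated from the latest data every time the view is queried.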
A customer advisor can now benefit from this additional information directly in the Customer Relationship Management System. The predictive models control which products are suggested for the customer or how likely the customer is to leave to another bank.
Model Monitoring
Once a model has been created, it describes the data as it was available at creation time. Most models are used for longer time periods however. The bank for instance will continuously want to analyse the risk of losing individual clients. Therefore a churn model, giving the probability of a client leaving, will be in constant use. If the behaviour of the clients changes over time, the model's predictive power will decline with it. As the model was created when the behavioural patterns were different, it won't be able to predict the churn rate as accurately. Therefore the predictive capability of such models needs to be monitored over time.
That monitoring task is automated through a server-component, which automatically and regularly checks on the model's predictive power. Should the predictive capability fall below a defined threshold, the user is informed so that the model can be recalibrated. Therefore the user can maintain a larger number of models, as only models in need of readjustment require attention.
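The principle of such a check can be sketched as follows. The quality metric, the tolerated relative drop of 10% and the function names are assumptions for this illustration and do not describe the server component's actual rules.

```python
def needs_retraining(baseline_quality, recent_quality, max_relative_drop=0.1):
    """True if the quality on recent data fell too far below the quality at training time."""
    return recent_quality < baseline_quality * (1 - max_relative_drop)


def models_needing_attention(models, max_relative_drop=0.1):
    """Return the names of all monitored models that should be recalibrated.

    `models` maps a model name to a (baseline_quality, recent_quality) pair.
    """
    return sorted(
        name for name, (baseline, recent) in models.items()
        if needs_retraining(baseline, recent, max_relative_drop)
    )


# Hypothetical monitoring results for three deployed models.
monitored = {
    "churn": (0.80, 0.70),
    "cross_sell": (0.75, 0.74),
    "upsell": (0.60, 0.59),
}
to_review = models_needing_attention(monitored)
```

Only the models returned by the check require the user's attention, which is what makes it feasible to maintain a large number of models.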
Summary
Hopefully this document has been helpful to understand the supposed magic behind Automated Analytics and how strong predictive models can be produced without the user having to be a trained statistician.
It is a very comprehensive process that finds the best model for the situation. The predictive models can be mass-produced and deployed, ensuring that models are easily available where needed. Automation in combination with the ability to interpret trained models ensures that users are fully informed and in control.