Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
MHundekari
Associate

This blog post is aimed at business users and professionals interested in SAP Analytics Cloud's capabilities with a general interest in or basic knowledge of machine learning. The focus of this blog is on providing practical insights and comparisons, giving an overview without getting into complex details.

 

Let’s start with an analogy: Imagine you are a highly motivated person who just wants to start their fitness journey – but you are not sure how to start. You have two options:

  • Start your fitness journey with the help of a personal trainer, or
  • Go to the gym and try out the equipment on your own.

Ultimately, your satisfaction depends on whether you achieve your fitness goal and how quickly you see results.

This analogy transfers easily to training a classification model with different methods. Think of SAP Analytics Cloud (SAC) as your personal trainer: It has built-in knowledge of classification models with optimized training processes. As your personal ML trainer, it can suggest the best ways to train effectively to achieve your desired results. The alternative is to train a classification model yourself, for example by using open-source libraries such as XGBoost (1). This requires more effort on your part but gives you the flexibility to optimize your training as you like.

Why evaluate at all?

Both approaches have different costs and results that need to be evaluated. As in our analogy, a personal trainer might initially cost more than training on your own but might lead you to faster and easier results.

Customers often want to know why they should invest in SAP Analytics Cloud for Business Intelligence, Predictive Scenarios and other Planning capabilities. They also want to know what the benefits are of using SAC Predictive Scenarios versus implementing a classification model using open-source technology. It is therefore important to demonstrate SAC’s unique selling point to attract new customers. From a technical point of view, the main reason for evaluating SAC is to analyze its performance and check whether its classification approach can be considered state of the art.

Before comparing both approaches and deciding which leads to the best and fastest results, let’s first cover some basics about classification models.

Fundamentals

This part covers the fundamentals and explains the current state of research on classification models. Feel free to skip it if you are already familiar with the topic.

What is the current state of the art?

Before benchmarking SAC, we need to find out which classification model is the current state of the art for our use case. Since we deal with tabular data at SAP, we need to find the state-of-the-art model for this type of data. The type of data is crucial, as each data structure comes with its own characteristics and unique challenges. Research has shown that gradient-boosted decision trees, as implemented in XGBoost, are the current state of the art for tabular data (2). Tree-based models are the preferred option because they perform well on tabular data and because their predictions are transparent and therefore easy to explain. This is particularly important in a business context, where decisions or results based on predictions by ML models must be as transparent as possible.

What is the model behind SAC’s Smart Predict?

SAC’s classification model in Smart Predict is based on decision trees using Gradient Boosting. Previous articles have explained the technical functionality of the model in detail and are referenced below for further information (3).

Are Deep Learning Models not a one-size-fits-all solution?

The current hype around Deep Learning Models, Neural Network Architectures and Generative AI suggests that these types of models can be used as a one-size-fits-all solution. Deep Learning Models have shown excellent performance on unstructured data, such as images, audio or text, and are therefore considered state of the art for this type of data. Tabular data is different: It is structured and contains inherent relationships and dependencies, and it also includes mixed data types, null values and outliers. In addition, business rules are often applied to these tables, and the model needs to learn them in order to make predictions. All of this makes it more difficult for a model to extract business meaning from tabular data (4). Current research has shown that Deep Learning Models perform worse than tree-based models on structured tabular data and are therefore not considered state of the art for it (5). Nevertheless, there is a lot of research being conducted in this area, as researchers explore alternative architectures, frameworks and tuning techniques for Deep Learning Models.

Evaluating SAC

Having covered the basics, we know that tree-based models are the current state of the art for classification on tabular data and that the Smart Predict classification model is built on them as well.

The next step is to train two classification models:

  • Train a model in SAC and
  • Train another one using Gradient Boosting implemented with XGBoost.

For a reasonable evaluation, we need a suitable tabular dataset that has been cleaned of data impurities. Benchmark datasets exist for different types of use cases, such as MNIST for handwritten digit recognition or ImageNet for image classification. My results are based on a benchmark dataset, which is linked below (6). We then need to define metrics to compare the results of the two implementation techniques. I used common performance metrics such as accuracy, F1-Score and area under the receiver operating characteristic curve (ROC-AUC), and user metrics such as ease of use, task success and speed. The user metrics will help us to evaluate the usability of both implementation techniques.
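To make the three performance metrics more concrete, here is a minimal, dependency-free Python sketch of how they can be computed for a binary classifier. The labels, scores and the 0.5 decision threshold are made-up illustration values, not data from the evaluation; in practice you would more likely use the equivalent functions from a library such as scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored higher than a random
    negative (ties count as half). Equivalent to the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustration values only (assumption): true labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.65, 0.4, 0.55, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print("accuracy:", accuracy(y_true, y_pred))
print("F1-Score:", f1_score(y_true, y_pred))
print("ROC-AUC: ", roc_auc(y_true, scores))
```

Accuracy and F1-Score depend on the chosen threshold, whereas ROC-AUC is computed from the raw scores and therefore measures how well the model ranks positives above negatives.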


Implementation in Smart Predict

To train a classification model in Smart Predict, we first need to prepare the training and test data. In “Datasets”, I uploaded my dataset; datasets can come from a CSV or Excel file or from any other source available in SAC. Ideally, you split your data into training and test data beforehand and upload them as two separate datasets.
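If your data is in a single CSV file, the split can be done before uploading. The following is a minimal sketch using only the Python standard library; the function name `split_csv`, the file names and the 80/20 ratio are assumptions for illustration and not part of SAC or of the original evaluation.

```python
import csv
import random


def split_csv(source_path, train_path, test_path, test_fraction=0.2, seed=42):
    """Shuffle the rows of a CSV file and write them to two files,
    each with the original header. Returns (train_rows, test_rows)."""
    with open(source_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)

    n_test = int(round(len(rows) * test_fraction))
    test_rows, train_rows = rows[:n_test], rows[n_test:]

    for path, subset in ((train_path, train_rows), (test_path, test_rows)):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)

    return len(train_rows), len(test_rows)


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own.
    print(split_csv("data.csv", "train.csv", "test.csv"))
```

This sketch splits purely at random. If the target classes are strongly imbalanced, a stratified split (for example with scikit-learn's `train_test_split` and its `stratify` parameter) may be the better choice.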


After uploading, I checked that the data had been loaded correctly and that each attribute was assigned the correct data type, and made changes where needed. “Dataset Overview” allows you to modify any attribute of the dataset, and you can drag and drop to classify attributes as “Measures” or “Dimensions”.


To build the model, I created a classification model as a new Predictive Scenario by clicking on “Classification” and selecting my modified training dataset as “Training Data Source”. Before training the model with the “Train” button, I set a prediction goal, which must be a Boolean target variable. This can be thought of as telling my personal trainer exactly what my fitness goal is. Optionally, we can exclude certain attributes as influencer variables. For example, we can decide whether our goal is to train certain parts of the body or not. This step won’t be necessary for the evaluation.


After clicking on “Train”, the model takes a few seconds to train, and SAC then displays an overview page with Global Performance Indicators, Target Statistics, and Influencer Contributions. These give more insight into the model’s performance and help explain its predictions.


 

Implementation with XGBoost

Training a model in Python using the open-source library XGBoost is not as straightforward as training a classification model in SAC. Returning to our gym analogy, when training on our own, we have to start from scratch. The data must be pre-processed manually and split into training and test datasets using libraries such as scikit-learn. Although training the model itself does not take much time and can be done by following XGBoost’s extensive documentation, there are some caveats. The first trained model is likely to overfit the training data, which makes optimizing it inevitable. This optimization step is likely to take more time than the training itself and requires knowing which optimization technique to choose. We can compare this to a beginner going to the gym and making mistakes in the execution of certain exercises: To correct the mistakes and reduce the risk of injury, we need to spend proportionally more time on correcting and optimizing. My results on the benchmark dataset showed that the choice of optimization technique can be the most influential factor in how much time this step takes.

Evaluation results

In terms of performance, Smart Predict and the best implementation of XGBoost delivered almost identical results. A small performance gain can be expected from the XGBoost model, but achieving it requires the expertise of a data scientist.

On the one hand, Smart Predict offers greater convenience by minimizing the manual effort required to train the model. Many steps are automated and optimization is almost instantaneous in the initial training phase. It doesn’t take long from importing the data to having a trained and optimized model. Smart Predict is also user-friendly for business users, so model training is not limited to people with programming skills or a background in data science. It also offers effortless integration, by allowing predictions to be consumed through stories used in SAC.

On the other hand, XGBoost requires programming skills and a deeper understanding of data science to use the library correctly, implement a classification model, and interpret performance metrics in order to avoid overfitting.

Summary

In this blog post, we've explored the classification capabilities of SAP Analytics Cloud using Smart Predict and compared it to open-source alternatives such as XGBoost.

The key takeaways from this evaluation are:

  1. Comparable performance: SAP Analytics Cloud's Smart Predict and well-optimized XGBoost implementations deliver comparable results in terms of model performance metrics. Both techniques use gradient-boosted decision trees as their core algorithm, which are considered state of the art for classification tasks on tabular data. This indicates that the Smart Predict classification model is indeed state of the art.
  2. User-friendly interface: SAC provides a more intuitive and automated approach to model training. This makes it particularly accessible to business users who may not have extensive programming or data science expertise.
  3. Efficiency and integration: SAC streamlines the process from data import to model deployment, reducing the time and effort required to develop and deploy predictive models. SAC’s seamless integration enables easy consumption of predictions through its built-in story feature.
  4. Expertise requirements: While XGBoost can offer marginal performance gains in some scenarios, it requires significant programming skills to implement and optimize effectively.
  5. Contextual considerations: The choice between SAC and open-source alternatives depends on several factors, including specific use cases, company size, availability of resources, and in-house expertise.

The evaluation showed that SAC's classification capabilities are on par with state-of-the-art open-source solutions like XGBoost while offering better usability and integration for business users. This makes SAC a compelling option for implementing powerful predictive analytics without the need for extensive data science resources.

However, it's important to note that the choice between SAC and open-source alternatives will vary based on each use case’s unique circumstances and requirements. Just as a personal trainer can guide you to fitness success more efficiently, SAC can lead you to data-driven insights with less effort and greater accessibility. Ultimately, the choice depends on your “fitness goals” and resources.

Thank you for reading and feel free to share your experience or reach out for any further thoughts!

Related links

  1. XGBoost Documentation: https://xgboost.readthedocs.io/en/stable/
    Paper explaining XGBoost: https://dl.acm.org/doi/10.1145/2939672.2939785
  2. Why do tree-based models still outperform deep learning on tabular data?: https://arxiv.org/abs/2207.08815
    Experimenting XGBoost Algorithm for Prediction and Classification of Different Datasets: https://www.researchgate.net/publication/318132203_Experimenting_XGBoost_Algorithm_for_Prediction_an...
  3. Understanding Classification with Smart Predict: https://community.sap.com/t5/technology-blogs-by-members/understanding-classification-with-smart-pre...
    SAC Smart Predict – What goes on under the hood: https://community.sap.com/t5/technology-blogs-by-sap/sac-smart-predict-what-goes-on-under-the-hood/b...
  4. Deep Neural Networks and Tabular Data: A Survey: https://arxiv.org/abs/2110.01889
  5. Tabular data: Deep learning is not all you need: https://www.sciencedirect.com/science/article/abs/pii/S1566253521002360?via%3Dihub
  6. Benchmark Dataset: https://huggingface.co/datasets/inria-soda/tabular-benchmark