This blog post describes how SAP Datasphere can provide a seamless data science experience by facilitating the training of machine learning (ML) models on different platforms (e.g. using hana_ml in SAP HANA and FedML on hyperscaler landscapes). It also shows how those ML models can work hand in hand to provide data-driven insights to business users without expensive data replication and while fully preserving the business context of the data. To illustrate the point, a real-life use case is reviewed and the selection of an ML runtime is discussed in terms of data gravity, availability of the required ML tools on the platform, and business criticality of the data. The objective is to provide a high-level concept and a set of considerations for data scientists and architects working on similar multi-cloud cases.
Background
Many platforms offer a variety of excellent ML tools and technologies to data scientists. Those tools usually perform best when the data being processed resides on the same platform. In most cases, though, organizations need to perform analytics on data distributed across multiple landscapes. To do so, they often copy and replicate the data to a single location on one of those platforms. Although expensive and time-consuming, this can be an acceptable approach in some cases, but it quickly becomes more complicated for business data. On the one hand, this data is business critical, so extracting and copying it poses a risk in itself. On the other hand, moving it away from the source system (e.g. SAP S/4HANA) strips it of its business context and semantics, and there is no guarantee that it will be up to date at the moment of consumption.
SAP Datasphere helps overcome those challenges by allowing users to connect and manage all their data in real time, across different systems and applications. For ML, SAP Datasphere offers two approaches that give data scientists flexibility and help avoid data replication:
- HANA Machine Learning Library (hana_ml) - With hana_ml, data scientists do not need to extract business-relevant data from SAP systems, since the library provides a Python and an R interface to the ML libraries embedded in SAP HANA. Those libraries (PAL - Predictive Analysis Library and APL - Automated Predictive Library) offer state-of-the-art ML algorithms directly where the data is located. More information about hana_ml can be found in this blog post.
- Federated-ML Library (FedML) - This library builds on the data federation capabilities of SAP Datasphere and provides tools for data scientists to build, train and deploy machine learning models on hyperscaler platforms, while eliminating the need to migrate or replicate data out of its source. Check out this blog post for more details.
Use case description
Let's review a real-life use case, illustrated with a fictitious machinery manufacturer called Best Run GmbH. The company has decided to augment its hardware business with an offering called Equipment-as-a-Service (EaaS). EaaS allows customers of Best Run GmbH to purchase an operational KPI for a piece of equipment, guaranteed by a service-level agreement (SLA), rather than the equipment itself (e.g. X hours of fault-free operation of the equipment).
Responsibilities in the Equipment-as-a-Service (EaaS) model
As shown in the figure above, EaaS services transfer the responsibility for fault-free operation of the equipment back to the manufacturer. This adds complexity and risk for manufacturers, since they are financially liable for not fulfilling SLAs, e.g. due to unplanned maintenance events. For that reason, Best Run GmbH decides to develop and deploy ML models that forecast, in advance, the SLA compliance and fulfillment risk for each of its customers.
Data distribution
To train such machine learning models, usually both business (IT) and operations (OT) data need to be considered. These data types have different properties, as summarized below:
- Business / IT data, e.g. maintenance orders, asset master, etc.
  - business critical
  - data gravity usually in an SAP system
  - structured with complex semantics
- Operations Technology (OT) data, e.g. inspection data, sensors, etc.
  - high volume
  - lower business criticality
  - data gravity in a cheap storage
  - usually unstructured
In the figure below we see an example distribution of the data in the landscape of Best Run GmbH.
Data distribution at Best Run GmbH
The IT side, i.e. the SAP world, shows the business applications. Their data typically contains information about suppliers, recipes, maintenance activities and quality. On the right is the hyperscaler platform of Best Run GmbH, in this case Google Cloud. The company stores unstructured data from machine maintenance in the data lake (e.g. image data and inspection logs), as well as sensor and device data.
Machine Learning
Data scientists at Best Run GmbH decide to train two ML models: one to predict unplanned maintenance events and another to forecast the risk score for SLA compliance. Since data from several systems is required, they decide to use SAP Datasphere to avoid unnecessary data replication. By doing so, they minimize the risk of high cost and manual effort, and avoid inconsistency and compliance issues. Let's now review which factors need to be considered in order to pick the best approach for training each of those models.
ML Model #1: Time-to-maintenance prediction (trained in Google Vertex AI)
ML Model #1: Time-to-maintenance prediction
In order to predict SLA compliance, the likelihood of unplanned maintenance needs to be evaluated. The data scientists of Best Run GmbH use a deep learning model to forecast the time to maintenance events. After they have identified the data required for model training, they evaluate its properties:
- Most of the image and sensor data is located in the Google Storage service (data gravity aspect)
- The model uses deep learning to do the forecasting (tools availability aspect)
- Business criticality of the data is low (only sensors and images)
In addition to the OT data, the model also needs some data from an SAP system, e.g. past maintenance activities. Since the majority of the data is already on the Google platform, which offers deep learning algorithms out of the box, Best Run GmbH decides to use Vertex AI for the training. To avoid data replication and take full advantage of the federation functionality of SAP Datasphere, they select the FedML library. With it, they can trigger and run the training "on the fly", automatically federating the required SAP data into the Google service temporarily, for the duration of the training. It is important to mention that this approach is feasible if the size of the data permits such operations.
During inference, the results of the model are stored back into SAP Datasphere.
ML Model #2: SLA compliance risk score prediction (trained in SAP HANA Cloud)
ML Model #2: SLA compliance risk score prediction
Once the prediction of the time to unplanned maintenance events is available, the ML model that forecasts the SLA compliance risk score can be trained. SLAs for fault-free operation of equipment depend not only on unplanned maintenance events but also on the availability of spare parts in stock, planned production runs, product recipes in the production backlog, etc. Such information is usually scattered across several business applications in the SAP landscape. After the data scientists have identified which data is required to train the model, they again evaluate its properties, as for ML#1:
- Most of this data is located in an SAP system (data gravity aspect)
- A regression-based model for tabular data is selected (tools availability aspect: embedded in SAP HANA)
- Business criticality of the data is high (important to keep the data at source, avoid replication, single source of truth aspect, etc.)
Because of the points above, the data scientists choose hana_ml for training, since this way no data is moved and the business data remains in its original location.
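A minimal sketch of what such in-database training could look like with hana_ml is shown below. The schema, table and column names (`EAAS`, `SLA_TRAINING_VIEW`, `CONTRACT_ID`, `RISK_SCORE`) and the choice of a gradient-boosting regressor are assumptions made purely for illustration; the scenario above only states that a regression-based model for tabular data is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSpec:
    """Where the training data lives (all names are hypothetical)."""
    schema: str = "EAAS"
    table: str = "SLA_TRAINING_VIEW"
    key: str = "CONTRACT_ID"
    label: str = "RISK_SCORE"


def train_risk_model(conn, spec: TrainingSpec):
    """Fit a PAL regression model inside SAP HANA via hana_ml.

    `conn` is expected to be a hana_ml.dataframe.ConnectionContext.
    """
    # Imported lazily so the sketch can be loaded without hana_ml installed.
    from hana_ml.algorithms.pal.unified_regression import UnifiedRegression

    # conn.table() returns a hana_ml DataFrame: a reference to data in HANA,
    # not a local copy of the rows.
    train_df = conn.table(spec.table, schema=spec.schema)

    # Assumption: gradient boosting; any PAL regression function could be used.
    model = UnifiedRegression(func="HybridGradientBoostingTree")
    model.fit(data=train_df, key=spec.key, label=spec.label)
    return model
```

With a live system, `conn` would be created with `ConnectionContext(address=..., port=..., user=..., password=...)` from `hana_ml.dataframe`, using your own connection details. The training itself runs in SAP HANA, which is what keeps the business data at its source.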
The model also relies on the results of ML#1, which can now easily be combined with the business data within SAP Datasphere.
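Conceptually, this combination is a join between the stored ML#1 predictions and the business data. The sketch below illustrates the idea with Python's built-in sqlite3 module as a local stand-in. It is not SAP Datasphere code, and all table names, column names and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for the ML#1 results written back after inference (made-up values).
    CREATE TABLE maintenance_forecast (
        equipment_id TEXT PRIMARY KEY,
        days_to_maintenance REAL
    );
    -- Stand-in for business data from the SAP landscape (made-up values).
    CREATE TABLE sla_contracts (
        equipment_id TEXT PRIMARY KEY,
        customer TEXT,
        guaranteed_hours REAL
    );
    INSERT INTO maintenance_forecast VALUES ('EQ-1', 12.5), ('EQ-2', 90.0);
    INSERT INTO sla_contracts VALUES ('EQ-1', 'Customer A', 500.0),
                                     ('EQ-2', 'Customer B', 720.0),
                                     ('EQ-3', 'Customer C', 300.0);
    """
)

# LEFT JOIN keeps contracts for which no forecast exists yet (forecast is NULL).
features = conn.execute(
    """
    SELECT c.equipment_id, c.customer, c.guaranteed_hours, f.days_to_maintenance
    FROM sla_contracts AS c
    LEFT JOIN maintenance_forecast AS f ON f.equipment_id = c.equipment_id
    ORDER BY c.equipment_id
    """
).fetchall()

for row in features:
    print(row)
```

The resulting rows pair each contract with its forecast, which is the shape of input the risk score model needs: business attributes plus the output of ML#1.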
Business insights and more
Finally, based on the calculated risk score, a dashboard can be designed to inform the person responsible for the EaaS contract when the ML model predicts an increase in the SLA compliance risk score. Several further functionalities can be added to improve the manufacturer's response even more. For example, after the risk score for SLA compliance is calculated:
- A recommender model can be used to propose adequate actions (based on past resolutions applied by the technicians) to mitigate the risk and reduce contract non-compliance;
- A hyperautomation routine can autonomously trigger actions, e.g. request technicians, order spare parts, etc.
SLA compliance dashboard based on the combined result of the ML models
Summary
In this example you have learnt how the FedML and hana_ml libraries can be used to train machine learning models in multi-cloud environments without migrating or replicating data out of its original location. The modeling can be performed completely in SAP Datasphere, applying technologies and algorithms where the data gravity is concentrated, using tools where they are available, and at the same time preserving data integrity and context in line with the single-source-of-truth paradigm. This gives data scientists the flexibility to avoid being locked in to a single vendor while limiting the risk of losing the business context of the data.
When it comes to choosing an ML tool and library, you have also seen an example of which factors can be evaluated to make the decision easier.
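As a rough illustration, the three factors used in this post (data gravity, tools availability and business criticality) can be written down as a simple rule of thumb. The function below is only a sketch: the field names, the order of the rules and the inputs chosen for the two models are assumptions derived from the example, not an official decision procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRequirements:
    mostly_in_sap: bool         # data gravity: is most training data in SAP systems?
    tools_in_hana: bool         # is a suitable algorithm available in SAP HANA (PAL/APL)?
    tools_on_hyperscaler: bool  # is a suitable algorithm available on the hyperscaler?
    business_critical: bool     # should the data stay at its source?


def choose_runtime(req: ModelRequirements) -> str:
    """Return 'hana_ml' or 'FedML' based on the three factors (illustrative only)."""
    if not (req.tools_in_hana or req.tools_on_hyperscaler):
        raise ValueError("no platform offers the required ML tools")
    # Business-critical data stays at the source whenever HANA can train the model.
    if req.business_critical and req.tools_in_hana:
        return "hana_ml"
    # If only one platform offers the tools, that platform wins.
    if not req.tools_in_hana:
        return "FedML"
    if not req.tools_on_hyperscaler:
        return "hana_ml"
    # Both platforms are suitable: follow data gravity.
    return "hana_ml" if req.mostly_in_sap else "FedML"


# Inputs below are one possible reading of the two models in this post.
model_1 = ModelRequirements(mostly_in_sap=False, tools_in_hana=False,
                            tools_on_hyperscaler=True, business_critical=False)
model_2 = ModelRequirements(mostly_in_sap=True, tools_in_hana=True,
                            tools_on_hyperscaler=True, business_critical=True)

print(choose_runtime(model_1))
print(choose_runtime(model_2))
```

In practice the factors are rarely this binary, so a real evaluation would also weigh data volume, cost and compliance requirements.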
Although the example used Google Cloud, the concept is equally valid and can also be implemented on Microsoft Azure, AWS and Databricks.
Happy reading!