SAP Datasphere is a comprehensive data orchestration platform with a graphical user interface that also addresses less technical business users. SAP HANA Cloud is embedded into SAP Datasphere, giving SAP Datasphere access to its multi-model capabilities, including the built-in Machine Learning.
This blog shows how to implement an SAP Business AI project with a time-series forecast using the embedded Predictive Analysis Library. Trigger a forecast from your favourite Python environment without having to extract the data from SAP HANA Cloud, thanks to our Python package hana_ml. You avoid data duplication (keeping your architecture lean) and benefit from the in-memory power.
All the code of this project can be downloaded from this repository.
Should you prefer to work with a native, stand-alone version of SAP HANA Cloud, you can follow this blog instead.
This blog was put together by @AndreasForster and @YannickSchaper
We have chosen a time-series forecasting example, as this is an extremely common requirement across the most diverse projects. Many standard applications from SAP already have time-series forecasting built in, such as SAP Analytics Cloud or SAP Integrated Business Planning. If you have a requirement that goes beyond a standard application, you can leverage the SAP Business Technology Platform (BTP), i.e. SAP Datasphere and its embedded SAP HANA Cloud. Alternatively, you can work with a stand-alone SAP HANA Cloud; in that case please see this blog.
In our case we would like to forecast how many nights foreign tourists and business travellers will spend in Switzerland in the coming months. We are interested in the total numbers, but we also want to know how many nights different nationalities will spend in the country.
Such forecasts of overnight stays can be useful for many purposes, for instance to help decide where to spend a marketing budget. They can also help the hotel industry to have the right number of staff (and languages) working through the year.
These might just be examples we invented ourselves, based on data that is publicly available (thanks to the opendata.swiss portal!), but such time-series forecasts are used extremely often in the business world. We have used time-series forecasting in many diverse projects, from forecasting the Net Working Capital for Treasury departments to solar power production at a utilities company.
All these use cases have in common that in such a BTP project an expert (you!) creates a forecast that is then shared with the business users, who might not want to know all the technical details of how the forecasts were created.
To follow this blog's instructions, you need to have access to SAP Datasphere as well as a Python environment.
The overall architecture for this implementation is centered around SAP Datasphere.
The Predictive Analysis Library currently provides a broad range of algorithms, and the list keeps growing.
Let's configure SAP Datasphere, so that we can use the embedded Machine Learning.
Step 1: Activate the Script server
The script server must be activated in SAP Datasphere's Tenant Configuration, please see this SAP Note. This should only take a few seconds to take effect. The free version of SAP Datasphere does not have sufficient virtual CPUs to activate the Script Server. Size your productive system to have at least 3 vCPUs and you can use the Machine Learning.
Step 2: Create a Database User
The Machine Learning is triggered by a Space's Database User, hence create a Database User in the Space Management (our space is called AISPACE).
You need to tick the options that enable the Machine Learning libraries (the Automated Predictive Library and the Predictive Analysis Library) as well as read and write access for the user.
Do not forget to click "Deploy" after the user has been created. Only after the deployment can you open the new user's info dialog (little "i" symbol on the right), from where you get the logon credentials. Note down the Database User Name, the host name and the password.
Make sure to keep the password somewhere safe. If you forget the password, you cannot retrieve it. You would need to request a new password instead.
Step 3: Allow IP address
Your Python environment's external IP address must be added to SAP Datasphere's allowlist; otherwise your attempt to connect from Python will be denied. In the System Configuration go to the "IP Allowlist" and add your external IP address. As a quick test you can consider allowing all IP addresses by adding the range "0.0.0.0/0" to the allowlist. Don't forget to remove this later though. Further down we show how to identify your exact external IP address; adding just that address to the allowlist is obviously more secure.
It might take a few seconds for any changes to this list to take effect.
Use a Python environment of your choice to run Jupyter Notebooks. There are many options, e.g. with SAP (Business Application Studio) or non-SAP environments. The following steps show how to run Jupyter Notebooks on your local laptop using the free environment miniconda.
Download and install miniconda. The easiest option is probably the graphical installer.
Open the "Anaconda Prompt" that was installed by minconda. On Windows just use the Operating System's "Search" box to find the prompt.
miniconda allows you to have multiple Python environments side by side. Making changes in one of these environments does not affect the others. This can be very useful to run tests in one environment without interfering with your productive code. Enter the following commands into the Anaconda Prompt, executing each line individually. These statements create a new Python environment called "sandbox", activate it, install JupyterLab, register the environment as a Jupyter kernel and start JupyterLab:
conda create --name sandbox python=3.12
conda activate sandbox
pip install jupyterlab
python -m ipykernel install --user --name=sandbox
jupyter lab
Executing these commands opens the JupyterLab environment, in which individual Jupyter Notebooks can be executed. Notice the "sandbox" icon in the Notebook section. This icon appears because of the earlier "python -m ipykernel install --user --name=sandbox" command.
Click the "sandbox" icon in the Notebook section and a Jupyter Notebooks opens up, in which you can script your Python code. At the top right you can see that the Notebook is running in the "sandbox" environment. Eventually you might have a number of Python environments, hence it is good to know for sure, in which of these environments the code is executed.
In case you haven't worked with Jupyter Notebooks yet, this introduction to JupyterLab gets you started. Once you are familiar with the Jupyter interface, install the hana_ml package, which we will use to trigger the embedded Machine Learning. Enter this code into a cell in the Notebook and execute it, for example with the "Run" icon at the top.
!pip install hana_ml
This downloads and installs the hana_ml-package with its dependencies, which can take a few seconds. Once complete, run this code in a new cell to see the version of the freshly installed package.
import hana_ml
print(hana_ml.__version__)
After a successful installation you see a version such as "2.21.24071200".
Later on we will also need the matplotlib and flask packages when charting the numbers. Run this command in a new cell.
!pip install matplotlib
And also this command in another cell.
!pip install flask
Now for the fun part: everything should be in place to upload some data and create some time-series forecasts. Remember that all code and data can be downloaded from this repository.
Begin by uploading the historic data that we use to create a time-series forecast. Go into your space in SAP Datasphere. In the Data Builder use the "Import CSV File" option to upload the file OVERNIGHTSTAYS.csv as a local table.
Keep the defaults when clicking through the steps to load the data. You will see a preview with its four columns.
Click "Deploy". Keep the default table name "OVERNIGHTSTAYS". Click "Deploy" again and the data is in a local table in SAP Datasphere.
The data is now in SAP Datasphere, but to make it visible and usable for our Database User, it needs to be provided as a view that is exposed for consumption.
For many cases this would be sufficient to use the embedded Machine Learning on the data. In our case however the column OVERNIGHTSTAYS is of type Integer64, which becomes a BIGINT in the underlying SAP HANA Cloud. Since the Machine Learning algorithm that we want to use does not support BIGINT, we change the column's type to Integer.
Now deploy the view and all is in place to use the Machine Learning.
You find this section's code in the Notebook 010 Time-series forecast.ipynb.
Begin by logging on to SAP Datasphere's embedded SAP HANA Cloud, through the hana_ml package in a Jupyter Notebook. Use the credentials of the Database User; the address is the Database User's host name.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address='YOURENDPOINT',
                                   port=443,
                                   user='AISPACE#AIUSER',
                                   password='YOURSECRETPASSWORD')
conn.connection.isconnected()
Create a hana_ml-DataFrame that points to the view that exposes the data. Here you specify the view's technical name (not the business name). This step doesn't download any data. We minimise data extraction and let SAP HANA Cloud instead do the data processing.
df_remote = conn.table('V_OVERNIGHTSTAYS', schema='AISPACE')
Usually you would first spend some time exploring data. We keep this part in the tutorial very short, but we encourage you to spend sufficient time on data exploration in your own projects.
For now retrieve a few rows of data to make sure that there are no obvious problems (e.g. empty rows). The head()-method first restricts the dataset to avoid downloading unnecessary data. The collect()-method then downloads the data into Python (as a Pandas DataFrame) to display it.
df_remote.head(5).collect()
Verify the columns' data types. The conversion of the OVERNIGHTSTAYS column to Integer was successful.
df_remote.dtypes()
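If you had skipped the type change in the view, the cast could alternatively be done on the Python side. This is just a sketch of that alternative (the tutorial itself changes the type in the view): the cast()-method adds the conversion to the generated SELECT statement, so the data still stays in SAP HANA Cloud.
# Alternative sketch: cast the BIGINT column to INTEGER in the virtual DataFrame
df_remote_int = df_remote.cast('OVERNIGHTSTAYS', 'INTEGER')
df_remote_int.dtypes()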
Get a quick feel for the content of the table. The describe()-method instructs SAP HANA Cloud to calculate for instance how many values are missing per column, or what the min/max/median values are for numerical columns.
df_remote.describe().collect()
Each of the values above was calculated in SAP HANA Cloud. Because the describe()-method was called, hana_ml created the required SELECT statement, which you can access as well. You can also execute this SQL statement in other environments to retrieve the same results.
print(df_remote.describe().select_statement)
You can also get a more graphical overview of the dataset in a canned report.
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(df_remote).build().display()
We move on to the data preparation for our use case. We want to create a single forecast of all overnight stays by month, hence we need to aggregate all detailed values by month. The hana_ml-DataFrame has a method for that aggregation. Remember that this aggregation is also virtual: essentially, it adds a GROUP BY clause to the hana_ml-DataFrame, which is a SELECT statement under the hood. Also look at a few rows of the monthly aggregates.
df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by='MONTH')
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()
Plot the monthly numbers to get a feel for how they have progressed over time.
import matplotlib.pyplot as plt
df_data = df_rem_agg.collect()
df_data.plot(x='MONTH')
plt.xticks(rotation='vertical');
Indeed, there is a nicely repeating pattern. Let's train a time-series model to capture this for a forecast. We use the AdditiveModelForecast algorithm, which is also known as Prophet.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast()
amf.fit(data=df_rem_agg)
View the trained model in another canned report.
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()
We have a trained time-series model, but we have not yet created a prediction. To create a forecast, we first need a hana_ml DataFrame that contains the dates/months we would like forecasts for. Hence find out the most recent date of the training history: 2024-05-01 (May 1st 2024).
str_lastdate = df_rem_agg.tail(1, ref_col='MONTH').collect().iloc[0,0]
str_lastdate = str(str_lastdate)[0:10]
print(str_lastdate)
Based on the last known date/month, create a new hana_ml DataFrame that lists the following 12 months.
from hana_ml.algorithms.pal.random import binomial
months_to_forecast = 12
# The binomial sampler is just a convenient way to create 12 rows in
# SAP HANA Cloud without needing a physical table (with n=1 and p=1 the
# sampled values are irrelevant, we only need the generated ID column)
df_rem_future = binomial(conn, n=1, p=1, num_random=months_to_forecast)
# Turn each row's ID into one of the 12 months following the last known date
df_rem_future = df_rem_future.select('*', (f'''ADD_MONTHS(TO_DATE ('{str_lastdate}', 'YYYY-MM-DD'), ID+1)''', 'MONTH'))
# Keep only the MONTH column plus a constant placeholder for the target
df_rem_future = df_rem_future.select('MONTH', ('0', 'TARGET'))
df_rem_future.head(20).collect()
Now apply the trained time-series model to predict the overnight stays in those 12 months.
df_rem_predicted = amf.predict(data=df_rem_future)
df_rem_predicted.head(5).collect()
The canned report for the trained model now also shows the above forecast. You can clearly see how a repeating pattern was identified and used to predict the future dates. Most overnight stays are in the summer (hiking?) but the winter also has its peaks (skiing and snowboarding!).
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()
To share the forecasts with business users it's often best to combine the training data and forecasts into a single dataset. Here we combine both parts into a single hana_ml DataFrame.
df_rem_predicted = df_rem_predicted.select('MONTH',
('NULL', 'OVERNIGHTSTAYS_SUM'),
('YHAT', 'FORECAST'),
('YHAT_LOWER', 'FORECAST_LOWER'),
('YHAT_UPPER', 'FORECAST_UPPER'))
df_rem_agg = df_rem_agg.select('*',
('NULL', 'FORECAST'),
('NULL', 'FORECAST_LOWER'),
('NULL', 'FORECAST_UPPER'))
df_rem_all = df_rem_predicted.union(df_rem_agg)
df_rem_all.sort('MONTH').tail(5).collect()
Now save the hana_ml DataFrame to a table in SAP Datasphere. We will see further below how this data can be accessed and shared by SAP Analytics Cloud.
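As a minimal sketch of that step, the save()-method of the hana_ml DataFrame persists the combined data as a table; the table name OVERNIGHTSTAYS_FORECAST is our assumption, chosen to match the dataset name used in the R chart further below.
# Persist history and forecasts as a table (table name is an assumption)
# force=True replaces the table if it already exists
df_rem_all.save('OVERNIGHTSTAYS_FORECAST', force=True)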
You find this section's code in the Notebook 020 Multiple time-series forecasts.ipynb.
In the previous example we aggregated all overnight stays into monthly values and forecasted the following 12 months. You can also create individual forecasts, e.g. for the countries from which most visitors are coming, or for the different Swiss regions.
You can find a more detailed description of creating individual forecasts for the countries with the largest visitor numbers in the "Hands-on Tutorial: Machine Learning with SAP HANA Cloud", which uses the same dataset.
But we also give a high-level overview here of the most important steps. For such individual forecasts the data needs to be aggregated by month as well as by country.
df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by=['MONTH', 'COUNTRYOFRESIDENCE'])
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()
When training the model, you can pass the country as the group_key parameter. An individual time-series model is trained for each country, so each prediction is unique to that country. A sketch of the corresponding prediction follows the training code below.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast(massive=True)
amf.fit(data=df_rem_agg, group_key='COUNTRYOFRESIDENCE')
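Prediction then works analogously to the single-series case. A minimal sketch, assuming a hypothetical DataFrame df_rem_future_countries that repeats the future months for each country (the hands-on tutorial mentioned above describes how to build it):
# Massive prediction: each country is forecasted with its own model.
# df_rem_future_countries is a hypothetical DataFrame with MONTH,
# COUNTRYOFRESIDENCE and a placeholder target column
df_rem_predicted = amf.predict(data=df_rem_future_countries, group_key='COUNTRYOFRESIDENCE')
df_rem_predicted.head(5).collect()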
Now that the forecasts are available as a table in SAP Datasphere, you can share them with the end users through SAP Analytics Cloud. Create a Fact view on top of the table with the forecasts. You find that table under Sources → Connections → AISPACE#AIUSER.
Create an Analytical Model on top of this Fact view. The four numerical columns need to be moved into the "Measures" section.
And SAP Analytics Cloud can display the data!
The standard line chart in SAP Analytics Cloud cannot display a shaded area between the upper and lower bounds of the forecast period. One way of getting such a chart is with R.
library(plotly)
df_data <- OVERNIGHTSTAYS_FORECAST
df_data$MONTH = as.Date(df_data$MONTH, "%b %d,%Y");
df_data <- df_data[order(df_data$MONTH),]
fig <- plot_ly(df_data, x=~MONTH, y=~OVERNIGHTSTAYS_SUM, mode='lines', type='scatter', name='Overnight stays', line=list(color='rgb(205, 92, 92)'))
fig <- fig %>% add_trace(y=~FORECAST_LOWER, name='Lower bound', type='scatter', mode='lines', fillcolor='rgb(255, 160, 122)', line=list(color='rgb(205, 92, 92)'))
fig <- fig %>% add_trace(y=~FORECAST_UPPER, name='Upper bound', type='scatter', mode='lines', fill='tonexty', fillcolor='rgb(255, 160, 122)', line=list(color='rgb(205, 92, 92)'))
fig <- fig %>% add_lines(x=~MONTH, y=~FORECAST, name='Forecast', line=list(color='rgb(205, 92, 92)'))
fig <- fig %>% layout(showlegend=FALSE)
fig
We are confident, that you have a better idea for a nice color scheme on that chart.
This final chapter has a few ideas on how to continue from the above.
To avoid having to use the Database User and its password in clear text, you can refer to logon credentials kept in the Secure User Store of the SAP HANA client. This command stores the credentials under the key "MYDATASPHERE". Because of the "-i" parameter you will be prompted for the password, which ensures that the password does not become visible in the prompt history.
C:\Program Files\SAP\hdbclient>hdbuserstore -i SET MYDATASPHERE "YOURSQLENDPOINT:443" AISPACE#AIUSER
You can then use this key to log on to SAP Datasphere.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(userkey='MYDATASPHERE')
conn.connection.isconnected()
You can automate the creation of new forecasts by scheduling the hana_ml logic. One option is described in the blog Scheduling Python code on Cloud Foundry.
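Just to illustrate the scheduling idea (the Cloud Foundry blog describes a different mechanism), here is a minimal local sketch with the schedule package, assuming the notebook's logic is wrapped in a hypothetical function run_forecast():
import schedule
import time

def run_forecast():
    # hypothetical wrapper around the notebook's steps:
    # connect, aggregate, fit, predict, save
    pass

# Re-create the forecasts every day at 06:00
schedule.every().day.at('06:00').do(run_forecast)
while True:
    schedule.run_pending()
    time.sleep(60)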
In the above steps we kept it simple and allowed all IP addresses to connect to SAP Datasphere. See the SAP Datasphere Administration documentation on allowing only specific IP addresses or ranges.
An easy option to determine your Python environment's external IP address is a small piece of code like:
from requests import get
ip = get('https://api.ipify.org').text
print(f'My public IP address is: {ip}')
Providing additional context can increase the forecast accuracy. One customer is using SAP's factory calendar table TFACS to flag working days in a daily forecast. For our monthly data of overnight stays in Switzerland, the exchange rates of different currencies to the Swiss Franc could for example be useful. See the holiday and exog parameters of the fit()-method to pass such context to the algorithm.
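A minimal sketch of how such context could be passed, assuming a hypothetical DataFrame df_rem_agg_with_rates in which an EXCHANGERATE column was added to the monthly training data (the future months handed to predict() would then need to carry that column as well):
# Hypothetical: train with an exogenous EXCHANGERATE column as additional context
amf = AdditiveModelForecast()
amf.fit(data=df_rem_agg_with_rates, key='MONTH', exog=['EXCHANGERATE'])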
We hope this hands-on blog post enables you to supercharge your SAP Datasphere instance with Machine Learning. Transform your SAP data into actionable insights and harness the full potential of HANA ML!
To close off this tutorial, we have a few ideas for you to continue your journey:
🔧Get hands-on: If you have just scrolled through the blog, take the time to implement the scenario yourself. You can download the notebooks and learn by doing.
📚Learn more:
💬Join the Discussion: Share your experiences in the comments and connect with us.