SAP Datasphere is a comprehensive data orchestration platform with a graphical user interface that also addresses less technical business users. SAP HANA Cloud is embedded into SAP Datasphere, giving SAP Datasphere access to its multi-model capabilities, including the built-in Machine Learning.
This blog shows how to implement an SAP Business AI project with a time-series forecast using the embedded Predictive Analysis Library. Trigger a forecast from your favourite Python environment without having to extract the data from SAP HANA Cloud, thanks to our Python package hana_ml. You avoid data duplication (keeping your architecture lean) and benefit from the in-memory power.
All the code of this project can be downloaded from this repository.
Should you prefer to work with a native, stand-alone version of SAP HANA Cloud, you can follow this blog instead.
This blog was put together by @AndreasForster and @YannickSchaper
We have chosen a time-series forecasting example, as this is an extremely common requirement across the most diverse projects. Many standard applications from SAP already have time-series forecasting built in, such as SAP Analytics Cloud or SAP Integrated Business Planning. If you have a requirement that goes beyond a standard application, you can leverage the SAP Business Technology Platform (BTP), i.e. SAP Datasphere and its embedded SAP HANA Cloud. Alternatively, you can work with a stand-alone SAP HANA Cloud; in that case please see this blog.
In our case we would like to forecast how many nights foreign tourists and business travellers will spend in Switzerland in the coming months. We are interested in the total numbers, but we also want to know how many nights different nationalities will spend in the country.
Such forecasts of overnight stays can be useful for many purposes, for instance to help decide where to spend a marketing budget. They can also help the hotel industry to have the right number of staff (and languages) working through the year.
These might just be examples we invented ourselves, based on data that is publicly available (thanks to the opendata.swiss portal!), but such time-series forecasts are used extremely often in the business world. We have used time-series forecasting in many diverse projects, from forecasting the Net Working Capital for Treasury departments to solar power production at a utilities company.
All these use cases have in common that in such a BTP project an expert (you!) creates a forecast that is then shared with the business users, who might not want to know all the technical details of how the forecasts were created.
To follow this blog's instructions, you need to have access to SAP Datasphere as well as a Python environment.
The overall architecture for this implementation is centered around SAP Datasphere.
The Predictive Analysis Library currently provides a broad range of algorithms, and the list keeps growing.
Let's configure SAP Datasphere, so that we can use the embedded Machine Learning.
Step 1: Activate the Script server
The script server must be activated in SAP Datasphere's Tenant Configuration, please see this SAP Note. This should only take a few seconds to take effect. The free version of SAP Datasphere does not have sufficient virtual CPUs to activate the Script Server. Size your productive system to have at least 3 vCPUs and you can use the Machine Learning.
Step 2: Create a Database User
The Machine Learning is triggered by a Space's Database User, hence create a Database User in the Space Management (our space is called AISPACE).
You need to tick the options that enable the Machine Learning libraries (the Automated Predictive Library and the Predictive Analysis Library) as well as read and write access for the user.
Do not forget to click "Deploy" after the user has been created. Only after the deployment can you open the new user's info dialog (little "i" symbol on the right), from where you get the logon credentials. Note down the Database User Name, the host name and the password.
Make sure to keep the password somewhere safe. If you forget the password, you cannot retrieve it. You would need to request a new password instead.
Step 3: Allow IP address
Your Python environment's external IP address must be added to SAP Datasphere's allowlist; otherwise your attempt to connect from Python will be denied. In the System Configuration go to the "IP Allowlist" and add your external IP address. As a quick test you can consider allowing all IP addresses by adding the range "0.0.0.0/0" to the allowlist. Don't forget to remove this later though. Further down we show how to identify your exact external IP address; adding just that address to the allowlist is obviously more secure.
It might take a few seconds for any changes to this list to take effect.
Use a Python environment of your choice to run Jupyter Notebooks. There are many options, e.g. with SAP (Business Application Studio) or non-SAP environments. The following steps show how to run Jupyter Notebooks on your local laptop using the free environment miniconda.
Download and install miniconda. The easiest option is probably the graphical installer.
Open the "Anaconda Prompt" that was installed by minconda. On Windows just use the Operating System's "Search" box to find the prompt.
miniconda allows you to have multiple Python environments side by side. Making changes in one of these environments does not affect the others. This can be very useful to run tests in one environment without interfering with your productive code. Enter the following commands into the Anaconda Prompt, executing each line individually. These statements create a new Python environment called "sandbox", activate it, install JupyterLab, register the environment as a Jupyter kernel and start JupyterLab:
conda create --name sandbox python=3.12
conda activate sandbox
pip install jupyterlab
python -m ipykernel install --user --name=sandbox
jupyter lab
Executing these commands opens the JupyterLab environment, in which individual Jupyter Notebooks can be executed. Notice the "sandbox" icon in the Notebook section. This icon appears because of the earlier "python -m ipykernel install --user --name=sandbox" command.
Click the "sandbox" icon in the Notebook section and a Jupyter Notebooks opens up, in which you can script your Python code. At the top right you can see that the Notebook is running in the "sandbox" environment. Eventually you might have a number of Python environments, hence it is good to know for sure, in which of these environments the code is executed.
In case you haven't worked with Jupyter Notebooks yet, this introduction to JupyterLab gets you started. Once you are familiar with the Jupyter interface, install the hana_ml package, which we will use to trigger the embedded Machine Learning. Enter this code into a cell in the Notebook and execute it, for example with the "Run" icon at the top.
!pip install hana_ml
This downloads and installs the hana_ml-package with its dependencies, which can take a few seconds. Once complete, run this code in a new cell to see the version of the freshly installed package.
import hana_ml
print(hana_ml.__version__)
After a successful installation you see a version such as "2.21.24071200".
Later on we will also need the matplotlib and flask packages when charting the numbers. Run this command in a new cell.
!pip install matplotlib
And also this command in another cell.
!pip install flask
Now for the fun part: everything should be in place to upload some data and create some time-series forecasts. Remember that all code and data can be downloaded from this repository.
Begin by uploading the historic data that we use to create a time-series forecast. Go into your space in SAP Datasphere. In the Data Builder use the "Import CSV File" option to upload the file OVERNIGHTSTAYS.csv as a local table.
Keep the defaults when clicking through the steps to load the data. You will see a preview with its four columns.
Click "Deploy". Keep the default table name "OVERNIGHTSTAYS". Click "Deploy" again and the data is in a local table in SAP Datasphere.
The data is now in SAP Datasphere, but to make it visible and usable for our Database User, it needs to be provided as a view that is exposed for consumption.
For many cases this would be sufficient to use the embedded Machine Learning on the data. In our case however the column OVERNIGHTSTAYS is of type Integer64, which becomes a BIGINT in the underlying SAP HANA Cloud. Since the Machine Learning algorithm that we want to use does not support BIGINT, we change the column's type to Integer.
Now deploy the view and all is in place to use the Machine Learning.
You find this section's code in the Notebook 010 Time-series forecast.ipynb.
Begin by logging on to SAP Datasphere's embedded SAP HANA Cloud, through the hana_ml package in a Jupyter Notebook. Use the credentials of the Database User; the address is the Database User's host name.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address='YOURENDPOINT',
                                   port=443,
                                   user='AISPACE#AIUSER',
                                   password='YOURSECRETPASSWORD')
conn.connection.isconnected()
Create a hana_ml-DataFrame that points to the view that exposes the data. Here you specify the view's technical name (not the business name). This step doesn't download any data. We minimise data extraction and let SAP HANA Cloud instead do the data processing.
df_remote = conn.table('V_OVERNIGHTSTAYS', schema='AISPACE')
Usually you would first spend some time exploring data. We keep this part in the tutorial very short, but we encourage you to spend sufficient time on data exploration in your own projects.
For now retrieve a few rows of data to make sure that there are no obvious problems (e.g. empty rows). The head()-method first restricts the dataset to avoid downloading unnecessary data. The collect()-method then downloads the data into Python (as a Pandas DataFrame) to display it.
df_remote.head(5).collect()
Verify the columns' data types. The conversion of the OVERNIGHTSTAYS column to Integer was successful.
df_remote.dtypes()
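If you had skipped the type change in the view, the cast could alternatively be done on the Python side. This is just a sketch of that alternative (the tutorial itself changes the type in the view): the cast()-method adds the conversion to the generated SELECT statement, so the data still stays in SAP HANA Cloud.
# Alternative sketch: cast the BIGINT column to INTEGER in the virtual DataFrame
df_remote_int = df_remote.cast('OVERNIGHTSTAYS', 'INTEGER')
df_remote_int.dtypes()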
Get a quick feel for the content of the table. The describe()-method instructs SAP HANA Cloud to calculate for instance how many values are missing per column, or what the min/max/median values are for numerical columns.
df_remote.describe().collect()
Each of the values above was calculated in SAP HANA Cloud. Because the describe()-method was called, hana_ml created the required SELECT statement, which you can access as well. You can also execute this SQL statement in other environments to retrieve the same results.
print(df_remote.describe().select_statement)
You can also get a more graphical overview of the dataset in a canned report.
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(df_remote).build().display()
We move on to the data preparation for our use case. We want to create a single forecast of all overnight stays by month, hence we need to aggregate all detailed values by month. The hana_ml-DataFrame has a method for that aggregation. Remember that this aggregation is also virtual: essentially, it adds a GROUP BY clause to the hana_ml-DataFrame, which is a SELECT statement under the hood. Also look at a few rows of the monthly aggregates.
df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by='MONTH')
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()
Plot the monthly numbers to get a feel for how they have progressed over time.
import matplotlib.pyplot as plt
df_data = df_rem_agg.collect()
df_data.plot(x='MONTH')
plt.xticks(rotation='vertical');
Indeed, there is a nicely repeating pattern. Let's train a time-series model to capture this for a forecast. We use the AdditiveModelForecast algorithm, which is also known as Prophet.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast()
amf.fit(data=df_rem_agg)
View the trained model in another canned report.
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()
We have a trained time-series model, but we have not yet created a prediction. To create a forecast, we first need a hana_ml DataFrame that contains the dates/months we would like forecasts for. Hence find out the most recent date of the training history: 2024-05-01 (May 1st 2024).
str_lastdate = df_rem_agg.tail(1, ref_col='MONTH').collect().iloc[0,0]
str_lastdate = str(str_lastdate)[0:10]
print(str_lastdate)
Based on the last known date/month, create a new hana_ml DataFrame that lists the following 12 months.
from hana_ml.algorithms.pal.random import binomial
months_to_forecast = 12
# The binomial sampler is just a convenient way to create 12 rows in
# SAP HANA Cloud without needing a physical table (with n=1 and p=1 the
# sampled values are irrelevant, we only need the generated ID column)
df_rem_future = binomial(conn, n=1, p=1, num_random=months_to_forecast)
# Turn each row's ID into one of the 12 months following the last known date
df_rem_future = df_rem_future.select('*', (f'''ADD_MONTHS(TO_DATE ('{str_lastdate}', 'YYYY-MM-DD'), ID+1)''', 'MONTH'))
# Keep only the MONTH column plus a constant placeholder for the target
df_rem_future = df_rem_future.select('MONTH', ('0', 'TARGET'))
df_rem_future.head(20).collect()
Now apply the trained time-series model to predict the overnight stays in those 12 months.
df_rem_predicted = amf.predict(data=df_rem_future)
df_rem_predicted.head(5).collect()
The canned report for the trained model now also shows the above forecast. You can clearly see how a repeating pattern was identified and used to predict the future dates. Most overnight stays are in the summer (hiking?) but the winter also has its peaks (skiing and snowboarding!).
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()
To share the forecasts with business users it's often best to combine the training data and forecasts into a single dataset. Here we combine both parts into a single hana_ml DataFrame.
df_rem_predicted = df_rem_predicted.select('MONTH',
('NULL', 'OVERNIGHTSTAYS_SUM'),
('YHAT', 'FORECAST'),
('YHAT_LOWER', 'FORECAST_LOWER'),
('YHAT_UPPER', 'FORECAST_UPPER'))
df_rem_agg = df_rem_agg.select('*',
('NULL', 'FORECAST'),
('NULL', 'FORECAST_LOWER'),
('NULL', 'FORECAST_UPPER'))
df_rem_all = df_rem_predicted.union(df_rem_agg)
df_rem_all.sort('MONTH').tail(5).collect()
Now save the hana_ml DataFrame to a table in SAP Datasphere. We will see further below how this data can be accessed and shared by SAP Analytics Cloud.
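As a minimal sketch of that step, the save()-method of the hana_ml DataFrame persists the combined data as a table; the table name OVERNIGHTSTAYS_FORECAST is our assumption, chosen to match the dataset name used in the R chart further below.
# Persist history and forecasts as a table (table name is an assumption)
# force=True replaces the table if it already exists
df_rem_all.save('OVERNIGHTSTAYS_FORECAST', force=True)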
You find this section's code in the Notebook 020 Multiple time-series forecasts.ipynb.
In the previous example we aggregated all overnight stays into monthly values and forecasted the following 12 months. You can also create individual forecasts, e.g. for the countries from which most visitors are coming, or for the different Swiss regions.
You can find a more detailed description of creating individual forecasts for the countries with the largest visitor numbers in the "Hands-on Tutorial: Machine Learning with SAP HANA Cloud", which uses the same dataset.
But we also give a high-level overview here of the most important steps. For such individual forecasts the data needs to be aggregated by month as well as by country.
df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by=['MONTH', 'COUNTRYOFRESIDENCE'])
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()
When training the model, you can pass the country as the group_key parameter. An individual time-series model is trained for each country, so each prediction is unique to that country. A sketch of the corresponding prediction follows the training code below.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast(massive=True)
amf.fit(data=df_rem_agg, group_key='COUNTRYOFRESIDENCE')
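Prediction then works analogously to the single-series case. A minimal sketch, assuming a hypothetical DataFrame df_rem_future_countries that repeats the future months for each country (the hands-on tutorial mentioned above describes how to build it):
# Massive prediction: each country is forecasted with its own model.
# df_rem_future_countries is a hypothetical DataFrame with MONTH,
# COUNTRYOFRESIDENCE and a placeholder target column
df_rem_predicted = amf.predict(data=df_rem_future_countries, group_key='COUNTRYOFRESIDENCE')
df_rem_predicted.head(5).collect()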
Now that the forecasts are available as a table in SAP Datasphere, you can share them with the end users through SAP Analytics Cloud. Create a Fact view on top of the table with the forecasts. You find that table under Sources → Connections → AISPACE#AIUSER.
Create an Analytical Model on top of this Fact view. The four numerical columns need to be moved into the "Measures" section.
And SAP Analytics Cloud can display the data!
The standard line chart in SAP Analytics Cloud cannot display a shaded area between the upper and lower bounds of the forecast period. One way of getting such a chart is with R.
library(plotly)
df_data <- OVERNIGHTSTAYS_FORECAST
df_data$MONTH = as.Date(df_data$MONTH, "%b %d,%Y");
df_data <- df_data[order(df_data$MONTH),]
fig <- plot_ly(df_data, x=~MONTH, y=~OVERNIGHTSTAYS_SUM, mode='lines', type='scatter', name='Overnight stays', line=list(color='rgb(205, 92, 92)'))
fig <- fig %>% add_trace(y=~FORECAST_LOWER, name='Lower bound', type='scatter', mode='lines', fillcolor='rgb(255, 160, 122)', line=list(color='rgb(205, 92, 92)'))
fig <- fig %>% add_trace(y=~FORECAST_UPPER, name='Upper bound', type='scatter', mode='lines', fill='tonexty', fillcolor='rgb(255, 160, 122)', line=list(color='rgb(205, 92, 92)'))
fig <- fig %>% add_lines(x=~MONTH, y=~FORECAST, name='Forecast', line=list(color='rgb(205, 92, 92)'))
fig <- fig %>% layout(showlegend=FALSE)
fig
We are confident, that you have a better idea for a nice color scheme on that chart.
This final chapter has a few ideas on how to continue from the above.
To avoid having to use the Database User and its password in clear text, you can refer to logon credentials kept in the Secure User Store of the SAP HANA client. This command stores the credentials under the key "MYDATASPHERE". Because of the "-i" parameter you will be prompted for the password, which ensures that the password does not become visible in the prompt history.
C:\Program Files\SAP\hdbclient>hdbuserstore -i SET MYDATASPHERE "YOURSQLENDPOINT:443" AISPACE#AIUSER
You can then use this key to log on to SAP Datasphere.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(userkey='MYDATASPHERE')
conn.connection.isconnected()
You can automate the creation of new forecasts by scheduling the hana_ml logic. One option is described in the blog Scheduling Python code on Cloud Foundry.
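Just to illustrate the scheduling idea (the Cloud Foundry blog describes a different mechanism), here is a minimal local sketch with the schedule package, assuming the notebook's logic is wrapped in a hypothetical function run_forecast():
import schedule
import time

def run_forecast():
    # hypothetical wrapper around the notebook's steps:
    # connect, aggregate, fit, predict, save
    pass

# Re-create the forecasts every day at 06:00
schedule.every().day.at('06:00').do(run_forecast)
while True:
    schedule.run_pending()
    time.sleep(60)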
In the above steps we kept it simple and allowed all IP addresses to connect to SAP Datasphere. See the SAP Datasphere Administration documentation on allowing only specific IP addresses or ranges.
An easy option to determine your Python environment's external IP address is a small piece of code like:
from requests import get
ip = get('https://api.ipify.org').text
print(f'My public IP address is: {ip}')
Providing additional context can increase the forecast accuracy. One customer is using SAP's factory calendar table TFACS to flag working days in a daily forecast. For our monthly data of overnight stays in Switzerland, the exchange rates of different currencies to the Swiss Franc could for example be useful. See the holiday and exog parameters of the fit()-method to pass such context to the algorithm.
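A minimal sketch of how such context could be passed, assuming a hypothetical DataFrame df_rem_agg_with_rates in which an EXCHANGERATE column was added to the monthly training data (the future months handed to predict() would then need to carry that column as well):
# Hypothetical: train with an exogenous EXCHANGERATE column as additional context
amf = AdditiveModelForecast()
amf.fit(data=df_rem_agg_with_rates, key='MONTH', exog=['EXCHANGERATE'])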
We hope this hands-on blog post enables you to supercharge your SAP Datasphere instance with Machine Learning. Transform your SAP data into actionable insights and harness the full potential of HANA ML!
To close off this tutorial, we have a few ideas for you to continue your journey:
🔧Get hands-on: If you have just scrolled through the blog, take the time to implement the scenario yourself. You can download the notebooks and learn by doing.
📚Learn more:
💬Join the Discussion: Share your experiences in the comments and connect with us.