Since SAP HANA Cloud is a multi-model database management system, it can store, process and enrich different kinds of data (ie tables, geo-spatial, graph) with its in-built engines. This blog shows how to implement a SAP Business AI project with a time-series forecast by using the embedded Predictive Analysis Library.
Trigger a forecast from your favourite Python environment without having to extract the data from SAP HANA Cloud, thanks to our Python package hana_ml. Avoid data duplication (keeping your architecture lean) and benefit from the in-memory power.
The whole code of this project can be downloaded from this repository.
FYI, the new blogging platform on the SAP Community now allows for multiple authors to work on the same document. That's our first attempt, and we ourselves at least like the outcome 😀 @AndreasForster and @YannickSchaper
We have chosen to go through a time-series forecasting example, as this is an extremely common requirement for the most diverse projects. Many standard applications from SAP already have time-series forecasting built in, such as SAP Analytics Cloud or SAP Integrated Business Planning. If you have a requirement, that goes beyond a standard application, you can leverage the SAP Business Technology Platform (BTP), ie SAP HANA Cloud.
In our case now we would like to forecast how many nights foreign tourists or business travellers will spend in Switzerland in the coming months. We are interested in the total numbers, but we also want to know how many nights different nationalities will spend in the country.
Such forecasts of overnight stays can be useful for many purposes, for instance to help decide where to spend a Marketing budget. It can also help the hotel industry to have the right number of staff (and languages) working through the year.
These might just be examples we invented ourselves, based on data that is publicly available (thanks to the opendata.swiss portal!), but such time-series forecasts are used extremely often in the business world. We have used time-series forecasting in many diverse projects, from forecasting the Net Working Capital for Treasury departments to solar power production at a utilities company.
All these use cases have in common, that in such a BTP project an expert (you! 😀 ) creates a forecast that is then shared with the business users, who might not want to know all technical details of how the forecasts were created.
To follow this blog's instructions, you need to have access to:
So the overall architecture for this implementation is very straight forward.
Currently the Predictive Analysis Library provides all these algorithms, and the list keeps growing.
First things first, you need a SAP HANA Cloud system, in which to execute the Machine Learning. This could be a productive system that you already have access to. You can also use a free-tier or trial instance of SAP HANA Cloud. To allow everyone to follow the steps hands-on, we are using a free-tier option with Pay-As-You-Go account here.
If you wonder what the differences are between free-tier and trial, then this tutorial should help you out. With a paid instance of SAP HANA Cloud the setup of the Machine Learning is slightly differently, you need to activate the script server.
Remember that productive SAP Datasphere systems also have the same Machine Learning built-in, thanks to the embedded SAP HANA Cloud. We have released a separate blog for this.
So let's start working with SAP HANA Cloud as free-tier. If you haven't got a system yet, you can follow the below tutorials. Just be aware, that not all BTP components are available in all regions. Make sure to have your BTP subaccount in a region, in which the SAP HANA Cloud free-tier is available. See the Discovery Center for options. We are using "Europe (Netherlands)" on Microsoft Azure.
Once SAP HANA Cloud is running, take note of the instance's SQL endpoint, which is needed to logon to the system from Python.
Now create a Database User for our project. You can execute the code for example in the SAP HANA database explorer, as explained in: Tools to Manage and Access the SAP HANA Cloud, SAP HANA Database
Execute these commands as DBADMIN in the Database Explorer. Just replace YOURSECRETPASSWORD with literally your secret password (the one you had to enter when creating the instance). This creates a user called AIUSER, whose password doesn't expire. This user is also assigned the necessary rights to trigger the Predictive Analysis Library.
CREATE USER AIUSER PASSWORD YOURSECRETPASSWORD NO FORCE_FIRST_PASSWORD_CHANGE;
ALTER USER AIUSER DISABLE PASSWORD LIFETIME;
GRANT AFL__SYS_AFL_AFLPAL_EXECUTE TO AIUSER;
GRANT AFLPM_CREATOR_ERASER_EXECUTE to AIUSER;
Use a Python environment of your choice to be able to use Jupyter Notebooks. There are many options, ie with SAP (Business Application Studio) or non-SAP environments. The following steps show how to run Jupyter Notebooks on your local laptop by using the free environment miniconda.
Download and install miniconda. The easiest option is probably the graphical installer.
Open the "Anaconda Prompt" that was installed by minconda. On Windows just use the Operating System's "Search" box to find the prompt.
miniconda allows you to have multiple Python environments side by side. Making changes in one of these environments does not affect the others. This can be very useful to run tests in one environment without interfering with your productive code. Enter the following code into the Anaconda Prompt, execute each line individually. These statements:
conda create --name sandbox python=3.12
conda activate sandbox
pip install jupyterlab
python -m ipykernel install --user --name=sandbox
jupyter lab
Executing these commands will open the Jupyter Lab environment, in which individual Jupyter Notebooks can be executed. Notice the "sandbox" icon in the Notebook section. This icon appears because of the earlier "python -m ipykernel install --user --name=sandbox" command.
Click the "sandbox" icon in the Notebook section and a Jupyter Notebooks opens up, in which you can script your Python code. At the top right you can see that the Notebook is running in the "sandbox" environment. Eventually you might have a number of Python environments, hence it is good to know for sure, in which of these environments the code is executed.
In case you haven't worked with Jupyter Notebooks yet, this introduction to JupyterLab gets you started. Once you are familiar with the Jupyter interface, install the hana_ml Package, that we will use to trigger the embedded Machine Learning. Enter this code into a cell in the Notebook, then execute the code, for example with the "Run" icon on top.
!pip install hana_ml
This downloads and install the hana_ml-package with its dependencies, which can take a few seconds. Once complete, run this code in a new cell, to see the version of the freshly installed package.
import hana_ml
print(hana_ml.__version__)
Later on we will also need the matplotlib and flask packages when charting the numbers. Run this command in a new cell.
!pip install matplotlib
And also this command in another cell.
!pip install flask
We are ready for the fun part, everything should be ready to upload some data and create some time-series forecasts. Remember, that all code can be downloaded from this repository. The notebooks that you find there are a little more elaborate than then blog.
You find this section's code in the Notebook "010 Data upload.ipynb".
Begin by logging on to SAP HANA Cloud, through the hana_ml package in a Jupyter Notebook. Use the credentials of the AIUSER.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address='YOURENDPOINT',
port=443,
user='AIUSER',
password='YOURSECRETPASSWORD',
)
conn.connection.isconnected()
Now bring the data that we want to upload from this CSV file into a Pandas DataFrame in Python. Display a few rows of the data.
import pandas as pd
df_data= pd.read_csv('OVERNIGHTSTAYS.csv')
df_data.MONTH = pd.to_datetime(df_data.MONTH, format='%d/%m/%Y')
df_data.head(5)
The table contains of four columns
Then upload the Pandas DataFrame through the hana_ml package into a table in SAP HANA Cloud. The table and its columns are created automatically on the fly.
import hana_ml.dataframe as dataframe
df_remote = dataframe.create_dataframe_from_pandas(connection_context=conn,
pandas_df=df_data,
table_name='OVERNIGHTSTAYS',
force=True,
replace=False)
You can also see the data types of the columns that were created. the df_remote object points to the newly created table, and the dtypes()-method returns their names and column types.
df_remote.dtypes()
You find this section's code in the Notebook "020 Time-series forecast.ipynb".
Usually you would first spend some time exploring data. We keep this part in the tutorial very short, but we encourage you to spend sufficient time on data exploration in your own projects.
For now retrieve a few rows of data, to make sure that there are no obvious problems (ie empty rows). The head()-method first restricts the dataset to avoid downloading unnecessary data. The collect()-method then downloads the data into Python (as Pandas DataFrame) to display the data.
df_remote = conn.table('OVERNIGHTSTAYS')
df_remote.head(5).collect()
Get a quick feel for the content of the tables. The describe()-method instructs SAP HANA Cloud to calculate for instance how many values are missing per column, or what the min / max / median values are for numerical columns.
df_remote.describe().collect()
Each of the values above was calculated in SAP HANA Cloud. Because the describe()-method was called, hana_ml created the required SELECT statement, which you can access as well. You can execute this SQL-statement also in other environments to retrieve the results.
print(df_remote.describe().select_statement)
You can also get a more graphical overview of the dataset in a canned report.
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(df_remote).build().display()
We move to the data preparation for our use case. We want to create a single forecast of all overnight stays by month. Hence we need to aggregate all detailed values by month. The hana_ml-DataFrame has a method for that aggregation. Remember, that this aggregation is also virtual. Essentially, it is adding a GROUP-clause to the hana_ml-DataFrame, which is a SELECT statement under the hood. Also look at a few rows of the monthly aggregates.
df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by='MONTH')
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()
Plot the monthly numbers to get a feel for how they have progressed over time.
df_data = df_rem_agg.collect()
import matplotlib.pyplot as plt
df_data.plot(x='MONTH')
plt.xticks( rotation='vertical');
Indeed, there is a nicely repeating pattern. Let's train a time-series model to capture this for a forecast. We use the AdditiveModelForecast algorithm, which is also known as Prophet.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast()
amf.fit(data=df_rem_agg)
View the trained model in another canned report.
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()
We have a trained time-series model, but we have not yet created a prediction. To create a forecast, we first need to create a hana_ml DataFrame, that contains the dates/months we would like forecasts for. Hence find out the most recent date of the training history: 2024-05-01 (May 1st 2024).
str_lastdate = df_rem_agg.tail(1, ref_col='MONTH').collect().iloc[0,0]
str_lastdate = str(str_lastdate)[0:10]
print(str_lastdate)
Based on the last known above date/month, create a new hana_ml DataFrame that lists the following 12 months.
from hana_ml.algorithms.pal.random import binomial
months_to_forecast=12
df_rem_future = binomial(conn, n=1, p=1, num_random=months_to_forecast)
df_rem_future = df_rem_future.select('*', (f'''ADD_MONTHS(TO_DATE ('{str_lastdate}', 'YYYY-MM-DD'), ID+1)''', 'MONTH') )
df_rem_future = df_rem_future.select('MONTH', ('0', 'TARGET'))
df_rem_future.head(20).collect()
Now apply the trained time-series model to predict the overnight stays in those 12 months.
df_rem_predicted = amf.predict(data=df_rem_future)
df_rem_predicted.head(5).collect()
The canned report for the trained model now also shows the above forecast. You can clearly see how a repeating pattern was identified and used to predict the future dates. Most overnight stays are in the summer (hiking?) but the winter also has its peaks (skiing and snowboarding!).
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()
To share the forecasts with business users it's often best to combine the training data and forecasts into a single dataset. Here we combine both parts into a single hana_ml DataFrame.
df_rem_predicted = df_rem_predicted.select('MONTH',
('NULL', 'OVERNIGHTSTAYS_SUM'),
('YHAT', 'FORECAST'),
('YHAT_LOWER', 'FORECAST_LOWER'),
('YHAT_UPPER', 'FORECAST_UPPER'))
df_rem_agg = df_rem_agg.select('*',
('NULL', 'FORECAST'),
('NULL', 'FORECAST_LOWER'),
('NULL', 'FORECAST_UPPER'))
df_rem_all = df_rem_predicted.union(df_rem_agg)
df_rem_all.sort('MONTH').tail(5).collect()
Now save the hana_ml DataFrame to a table in SAP HANA Cloud. SAP Analytics Cloud for instance can now access the data.
df_rem_all.save('OVERNIGHTSTAYS_FORECAST_TOTAL', force=True)
You have created a time-series forecast!
You find this section's code in the Notebook "30 Multiple time-series forecast.ipynb".
In the previous example we aggregated all overnights stays into monthly values and forecasted the following 12 months. Now we want to create individual forecasts for the countries from which most visitors are coming. Hence let's find the top 10 countries with the most overnight stays.
df_remote = conn.table('OVERNIGHTSTAYS')
df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE' ).sort('TOTAL', desc=True).head(10).collect()
We want to use that list of countries in a filter on the whole table. One way of achieving this, is to create a comma separated list of values for a where clause.
countries = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE') \
.sort('TOTAL', desc=True).select('COUNTRYOFRESIDENCE').head(10).collect() \
.iloc[:,0].tolist()
countries = str(countries)
countries = countries.replace('[', '(')
countries = countries.replace(']', ')')
df_remote = df_remote.filter(f'''"COUNTRYOFRESIDENCE" IN {countries}''')
The hana_ml DataFrame now only holds information about those 10 countries. There are many options to filter the hana_ml DataFrame on such a Top N criteria, in the section "Going further" you see an alternative, which uses an inner join to avoid having to extract the list of values.
Now we aggregate the data again, but this time by both month and country, so that we can create country-specific forecasts.
df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by=['MONTH', 'COUNTRYOFRESIDENCE'])
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()
Let's see the historic values of the 10 countries.
df_data_plot = df_rem_agg.pivot_table(values='OVERNIGHTSTAYS_SUM', index='MONTH', columns='COUNTRYOFRESIDENCE', aggfunc='sum').collect()
df_data_plot.plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5));
Since the plot is quite full, let's focus on the 2 most popular foreign countries, Germany and the United States.
df_data_plot[['MONTH', 'Germany', 'United States']].plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5));
There are similarities in the overnight stays from the two countries (summers are most popular). However, numbers from the United States seem to be rising, while numbers from Germany seem to be dropping. Individual forecasts for each country can capture such individual patterns.
Train individual models with the same AdditiveModelForecast algorithm as before, but now we specify the country of residence as group key, which produces individual models for each country.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast(massive=True)
amf.fit(data=df_rem_agg, group_key='COUNTRYOFRESIDENCE')
As before, we now create a hana_ml DataFrame for the prediction. This DataFrame must list all the countries as well as all the dates for each country. The list of dates/months, as done before.
from hana_ml.algorithms.pal.random import binomial
months_to_forecast=12
str_lastdate = df_rem_agg.tail(1, ref_col='MONTH').collect().iloc[0,0]
str_lastdate = str(str_lastdate)[0:10]
df_rem_future = binomial(conn, n=1, p=1, num_random=months_to_forecast)
df_rem_future = df_rem_future.select('*', (f'''ADD_MONTHS(TO_DATE ('{str_lastdate}', 'YYYY-MM-DD'), ID+1)''', 'MONTH') )
df_rem_future = df_rem_future.select('MONTH', ('0', 'TARGET'))
But now we create a cartesian product with the countries.
df_rem_topredict = df_rem_future.add_id().set_index('ID').join(df_rem_countries.add_id('ID').set_index('ID'), how='cross' )
df_rem_topredict = df_rem_topredict.drop('ID')
df_rem_topredict.head(10).collect()
Apply the individual models on its corresponding dates/months in the prediction DataFrame. Here the predict() method returns multiple objects. The forecast we need are in the first (with index 0).
df_rem_pred = amf.predict(data=df_rem_topredict, group_key='COUNTRYOFRESIDENCE')
df_rem_predicted = df_rem_pred[0]
That's all it took to mass produce individual time-series models and forecasts. To verify that the forecasts are individual, look at the data for Germany...
from hana_ml.visualizers.visualizer_base import forecast_line_plot
forecast_line_plot(pred_data=df_rem_predicted.filter(''' "GROUP_ID" = 'Germany' ''').drop('GROUP_ID').set_index('MONTH'),
confidence=("YHAT_LOWER", "YHAT_UPPER"), enable_plotly=True)
... and the United States.
from hana_ml.visualizers.visualizer_base import forecast_line_plot
forecast_line_plot(pred_data=df_rem_predicted.filter(''' "GROUP_ID" = 'United States' ''').drop('GROUP_ID').set_index('MONTH'),
confidence=("YHAT_LOWER", "YHAT_UPPER"), enable_plotly=True)
Combine the historic data with the predictions and save them to a single table.
df_rem_predicted = df_rem_predicted.select(
'MONTH',
('GROUP_ID', 'COUNTRYOFRESIDENCE'),
('NULL', 'OVERNIGHTSTAYS_SUM'),
('YHAT', 'FORECAST'),
('YHAT_LOWER', 'FORECAST_LOWER'),
('YHAT_UPPER', 'FORECAST_UPPER')
)
df_rem_agg = df_rem_agg.select('*', ('NULL', 'FORECAST'),
('NULL', 'FORECAST_LOWER'),
('NULL', 'FORECAST_UPPER')
)
df_rem_all = df_rem_predicted.union(df_rem_agg)
df_rem_all.save('OVERNIGHTSTAYS_FORECAST_COUNTRIES', force=True)
You have the data ready to be shared with business users, for instance through SAP Analytics Cloud. Of course you can also see the data here in Python. Select a country and put the data into a plot.
import matplotlib.pyplot as plt
df_data = df_rem_all.filter('''"COUNTRYOFRESIDENCE" = 'Germany' ''').collect()
fig, ax = plt.subplots()
ax.fill_between(df_data['MONTH'].values, df_data['FORECAST_LOWER'].values, df_data['FORECAST_UPPER'].values, alpha=0.2)
ax.plot(df_data['MONTH'].values, df_data['OVERNIGHTSTAYS_SUM'].values, '-')
plt.xticks(rotation=45);
Now you are the expert in massing producing time-series forecasts with SAP HANA Cloud!
This final chapter has a few ideas on how to continue from the above.
To avoid having to use the Database user and its password in clear text, you can also refer to logon credentials kept in the Secure User Store from the SAP HANA client. This command stores the credentials under the key "MYHANACLOUD". Because of the "-i" parameter you will be prompted for the password. This ensures that the password does not become visible in the prompt history.
C:\Program Files\SAP\hdbclient>hdbuserstore -i SET MYHANACLOUD "YOURSQLENDPOINT:443" AIUSER
You can use this key to logon to SAP HANA Cloud.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(userkey='MYHANACLOUD')
conn.connection.isconnected()
You can automate the creation of new forecasts by scheduling the hana_ml logic. One option is described in the blog Scheduling Python code on Cloud Foundry.
In the above steps we kept it simple and allowed all IP addresses to connect to SAP HANA Cloud. See the SAP HANA Cloud Administration Guide on allowing only specific IP addresses or ranges.
An easy option to determine your external IP address is a small piece of code like:
from requests import get
ip = get('https://api.ipify.org').text
print(f'My public IP address is: {ip}')
Providing additional context can increase the forecast accuracy. One customer is using SAP's factory calendar table TFACS to flag working days in a daily forecast. For our monthly data of overnight stays in Switzerland, the exchange rates of the different currencies to the Swiss Franc could be useful for example. See the holiday and exog parameters of the fit()-method to pass such context to the algorithm.
In the above example "Forecasting multiple time-series" we are filtering the hana_ml DataFrame on the 10 countries with the highest number of overnight stays. To make it easy to follow the logic, we implemented the filter by extracting the list of countries to essentially create an SQL WHERE clause. There are surely more elegant ways, that do not require any data extraction. One option would be an inner join between the full hana_ml DataFrame and an hana_ml DataFrame that contains the distinct list of Top N countries.
df_remote = conn.table('OVERNIGHTSTAYS')
df_rem_top10 = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE' ).sort('TOTAL', desc=True).head(10)
df_remote = df_remote.set_index('COUNTRYOFRESIDENCE').join(df_rem_top10.set_index('COUNTRYOFRESIDENCE'), how='inner')
We hope this hands on blog post enables you to supercharge your SAP HANA Cloud with Machine Learning. Transform your SAP data into actionable insights and harness the full potential of HANA ML!
To close off this tutorial, we have a few ideas for you to continue your journey:
📥Get hands-on: If you have scrolled through the blog, take the time to implement the scenario yourself. You can download the notebooks and learn by doing.
📚Learn more:
🔗Join the Discussion: Share your experiences in the comments and connect with us.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
4 | |
1 | |
1 | |
1 | |
1 | |
1 |