Artificial Intelligence Blog Posts
AndreasForster
Product and Topic Expert

Since SAP HANA Cloud is a multi-model database management system, it can store, process and enrich different kinds of data (e.g. relational tables, geospatial data, graphs) with its built-in engines. This blog shows how to implement an SAP Business AI project with a time-series forecast by using the embedded Predictive Analysis Library.

Trigger a forecast from your favourite Python environment without having to extract the data from SAP HANA Cloud, thanks to our Python package hana_ml. You avoid data duplication (keeping your architecture lean) and benefit from the in-memory power.

The whole code of this project can be downloaded from this repository.

00 banner.png

Table of contents

  • Use case
  • Architecture / prerequisites
  • SAP HANA Cloud
  • Python / Jupyter Notebooks
  • Time-series forecasting
    • Data upload
    • Forecasting single time-series
    • Forecasting multiple time-series
  • Going further

 FYI, the new blogging platform on the SAP Community now allows for multiple authors to work on the same document. That's our first attempt, and we ourselves at least like the outcome 😀 @AndreasForster and @YannickSchaper 

 

Use case

We have chosen to go through a time-series forecasting example, as this is an extremely common requirement for the most diverse projects. Many standard applications from SAP already have time-series forecasting built in, such as SAP Analytics Cloud or SAP Integrated Business Planning. If you have a requirement that goes beyond a standard application, you can leverage the SAP Business Technology Platform (BTP), e.g. SAP HANA Cloud.

In our case we would like to forecast how many nights foreign tourists and business travellers will spend in Switzerland in the coming months. We are interested in the total numbers, but we also want to know how many nights guests of different nationalities will spend in the country.

Such forecasts of overnight stays can be useful for many purposes, for instance to help decide where to spend a marketing budget. They can also help the hotel industry have the right number of staff (and languages) working through the year.

These are just examples we invented ourselves, based on data that is publicly available (thanks to the opendata.swiss portal!), but such time-series forecasts are used extremely often in the business world. We have used time-series forecasting in many diverse projects, from forecasting the Net Working Capital for treasury departments to solar power production at a utilities company.

All these use cases have in common that in such a BTP project an expert (you! 😀) creates a forecast that is then shared with the business users, who might not want to know all the technical details of how the forecasts were created.

 

Architecture / prerequisites

To follow this blog's instructions, you need to have access to:

  • An instance of SAP HANA Cloud, either a productive, free-tier or trial system. When the blog was published in August 2024, the free trial did not have the Predictive Analysis Library built in. Hence in the next chapter we give some guidance on setting up a free-tier system using a Pay-As-You-Go account, which should allow everyone with a credit card (which, as we understand it, will not be charged) to implement the scenario without incurring any costs. In October 2024 the Predictive Analysis Library was added to the trial system, so you should also be able to follow hands-on with such an instance.
  • A Python environment. We explain how to set up miniconda, but if you already have an existing Python installation, you should be able to use that as well. Of course, you could also use the Business Application Studio on the Business Technology Platform.
  • It helps if you have a bit of Python or scripting experience. But we hope this blog can also be a good entry point for someone who is just starting out with Python.
  • Motivation to get to know SAP HANA's Predictive Analysis Library. 

So the overall architecture for this implementation is very straightforward.

architecture.png

Currently the Predictive Analysis Library provides all these algorithms, and the list keeps growing. 

PAL algos.jpg

 

SAP HANA Cloud

First things first, you need a SAP HANA Cloud system in which to execute the Machine Learning. This could be a productive system that you already have access to. You can also use a free-tier or trial instance of SAP HANA Cloud. To allow everyone to follow the steps hands-on, we are using the free-tier option with a Pay-As-You-Go account here.

If you wonder what the differences between free tier and trial are, this tutorial should help you out. With a paid instance of SAP HANA Cloud the setup of the Machine Learning is slightly different: you need to activate the script server.

Remember that productive SAP Datasphere systems also have the same Machine Learning built-in, thanks to the embedded SAP HANA Cloud. We have released a separate blog for this.

So let's start working with SAP HANA Cloud as free tier. If you don't have a system yet, you can follow the tutorials below. Just be aware that not all BTP components are available in all regions. Make sure your BTP subaccount is in a region in which the SAP HANA Cloud free tier is available. See the Discovery Center for options. We are using "Europe (Netherlands)" on Microsoft Azure.

015-redboxes create instance.jpg

Once SAP HANA Cloud is running, take note of the instance's SQL endpoint, which is needed to log on to the system from Python.

017 endpoint.jpg

 

Now create a Database User for our project. You can execute the code for example in the SAP HANA database explorer, as explained in: Tools to Manage and Access the SAP HANA Cloud, SAP HANA Database

Execute these commands as DBADMIN in the Database Explorer. Just replace YOURSECRETPASSWORD with a secret password of your choice. This creates a user called AIUSER, whose password doesn't expire. The user is also assigned the rights needed to trigger the Predictive Analysis Library.

 

CREATE USER AIUSER PASSWORD YOURSECRETPASSWORD NO FORCE_FIRST_PASSWORD_CHANGE;
ALTER USER AIUSER DISABLE PASSWORD LIFETIME;
GRANT AFL__SYS_AFL_AFLPAL_EXECUTE TO AIUSER;
GRANT AFLPM_CREATOR_ERASER_EXECUTE to AIUSER;

 

Python / Jupyter Notebooks

Use a Python environment of your choice to be able to use Jupyter Notebooks. There are many options, e.g. with SAP (Business Application Studio) or non-SAP environments. The following steps show how to run Jupyter Notebooks on your local laptop by using the free environment miniconda.

Download and install miniconda. The easiest option is probably the graphical installer.

Open the "Anaconda Prompt" that was installed by miniconda. On Windows just use the Operating System's "Search" box to find the prompt.

miniconda allows you to have multiple Python environments side by side. Making changes in one of these environments does not affect the others, which can be very useful for running tests in one environment without interfering with your productive code. Enter the following commands into the Anaconda Prompt, executing each line individually. These statements:

  • create a new Python 3.12 environment (which is currently the stable version of Python) named "sandbox"
  • activate the new "sandbox" environment, so that the following commands are executed for this environment
  • install the Python package jupyterlab, which, as you might have guessed, installs the Jupyter Notebook environment into the blank Python environment
  • add the "sandbox" environment to the Jupyter environment (you will see later what this means)
  • and finally, open the Jupyter environment.

 

conda create --name sandbox python=3.12
conda activate sandbox
pip install jupyterlab
python -m ipykernel install --user --name=sandbox
jupyter lab

 

100 anaconda - blur.png

Executing these commands will open the JupyterLab environment, in which individual Jupyter Notebooks can be executed. Notice the "sandbox" icon in the Notebook section. This icon appears because of the earlier "python -m ipykernel install --user --name=sandbox" command.

110 jupyter lab.png

Click the "sandbox" icon in the Notebook section and a Jupyter Notebook opens, in which you can script your Python code. At the top right you can see that the Notebook is running in the "sandbox" environment. Eventually you might have a number of Python environments, hence it is good to know for sure in which of these environments the code is executed.

120 notebook.png

In case you haven't worked with Jupyter Notebooks yet, this introduction to JupyterLab gets you started. Once you are familiar with the Jupyter interface, install the hana_ml package, which we will use to trigger the embedded Machine Learning. Enter this code into a cell in the Notebook, then execute the code, for example with the "Run" icon at the top.

 

!pip install hana_ml

 

155 install  hana_ml.png

This downloads and installs the hana_ml package with its dependencies, which can take a few seconds. Once complete, run this code in a new cell to see the version of the freshly installed package.

 

import hana_ml
print(hana_ml.__version__)

 

Later on we will also need the matplotlib and flask packages when charting the numbers. Run this command in a new cell.

 

!pip install matplotlib

 

And also this command in another cell.

 

!pip install flask

 

Time-series forecasting

We are ready for the fun part: everything should be in place to upload some data and create some time-series forecasts. Remember that all code can be downloaded from this repository. The notebooks that you find there are a little more elaborate than the blog.

Data upload

You find this section's code in the Notebook "010 Data upload.ipynb".

Begin by logging on to SAP HANA Cloud, through the hana_ml package in a Jupyter Notebook. Use the credentials of the AIUSER.

 

import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address='YOURENDPOINT',
                                   port=443, 
                                   user='AIUSER', 
                                   password='YOURSECRETPASSWORD', 
                                  )
conn.connection.isconnected()

 

Now bring the data that we want to upload from this CSV file into a Pandas DataFrame in Python. Display a few rows of the data.

 

import pandas as pd
df_data = pd.read_csv('OVERNIGHTSTAYS.csv')
df_data.MONTH = pd.to_datetime(df_data.MONTH, format='%d/%m/%Y')
df_data.head(5)

 

The table consists of four columns:

  • The month
  • The Swiss region
  • The nationality of the guests
  • And the number of overnight stays 

210 data preview.png

Then upload the Pandas DataFrame through the hana_ml package into a table in SAP HANA Cloud. The table and its columns are created automatically on the fly.

 

import hana_ml.dataframe as dataframe
df_remote = dataframe.create_dataframe_from_pandas(connection_context=conn, 
                                                   pandas_df=df_data, 
                                                   table_name='OVERNIGHTSTAYS',
                                                   force=True,
                                                   replace=False)

 

You can also see the data types of the columns that were created. The df_remote object points to the newly created table, and the dtypes() method returns the column names and types.

 

df_remote.dtypes()

 

220 data types.png

 

Forecasting single time-series

You find this section's code in the Notebook "020 Time-series forecast.ipynb".

Usually you would first spend some time exploring data. We keep this part in the tutorial very short, but we encourage you to spend sufficient time on data exploration in your own projects.

For now, retrieve a few rows of data to make sure that there are no obvious problems (e.g. empty rows). The head() method first restricts the dataset to avoid downloading unnecessary data. The collect() method then downloads the data into Python (as a pandas DataFrame) to display it.

 

df_remote = conn.table('OVERNIGHTSTAYS')
df_remote.head(5).collect()

 

300 peak.png

Get a quick feel for the content of the table. The describe() method instructs SAP HANA Cloud to calculate, for instance, how many values are missing per column, or what the min/max/median values are for numerical columns.

 

df_remote.describe().collect()

 

310 describe.png

Each of the values above was calculated in SAP HANA Cloud. Because the describe() method was called, hana_ml created the required SELECT statement, which you can access as well. You can also execute this SQL statement in other environments to retrieve the results.

 

print(df_remote.describe().select_statement)

 

320 describe select.png

You can also get a more graphical overview of the dataset in a canned report.

 

from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(df_remote).build().display()

 

330 unified data report.png

We move on to the data preparation for our use case. We want to create a single forecast of all overnight stays by month, hence we need to aggregate all detailed values by month. The hana_ml DataFrame has a method for that aggregation. Remember that this aggregation is also virtual: essentially, it adds a GROUP BY clause to the hana_ml DataFrame, which is a SELECT statement under the hood. Also look at a few rows of the monthly aggregates.

 

df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by='MONTH')
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()

 

340 monthly data.png

 

Plot the monthly numbers to get a feel for how they have progressed over time.

 

df_data = df_rem_agg.collect()
import matplotlib.pyplot as plt
df_data.plot(x='MONTH')
plt.xticks( rotation='vertical');

 

345 monthly line chart.png

 

Indeed, there is a nicely repeating pattern. Let's train a time-series model to capture this for a forecast. We use the AdditiveModelForecast algorithm, which is also known as Prophet.

 

from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast()
amf.fit(data=df_rem_agg)

 

View the trained model in another canned report.

 

from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()

 

350 ts report.png

We have a trained time-series model, but we have not yet created a prediction. To create a forecast, we first need a hana_ml DataFrame that contains the dates/months we would like forecasts for. Hence find out the most recent date of the training history: 2024-05-01 (May 1st, 2024).

 

str_lastdate = df_rem_agg.tail(1, ref_col='MONTH').collect().iloc[0,0]
str_lastdate = str(str_lastdate)[0:10]
print(str_lastdate)

 

Based on the last known date/month above, create a new hana_ml DataFrame that lists the following 12 months.

 

from hana_ml.algorithms.pal.random import binomial
months_to_forecast=12
df_rem_future = binomial(conn, n=1, p=1, num_random=months_to_forecast)
df_rem_future = df_rem_future.select('*', (f'''ADD_MONTHS(TO_DATE ('{str_lastdate}', 'YYYY-MM-DD'), ID+1)''', 'MONTH') )
df_rem_future = df_rem_future.select('MONTH', ('0', 'TARGET'))
df_rem_future.head(20).collect()

 

360 future dates.png
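As a local cross-check, the same 12 future months can also be produced with plain pandas; a hedged sketch, where the last training date is hard-coded as an example value (above it is derived from the data):

```python
import pandas as pd

str_lastdate = '2024-05-01'  # example value; the blog derives this from the training history

# date_range with freq='MS' (month start) beginning at the last known month;
# drop the first entry so we get the NEXT 12 months
future_months = pd.date_range(start=str_lastdate, periods=13, freq='MS')[1:]
df_future = pd.DataFrame({'MONTH': future_months, 'TARGET': 0})
print(df_future.head(3))
```

This local DataFrame is only for verification; the blog's binomial() trick builds the equivalent table directly inside SAP HANA Cloud, so no data needs to travel.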

Now apply the trained time-series model to predict the overnight stays in those 12 months.

 

df_rem_predicted = amf.predict(data=df_rem_future)
df_rem_predicted.head(5).collect()

 

370 forecast.png

The canned report for the trained model now also shows the above forecast. You can clearly see how a repeating pattern was identified and used to predict the future dates. Most overnight stays are in the summer (hiking?) but the winter also has its peaks (skiing and snowboarding!).

 

from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()

 

380 ts report 2.png

To share the forecasts with business users, it's often best to combine the training data and the forecasts into a single dataset. Here we combine both parts into a single hana_ml DataFrame.

 

df_rem_predicted = df_rem_predicted.select('MONTH',
                                          ('NULL', 'OVERNIGHTSTAYS_SUM'),
                                          ('YHAT', 'FORECAST'),
                                          ('YHAT_LOWER', 'FORECAST_LOWER'),
                                          ('YHAT_UPPER', 'FORECAST_UPPER'))
df_rem_agg = df_rem_agg.select('*', 
                               ('NULL', 'FORECAST'),
                               ('NULL', 'FORECAST_LOWER'),
                               ('NULL', 'FORECAST_UPPER'))
df_rem_all = df_rem_predicted.union(df_rem_agg)
df_rem_all.sort('MONTH').tail(5).collect()

 

390 forecast.png
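Conceptually, this union pads the columns that exist on only one side with NULLs. A minimal local pandas sketch with made-up numbers illustrates the shape of the combined dataset:

```python
import pandas as pd

# history rows carry only actuals, forecast rows carry only the predicted columns
hist = pd.DataFrame({'MONTH': ['2024-04', '2024-05'],
                     'OVERNIGHTSTAYS_SUM': [2.1e6, 2.3e6]})
pred = pd.DataFrame({'MONTH': ['2024-06'],
                     'FORECAST': [2.5e6],
                     'FORECAST_LOWER': [2.3e6],
                     'FORECAST_UPPER': [2.7e6]})

# columns missing on either side are filled with NaN, mirroring the NULLs in SQL
combined = pd.concat([hist, pred], ignore_index=True)
print(combined)
```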

Now save the hana_ml DataFrame to a table in SAP HANA Cloud. SAP Analytics Cloud for instance can now access the data.

 

df_rem_all.save('OVERNIGHTSTAYS_FORECAST_TOTAL', force=True)

 

You have created a time-series forecast!

 

Forecasting multiple time-series

You find this section's code in the Notebook "30 Multiple time-series forecast.ipynb".

In the previous example we aggregated all overnight stays into monthly values and forecast the following 12 months. Now we want to create individual forecasts for the countries from which most visitors come. Hence let's find the top 10 countries with the most overnight stays.

 

df_remote = conn.table('OVERNIGHTSTAYS')
df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE' ).sort('TOTAL', desc=True).head(10).collect()

 

500 top ten.png

We want to use that list of countries in a filter on the whole table. One way of achieving this is to create a comma-separated list of values for a WHERE clause.

 

countries = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE') \
     .sort('TOTAL', desc=True).select('COUNTRYOFRESIDENCE').head(10).collect() \
     .iloc[:,0].tolist()
countries = str(countries)
countries = countries.replace('[', '(')
countries = countries.replace(']', ')')
df_remote = df_remote.filter(f'''"COUNTRYOFRESIDENCE" IN {countries}''')

 

The hana_ml DataFrame now only holds information about those 10 countries. There are many options to filter the hana_ml DataFrame on such a top-N criterion; in the section "Going further" you see an alternative that uses an inner join to avoid having to extract the list of values.
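A note on the string handling above: instead of replacing the brackets of a stringified list, the IN list can also be built with an explicit join over the values. A small sketch with made-up country names (values containing single quotes would additionally need escaping):

```python
# Build a quoted, comma-separated IN list from a Python list of values
countries = ['Germany', 'United States', 'United Kingdom']  # made-up example values
in_list = "(" + ", ".join(f"'{c}'" for c in countries) + ")"
where_clause = f'"COUNTRYOFRESIDENCE" IN {in_list}'
print(where_clause)
# "COUNTRYOFRESIDENCE" IN ('Germany', 'United States', 'United Kingdom')
```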

Now we aggregate the data again, but this time by both month and country, so that we can create country-specific forecasts.

 

df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by=['MONTH', 'COUNTRYOFRESIDENCE'])
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()

 

530 aggregated.png

Let's see the historic values of the 10 countries.

 

df_data_plot = df_rem_agg.pivot_table(values='OVERNIGHTSTAYS_SUM', index='MONTH', columns='COUNTRYOFRESIDENCE', aggfunc='sum').collect()
df_data_plot.plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5));

 

540 plot top 10.png

Since the plot is quite full, let's focus on the two most popular foreign countries, Germany and the United States.

 

df_data_plot[['MONTH', 'Germany', 'United States']].plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5));

 

550 plot 2 countries.png

There are similarities in the overnight stays from the two countries (summers are most popular). However, numbers from the United States seem to be rising, while numbers from Germany seem to be dropping. Individual forecasts for each country can capture such individual patterns.

Train individual models with the same AdditiveModelForecast algorithm as before, but now specify the country of residence as group key, which produces an individual model for each country.

 

from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast(massive=True)
amf.fit(data=df_rem_agg, group_key='COUNTRYOFRESIDENCE')

 

As before, we now create a hana_ml DataFrame for the prediction. This DataFrame must list all the countries as well as all the dates for each country. First create the list of dates/months, as done before.

 

from hana_ml.algorithms.pal.random import binomial
months_to_forecast=12
str_lastdate = df_rem_agg.tail(1, ref_col='MONTH').collect().iloc[0,0]
str_lastdate = str(str_lastdate)[0:10]
df_rem_future = binomial(conn, n=1, p=1, num_random=months_to_forecast)
df_rem_future = df_rem_future.select('*', (f'''ADD_MONTHS(TO_DATE ('{str_lastdate}', 'YYYY-MM-DD'), ID+1)''', 'MONTH') )
df_rem_future = df_rem_future.select('MONTH', ('0', 'TARGET'))

 

But now we create a cartesian product with the countries. In case you are implementing this example yourself hands-on, please also check the Notebook "30 Multiple time-series forecast.ipynb" for a fully working example. There you will find, for instance, the code that creates the hana_ml DataFrame df_rem_countries that is used here.

 

df_rem_topredict = df_rem_future.add_id().set_index('ID').join(df_rem_countries.add_id('ID').set_index('ID'), how='cross' )
df_rem_topredict = df_rem_topredict.drop('ID')
df_rem_topredict.head(10).collect()

 

560 cartesian.png
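If you want to sanity-check the cartesian product locally, a pandas cross join behaves the same way (made-up months and countries here):

```python
import pandas as pd

months = pd.DataFrame({'MONTH': pd.date_range('2024-06-01', periods=3, freq='MS')})
countries = pd.DataFrame({'COUNTRYOFRESIDENCE': ['Germany', 'United States']})

# every month is paired with every country: 3 months x 2 countries = 6 rows
combined = months.merge(countries, how='cross')
print(combined)
```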

Apply the individual models to their corresponding dates/months in the prediction DataFrame. Here the predict() method returns multiple objects. The forecasts we need are in the first one (index 0).

 

df_rem_pred = amf.predict(data=df_rem_topredict, group_key='COUNTRYOFRESIDENCE')
df_rem_predicted = df_rem_pred[0]

 

That's all it took to mass-produce individual time-series models and forecasts. To verify that the forecasts are indeed individual, look at the data for Germany...

 

from hana_ml.visualizers.visualizer_base import forecast_line_plot
forecast_line_plot(pred_data=df_rem_predicted.filter(''' "GROUP_ID" = 'Germany' ''').drop('GROUP_ID').set_index('MONTH'),
                   confidence=("YHAT_LOWER", "YHAT_UPPER"), enable_plotly=True)

 

570 forecast germany.png

... and the United States.

 

from hana_ml.visualizers.visualizer_base import forecast_line_plot
forecast_line_plot(pred_data=df_rem_predicted.filter(''' "GROUP_ID" = 'United States' ''').drop('GROUP_ID').set_index('MONTH'),
                   confidence=("YHAT_LOWER", "YHAT_UPPER"), enable_plotly=True)

 

580 forecast usa.png

Combine the historic data with the predictions and save them to a single table.

 

df_rem_predicted = df_rem_predicted.select(
                                  'MONTH',
                                  ('GROUP_ID', 'COUNTRYOFRESIDENCE'),
                                  ('NULL', 'OVERNIGHTSTAYS_SUM'),
                                  ('YHAT', 'FORECAST'),
                                  ('YHAT_LOWER', 'FORECAST_LOWER'),
                                  ('YHAT_UPPER', 'FORECAST_UPPER')
                                 )
df_rem_agg = df_rem_agg.select('*', ('NULL', 'FORECAST'),
                                  ('NULL', 'FORECAST_LOWER'),
                                  ('NULL', 'FORECAST_UPPER')
                                  )
df_rem_all = df_rem_predicted.union(df_rem_agg)
df_rem_all.save('OVERNIGHTSTAYS_FORECAST_COUNTRIES', force=True)

 

You have the data ready to be shared with business users, for instance through SAP Analytics Cloud. Of course you can also look at the data here in Python. Select a country and put the data into a plot.

 

import matplotlib.pyplot as plt
df_data = df_rem_all.filter('''"COUNTRYOFRESIDENCE" = 'Germany' ''').collect()
fig, ax = plt.subplots()
ax.fill_between(df_data['MONTH'].values, df_data['FORECAST_LOWER'].values, df_data['FORECAST_UPPER'].values, alpha=0.2)
ax.plot(df_data['MONTH'].values, df_data['OVERNIGHTSTAYS_SUM'].values, '-')
plt.xticks(rotation=45);

 

590 Germany past and prediction.png

Now you are the expert in mass-producing time-series forecasts with SAP HANA Cloud!

 

Going Further

This final chapter has a few ideas on how to continue from the above.

Secure logon

To avoid having to use the database user and its password in clear text, you can refer to logon credentials kept in the Secure User Store of the SAP HANA client. This command stores the credentials under the key "MYHANACLOUD". Because of the "-i" parameter you will be prompted for the password, which ensures that the password does not become visible in the prompt history.

C:\Program Files\SAP\hdbclient>hdbuserstore -i SET MYHANACLOUD "YOURSQLENDPOINT:443" AIUSER

You can then use this key to log on to SAP HANA Cloud.

 

import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(userkey='MYHANACLOUD')
conn.connection.isconnected()

 

Scheduling

You can automate the creation of new forecasts by scheduling the hana_ml logic. One option is described in the blog Scheduling Python code on Cloud Foundry.

Restrict IP addresses

In the above steps we kept it simple and allowed all IP addresses to connect to SAP HANA Cloud. See the SAP HANA Cloud Administration Guide on allowing only specific IP addresses or ranges.

An easy option to determine your external IP address is a small piece of code like:

 

from requests import get
ip = get('https://api.ipify.org').text
print(f'My public IP address is: {ip}')

 

Improve forecast accuracy with external variables

Providing additional context can increase the forecast accuracy. One customer is using SAP's factory calendar table TFACS to flag working days in a daily forecast. For our monthly data of overnight stays in Switzerland, the exchange rates of the different currencies to the Swiss Franc could be useful for example. See the holiday and exog parameters of the fit()-method to pass such context to the algorithm.
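As an illustration of the data-preparation step only (not the blog's code), merging a hypothetical exchange-rate series onto the monthly data could look like the following local pandas sketch; the extra column would then travel with the training DataFrame and could be handed to the algorithm as an exogenous regressor:

```python
import pandas as pd

months = pd.date_range('2024-01-01', periods=3, freq='MS')
df = pd.DataFrame({'MONTH': months, 'OVERNIGHTSTAYS_SUM': [2.0e6, 2.1e6, 2.4e6]})
fx = pd.DataFrame({'MONTH': months, 'EURCHF': [0.93, 0.95, 0.94]})  # hypothetical rates

# the merged column becomes an additional input column of the training data
df_with_exog = df.merge(fx, on='MONTH')
print(df_with_exog)
```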

Filtering on Top N

In the example "Forecasting multiple time-series" above we filter the hana_ml DataFrame on the 10 countries with the highest number of overnight stays. To make the logic easy to follow, we implemented the filter by extracting the list of countries to essentially create a SQL WHERE clause. There are surely more elegant ways that do not require any data extraction. One option would be an inner join between the full hana_ml DataFrame and a hana_ml DataFrame that contains the distinct list of top-N countries.

 

df_remote = conn.table('OVERNIGHTSTAYS')
df_rem_top10 = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE' ).sort('TOTAL', desc=True).head(10)
df_remote = df_remote.set_index('COUNTRYOFRESIDENCE').join(df_rem_top10.set_index('COUNTRYOFRESIDENCE'), how='inner')

 

Summary and next steps

We hope this hands-on blog post enables you to supercharge your SAP HANA Cloud with Machine Learning. Transform your SAP data into actionable insights and harness the full potential of HANA ML!

To close off this tutorial, we have a few ideas for you to continue your journey:

📥Get hands-on: If you have only scrolled through the blog so far, take the time to implement the scenario yourself. You can download the notebooks and learn by doing.

📚Learn more:

🔗Join the Discussion: Share your experiences in the comments and connect with us.

13 Comments
ChristophMorgen
Product and Topic Expert

@AndreasForster,  I am happy to say, embedded #MachineLearning is now also available with #SAPHANACloud 𝐭𝐫𝐢𝐚𝐥 🚀 😊 !

SAPHANACloud-trail-PAL.jpg

You can sign-up for SAP HANA Cloud trial here https://developers.sap.com/tutorials/hana-cloud-mission-trial-1.html

AndreasForster
Product and Topic Expert

Thank you @ChristophMorgen for the good news that the Predictive Analysis Library is now also available in the trial! I have added the info to the blog, the text should be visible shortly after the change has been reviewed and approved. @YannickSchaper 

AdrianDBorja
Explorer

Hi Andreas 

First of all thanks a lot for this blog to get us all started on the AI / ML journey! I tried to follow the steps you outlined as is, and I have the following questions, hope you can enlighten.

1) When I include the massive=True parameter in AdditiveModelForecast, it gives me an error that says 'feature not supported: Cannot use local temporary table'. Leaving it blank, as in AdditiveModelForecast(), there is no error. What is the significance of massive=True, and is it functionally ok to execute this method without this parameter?

2) After my AdditiveModelForecast() call, I call the fit() command with group_key, as in:

amf = AdditiveModelForecast()
amf.fit(data=df_rem_agg, group_key='MATNR')
 
But actually, my data model looks like this:
MATNR,CATEGORY,ERDAT,QTY
000000000013204583,ASORT1,20241004,0.250000000000000
000000000013204538,ASORT2,20240829,0.400000000000000
000000000013204538,ASORT2,20240829,0.400000000000000
000000000013204538,ASORT2,20240829,0.400000000000000
000000000013204537,ASORT3,20240829,0.222000000000000
000000000013204537,ASORT3,20240829,0.300000000000000
 
Means I want to group by MATNR and CATEGORY for the time-series model, not just MATNR, but group_key only accepts a string, not an array. I thought I could enter it as group_key=['MATNR','CATEGORY'].
 
Is this a limitation, or there is another way of defining this based on two dimensions(MATNR and CATEGORY)?
 
3) The example is also using read_csv for the input data. I have extracted the CSV, but my real data model is actually in my HANA database, not as a single table but as the result of multiple joins. I'm aware that I can refer to a HANA table using 
conn.table('Table_name', schema='schema_name'), but it does not allow a whole SELECT statement in place of the Table_name.
Is this a limitation, or is there another way to refer to a HANA source that is defined by joins?
 
4) There is a reference to df_rem_countries but I don't see it defined anywhere, so I'm kind of stuck.

Thanks in advance!

Vitaliy-R
Developer Advocate

Hi @AdrianDBorja. Regarding issue 1: are you using a HANA Cloud db in the Free Tier or BTP Trial? Those have some restrictions: https://help.sap.com/docs/hana-cloud/sap-hana-cloud-administration-guide/sap-hana-database-license#f...

Regards.

AndreasForster
Product and Topic Expert

Hello @AdrianDBorja , Thanks for the feedback.

For 1) massive=True allows training multiple time-series forecasts with a single command. In the blog it is used in combination with the group_key to forecast individual values for each country. Maybe you get the error when setting massive=True but without providing that group_key?

For 2) The group_key indeed accepts only a string, not a list or an array. This means you just need to concatenate the values of the multiple grouping columns into one column. The name of this new column is then the group_key. Here is an example:

df_remote.select('*', ("REGION || '-' || COUNTRYOFRESIDENCE", 'CONCATCOL'))

For 3) In case you already have a SELECT statement (which might join multiple tables) you can base the hana_ml DataFrame on that custom sql.

df_remote = conn.sql('Your SELECT statement...')

Alternatively you can also use the hana_ml package to join multiple hana_ml DataFrames. Something like:

df_remote = conn.table('USEDCARS').set_index('CAR_ID') \
.join(df_rem_accepted.select(['CAR_ID', 'PREDICTED']) \
.set_index('CAR_ID'), how='left')

For 4) df_rem_countries is created in the Notebook that has all the code. Thanks for letting me know, I will make that clearer in the text

https://github.com/SAP-samples/btp-global-center-of-excellence-samples/blob/main/Hands-on%20Tutorial...

AdrianDBorja
Explorer

@Vitaliy-R this is enterprise acct actually

AdrianDBorja
Explorer

@AndreasForster thanks for the advice, I hadn't noticed the notebook earlier; now I'm using it as a reference.

So in comparison to the code in the notebook, replacing the part where dataframe is from CSV, I have used SQL instead.

I have reached the part where future forecast dates are generated and joined with the dimensions, i.e.:

df_rem_topredict = df_rem_future.add_id().set_index('ID').join(df_rem_matnr.add_id('ID').set_index('ID'), how='cross')
df_rem_topredict = df_rem_topredict.drop('ID')
df_rem_topredict.head(5).collect()
 
and I know df_rem_topredict has contents, because this is the output of collect():
AdrianDBorja_0-1740632553376.png

But once I execute the following after it, df_rem_pred is empty:

df_rem_pred = amf.predict(data=df_rem_topredict, group_key='MATNRFORMAT')
df_rem_predicted = df_rem_pred[0]
df_rem_pred[1].collect()
AdrianDBorja_1-1740632667367.png

amf was declared like this

amf = AdditiveModelForecast(massive=True)
amf.fit(data=df_rem_agg, group_key='MATNRFORMAT')
 
and df_rem_agg has contents from head(5).collect() as:
AdrianDBorja_2-1740632808056.png

There is no error, but for some reason the DataFrames returned by amf.predict turned out empty. I printed the result of amf.predict and it has 3 elements:

(<hana_ml.dataframe.DataFrame object at 0x7f869ed7ee10>, <hana_ml.dataframe.DataFrame object at 0x7f869ed7fe90>, <hana_ml.dataframe.DataFrame object at 0x7f869ed7fd10>)

But when I ran collect() on each of them, the first two are empty; the third one has this error (the message is also truncated):

PAL_MASSIVE_PROCESSING_SPECIAL_GROUP_ID

Error code: 73001050

Message: Internal error: encounter error when splitting...

Does this error mean anything?

AndreasForster
Product and Topic Expert
Product and Topic Expert

@AdrianDBorja Let me try to reach out to our Product Group about the error you are getting.
Just to double-check though: the hands-on example in this blog is working for you, and your question is about applying the same concept to your own data?

In the meantime I suggest simplifying. (How many different values do you have in MATNRFORMAT, by the way?): df_remote.distinct('MATNRFORMAT').count()

One way to simplify would be to start with a single time series. Also, you currently have individual dates. For events that do not occur very frequently, this could be a challenge. Do you need daily forecasts, or would monthly also help? You can aggregate the data by month with an approach like this:

# Add column that turns the data into the first day of the month
df_rem_prep = df_rem_test.select('*', ("TO_DATE(YEAR(DATEORG) || '-' || MONTH(DATEORG) || '-01')", 'FIRSTDAYOFMONTH'))
# Then aggregate on the month, as basis for a time series forecast
df_rem_prep.agg([('sum', 'KPI', 'KPI_AGG')], group_by='FIRSTDAYOFMONTH').collect()
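The same first-of-month aggregation can be illustrated locally with pandas (the column names DATEORG / KPI / KPI_AGG are taken from the snippet above; the sample data itself is made up):

```python
import pandas as pd

# Sample daily data; in the thread's scenario this lives in HANA
df = pd.DataFrame({
    'DATEORG': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03']),
    'KPI': [10, 5, 7],
})

# Truncate each date to the first day of its month, then sum per month --
# the local equivalent of the TO_DATE(YEAR(...) || '-' || MONTH(...) || '-01') expression
df['FIRSTDAYOFMONTH'] = df['DATEORG'].values.astype('datetime64[M]')
monthly = df.groupby('FIRSTDAYOFMONTH', as_index=False)['KPI'].sum()
monthly = monthly.rename(columns={'KPI': 'KPI_AGG'})

print(monthly['KPI_AGG'].tolist())  # [15, 7]
```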

ChristophMorgen
Product and Topic Expert
Product and Topic Expert

@AdrianDBorja which version of hana_ml are you using? Could you upgrade to the latest from pypi.org/hana-ml ?

Can you filter the DataFrames on a single group key and apply fit / predict without the grouping column?

Another hint: from the explanations I am not 100% sure whether your predict DataFrame includes the target column (with empty values) or not?

Furthermore, this notebook includes an example of how to use AdditiveModelForecast in massive mode with a group key, as well as the amf.make_future_dataframe method.
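To sketch those two suggestions locally (the hana_ml calls themselves need a live HANA connection, so they appear only as comments; the exact make_future_dataframe signature may differ by hana_ml version, and the column names and data below are made up):

```python
import pandas as pd

# hana_ml: df_single = df_rem_agg.filter("MATNRFORMAT = 'SOME-GROUP'")
# Local stand-in: history of one group, aggregated to monthly values
history = pd.DataFrame({
    'MONTH': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),
    'KPI_AGG': [15, 7, 9],
})

# hana_ml: df_future = amf.make_future_dataframe(data=df_single, periods=3)
# Local stand-in: generate the next three month-start dates after the history
last = history['MONTH'].max()
future = pd.DataFrame({'MONTH': pd.date_range(last, periods=4, freq='MS')[1:]})

print(future['MONTH'].dt.strftime('%Y-%m-%d').tolist())
# ['2024-04-01', '2024-05-01', '2024-06-01']
```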

Best regards,
Christoph

Gowtham27
Discoverer
0 Likes

Hello @AndreasForster,

I was trying to connect the HANA database from Python as guided in the blog.

I'm receiving the following error when trying to connect to HANA Cloud. I'm using a BTP trial account.

Code:

Gowtham27_0-1742793331513.png

Error:

 

Gowtham27_2-1742793764041.png

I'm using the SQL endpoint as the address, as guided in the blog. It doesn't work.

Gowtham27_3-1742793967815.png

I have cross-verified using the ping command; it says "Ping request could not find host."

 

AndreasForster
Product and Topic Expert
Product and Topic Expert

Hi @Gowtham27, I wonder whether your corporate firewall / proxy might be blocking the communication. A quick test could be to try the same code with the same credentials on a private laptop outside your corporate network.

Another reason could be that you have old versions of the hana_ml and hdbcli Python packages. The connection at this point is done by hdbcli. https://pypi.org/project/hdbcli/

Gowtham27
Discoverer
0 Likes

Hi @AndreasForster,

Thank you! It is working on my personal laptop.

AndreasForster
Product and Topic Expert
Product and Topic Expert

Hello @Gowtham27 , Great, that identifies the cause of the error: your corporate network is blocking the communication. If you want to run the code from within your network, you might need to specify proxy settings. This is an example that worked for another customer, but it may look very different for each network.

import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address='YOURENDPOINT',
                                   port=443,
                                   user='YOURUSER',
                                   password='YOURPASSWORD',
                                   proxyHostname='YOURPROXYHOSTNAME',
                                   proxyPort=YOURPROXYPORT,
                                   proxyUserName='YOURPROXYUSERNAME',
                                   proxyPassword='YOURPROXYPASSWORD')
conn.connection.isconnected()

The connection parameters are documented here https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/ee592e89dcce4480a99571a4a...


But I have also had customers where their IT department had to explicitly open ports to make the connection work.