With SAP Databricks we now have a dedicated environment for Data Scientists within the SAP Business Data Cloud. This blog gives a practical introduction to this bespoke Databricks edition by implementing a bare-bones demand forecast. SAP Databricks includes important functionality beyond what is covered in this entry-level tutorial, for example data sharing, experiment tracking or AutoML.
The integration of SAP Databricks is adding a new option to the SAP Business Technology Platform (BTP), giving customers more choice when creating custom extensions or applications. Existing components on the BTP, which you might already be familiar with, remain very relevant and strategic, such as:
And remember to check on the AI/ML functionality that has already been built into SAP standard applications, or what is in the pipeline to be released. Both released and planned functionality is shown in the Roadmap Explorer.
Note: All data, code and images used in this blog can be downloaded from this repository: Hands-on Tutorial SAP Databricks. There you find both the individual files and the whole project exported as a Databricks Archive.
Before we get going, big thanks to Stojan Maleschlijski for all the great collaboration, including exploring the SAP Databricks capabilities together! @stojanm
This blog aims to give a first introduction to carrying out a Machine Learning project on SAP Databricks. To simulate a demand forecast, we will predict how many nights people spend in hotels in Switzerland, using granular data that is kindly shared by the Swiss statistics department.
A Data Scientist (you) creates a monthly forecast and sets up a schedule for the forecast to be updated every month. This data would then become part of a business process, maybe you want to provide the data to business users in a dashboard. This tutorial however focusses solely on creating and scheduling the forecast in SAP Databricks.
It's a simple time-series forecasting example, but the concept is extremely relevant for many different business requirements. Instead of forecasting how many visitors stay overnight, you might need to forecast your sales quantities, your cash flow, or topics that don't come immediately to mind. One customer, for instance, uses time-series forecasting to improve the data quality of external data that is loaded into the system: if a new value falls outside the predicted range, there might be a data quality issue and IT can follow up on whether the data is correct.
In a productive system a typical architecture and data flow would look like this.
The architecture of this hands-on exercise is focused purely on getting some familiarity with SAP Databricks. The only interface we will be using is SAP Databricks. We use it to upload the data and to go through the steps of a Machine Learning project.
Getting started is pretty easy.
Start by bringing the historic data, on which we want to train a model, into the system. Upload the dataset OVERNIGHTSTAYS.csv with the graphical user interface of SAP Databricks, following the steps shown in this screencam. In case you are working with the trial, you need to change the Catalog drop-down to "workspace" as shown.
HINT: In case the upload fails with the message "Table with same name already exists", then another user has already created the table in this shared environment. Due to access rights you won't be able to see that other user's table. Just change the table name in your upload to something unique and remember to use this name when accessing the table further on.
The data is uploaded and shows in the Catalog as a Delta table. Physically it is stored in the object store. The table's Overview tab lists the column names and their data types:
Move to the table's "Sample Data" tab to see some of the uploaded rows.
The first line shows that in January 2022 there were 392,805 nights spent by Swiss residents in the area of Graubünden. In case you are not that familiar with Swiss geography: Graubünden is the largest canton in Switzerland, it lies in the Alps and includes beautiful places like St. Moritz and Davos.
You have two options to explore the data in SAP Databricks.
Let's look closer into the data. You can start with some SQL statements. You can go into the SQL Editor and type in your own syntax. The screencam shows two simple examples. We learn for instance that we have data from January 2022 to May 2025 to work with.
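If you want to reproduce that date-range check later from a notebook cell instead of the SQL Editor, a minimal sketch could look like this (the table name workspace.default.overnightstays assumes the default upload target used above; adjust it if you had to rename your table):
# Check the first and last month available in the uploaded table
# (spark and display are provided by the Databricks notebook runtime)
display(spark.sql("""
    SELECT MIN(MONTH) AS first_month,
           MAX(MONTH) AS last_month
    FROM workspace.default.overnightstays
"""))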
Or describe what you are looking for and have SAP Databricks write the SQL for you! Click the small red-ish star icon to toggle on the Databricks Assistant.
Just try out what you are interested in. I was wondering for example:
It turns out that most overnight stays in Switzerland are by Swiss residents, followed by German residents. Maybe surprisingly, residents from the United States are a close third.
Now continue the data exploration in a Notebook. Databricks Notebooks are quite special in that they can contain more than one scripting language. SAP Databricks supports both SQL as well as Python. The Notebook created in this section can also be downloaded from the repository.
Begin by creating a folder in which we will save our Notebooks. Go into the "Workspace" section, which is where such files are kept together. Within your user's workspace create a folder called "Demand forecast".
In that new folder create a first Notebook as shown in the next screencam. Rename it to "010 Data exploration" and add a short header in Markdown at the top. Markdown is a common way to add comments and context to the code in a Notebook. Adding a "#" character at the beginning of a line formats the text as a top-level heading, two "#" characters make it a second-level heading, and so on. Without any "#" character the text is formatted as plain text.
To add the Markdown, you need to change the cell's language selector to "Markdown" as shown in the screencam. Enter your text. When done, click outside the cell and the formatting kicks in.
Since the cells can also execute SQL code, add the most recent SQL statement from the SQL Editor.
SELECT
    COUNTRYOFRESIDENCE,
    SUM(OVERNIGHTSTAYS) AS total_overnightstays
FROM
    workspace.default.overnightstays
GROUP BY
    COUNTRYOFRESIDENCE
ORDER BY
    total_overnightstays DESC
The screencam shows how a new cell can be added to the Notebook, by hovering with the mouse at the bottom of the cell above. Select "Code" as cell type. Change the cell's language selector to "SQL", paste the code and run it with the blue play button. You will see the same result as before in the "SQL Editor".
Let's switch to Python for the data exploration so that you have tried all the options. Create a new Code cell, its language selector might already be on "Python" by default. Begin by loading the data into a PySpark DataFrame using the table's fully qualified path.
overnightstays_sdf = spark.read.table("workspace.default.overnightstays")
Aggregate the data by month and look at the results. The display command initially shows the data as a table, but it also comes with a graphical user interface to quickly create a plot. Follow the steps in the screencam to create a line chart to see how the numbers have evolved over time. There is a clear pattern: especially in July and August the numbers are at their highest.
import pyspark.sql.functions as F
display(overnightstays_sdf.groupBy("month").agg(F.sum("overnightstays").alias("overnightstays")))
Now explore whether we can visually get a feel for an overall trend in the data: do the numbers tend to go up or down over time? Have the Assistant do the work of writing the code with this request:
Create a new DataFrame that has the year in a new column. aggregate by year and show the result in a plotly chart
Plotly is a very common Python charting library. And indeed, when looking at the yearly totals, the numbers are increasing year over year. The year 2025 is not meaningful yet in this chart, as the dataset only contains January to May for that year.
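The code the Assistant generates will vary. Purely as an illustration, a sketch of what it might roughly produce could look like this (it reuses the overnightstays_sdf DataFrame from above and assumes the month column is of a date type, otherwise cast it first):
import pyspark.sql.functions as F
import plotly.express as px

# Derive the year from the month column and aggregate the overnight stays per year
overnightstays_yearly_df = (
    overnightstays_sdf
    .withColumn("year", F.year("month"))
    .groupBy("year")
    .agg(F.sum("overnightstays").alias("overnightstays"))
    .toPandas()
    .sort_values("year")
)

# Show the yearly totals as a Plotly bar chart
fig = px.bar(overnightstays_yearly_df, x="year", y="overnightstays",
             title="Overnight stays per year")
fig.show()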
During the above data exploration we saw that the data has a trend (numbers are increasing over time) and some seasonality (i.e. numbers are largest in the summer). Let's train a Machine Learning model that picks up on such patterns and can estimate future values. The Notebook created in this section can also be downloaded from the repository.
In case you are unsure about how to implement the following 4 code blocks in a new notebook called "020 Demand forecast", this screencam will guide you along. It shows in a quick scroll-through what the output should look like.
Read the historic data again into a PySpark DataFrame.
overnightstays_sdf = spark.read.table("workspace.default.overnightstays")
Since we want to create a monthly forecast on the total values, aggregate the history by month. Because we are working with a PySpark DataFrame, the built-in Spark engine does the work.
import pyspark.sql.functions as F
overnightstaysmonthly_sdf = overnightstays_sdf.groupBy("month").agg(F.sum("overnightstays").alias("overnightstays"))
display(overnightstaysmonthly_sdf)
The data needs to be prepared further, so that the Python package called Prophet can train a time-series model on it.
# Convert Spark DataFrame to Pandas DataFrame (the time series algorithm requires a Pandas DataFrame)
overnightstaysmonthly_df = overnightstaysmonthly_sdf.toPandas()
# Sort the DataFrame by date
overnightstaysmonthly_df = overnightstaysmonthly_df.sort_values('month')
# Rename the columns, as required by the time-series algorithm Prophet
overnightstaysmonthly_df = overnightstaysmonthly_df.rename(columns={'month': 'ds', 'overnightstays': 'y'})
We want to use the Prophet package for the forecast, which still needs to be installed. Here we install it directly from the Notebook. However, in the "Good to know" section below you see a more elegant way of using additional Python packages, by creating your own customised Base environment.
%pip install prophet==1.1.7
Everything is now in place to finally train the time-series model. The following block of code contains multiple steps. It is useful to run these steps together under the "with mlflow.start_run()" section, as this allows logging information about the training run in SAP Databricks. The following code:
Yes, in a real project this code would be even more elaborate. For example, you may want to try different model configurations to get even better forecast accuracy. And the accuracy should ideally be calculated on a hold-out sample: the code below uses the same data for training as well as for checking the model's accuracy, but the accuracy should really be calculated on data the model has never seen before.
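As an aside, and not part of the main flow of this tutorial, such a hold-out check could look roughly like the sketch below. It reuses the overnightstaysmonthly_df prepared above; the variable names and the choice of 6 hold-out months are just for illustration.
from prophet import Prophet
from sklearn.metrics import mean_absolute_percentage_error

# Hold back the last 6 months as a test set the model never sees during training
holdout_months = 6
train_df = overnightstaysmonthly_df.iloc[:-holdout_months]
test_df = overnightstaysmonthly_df.iloc[-holdout_months:]

# Train only on the older data
holdout_model = Prophet()
holdout_model.fit(train_df)

# Predict exactly the held-out months and compare against the known actuals
holdout_forecast_df = holdout_model.predict(test_df[["ds"]])
holdout_mape = mean_absolute_percentage_error(test_df["y"], holdout_forecast_df["yhat"])
print(f"Hold-out MAPE: {holdout_mape:.2%}")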
HINT: In case the following code gives the error "MLflow not available", check in the Environment settings on the right that the "Environment version" is set to 2. These environments are explained in the "Good to know" section further below. For some people it defaults to version 1 (which doesn't include the mlflow library), for others it defaults to the required version 2.
import pandas as pd
from prophet import Prophet
from sklearn.metrics import mean_absolute_percentage_error
import matplotlib.pyplot as plt
import mlflow, mlflow.tracking._model_registry.utils
mlflow.tracking._model_registry.utils._get_registry_uri_from_spark_session = lambda: "databricks-uc"
with mlflow.start_run():
    # Initialize the Prophet model
    model = Prophet()
    # Fit the model
    model.fit(overnightstaysmonthly_df)

    # Create a DataFrame that contains all dates for which a prediction is required, the future 12 months but also the known past for comparison
    datestopredict_df = model.make_future_dataframe(periods=12, freq='MS')
    # Forecast the future and known past
    forecast_df = model.predict(datestopredict_df)

    # Plot the predictions together with the known past
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(forecast_df['ds'], forecast_df['yhat'], label='Forecast', color='red')
    ax.plot(overnightstaysmonthly_df['ds'], overnightstaysmonthly_df['y'], label='Historical Data')
    ax.fill_between(forecast_df['ds'], forecast_df['yhat_lower'], forecast_df['yhat_upper'], color='red', alpha=0.3)
    ax.set_title('Demand forecast')
    ax.set_xlabel('Month')
    ax.set_ylabel('Total Overnightstays')
    ax.legend()
    ax.grid(True)
    plt.show()

    # Calculate and log model accuracy, here MAPE on training data
    overnightstaysmonthly_df[["ds"]] = overnightstaysmonthly_df[["ds"]].apply(pd.to_datetime)
    actualsandpredicted_df = overnightstaysmonthly_df[['ds', 'y']].merge(forecast_df[['ds', 'yhat']], on='ds', how='inner', suffixes=('_left', '_right'))
    prophet_mape = mean_absolute_percentage_error(actualsandpredicted_df['y'], actualsandpredicted_df['yhat'])
    mlflow.log_metric("mape", prophet_mape)

    # Log the chart in the Unity Catalog / Experiment
    mlflow.log_figure(fig, "Forecast.png")

    # Log the size of the training dataset
    mlflow.log_metric("rowcount_training", overnightstaysmonthly_df.shape[0])
Run the above code and we see our prediction!
To persist the forecast, save it as a Delta table. From here the data can be made available to SAP Datasphere and SAP Analytics Cloud. That part is outside the scope of this tutorial; please check the documentation.
HINT: In case the saving fails with the message "PERMISSION_DENIED", then another user has already created that table and you are not allowed to overwrite it. Just change the name of the table you are creating to proceed.
forecast_sdf = spark.createDataFrame(forecast_df)
forecast_sdf.write.mode("overwrite").saveAsTable("workspace.default.overnightstays_forecast")
The forecast is saved as a Delta table, which means you can see it in the Catalog!
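If you want to double-check the saved table from a notebook cell, a quick look could be (assuming you kept the table name workspace.default.overnightstays_forecast):
# Read the persisted forecast back from the Delta table and show a few rows
forecast_check_sdf = spark.read.table("workspace.default.overnightstays_forecast")
display(forecast_check_sdf.select("ds", "yhat", "yhat_lower", "yhat_upper").limit(10))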
The information that was logged during the training is now available in the Experiments section. The model's MAPE (Mean Absolute Percentage Error) is at 1.4%, 41 records were used during training, and we can see the chart that compares actuals to predictions. Especially when trying out different settings for the Machine Learning algorithm, this tracking becomes very useful, for instance to select which configuration to use for the prediction.
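If you later want to compare several runs programmatically rather than through the UI, one possible way is mlflow.search_runs, which returns the logged runs as a Pandas DataFrame (this sketch assumes the runs above were logged to the notebook's default experiment and that the metrics "mape" and "rowcount_training" were logged as shown):
import mlflow

# List the runs of the current experiment, best (lowest) MAPE first
runs_df = mlflow.search_runs(order_by=["metrics.mape ASC"])
display(runs_df[["run_id", "start_time", "metrics.mape", "metrics.rowcount_training"]])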
Everything we need for our demand forecast is implemented and we can run it manually. That means we are also good to have it automated with a schedule. Straight from the notebook you can define the scheduling logic, i.e. the recurring interval and whether you want any notifications. The notebooks can also be triggered through APIs if you prefer.
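For the API route, a rough sketch could look like the snippet below. It assumes the standard Databricks Jobs REST API is reachable in your SAP Databricks workspace and that you replace the placeholder host, token and job ID with your own values; check the documentation for what applies to your environment.
import requests

# Hypothetical placeholders: replace with your workspace URL, an access token and the ID of the scheduled job
DATABRICKS_HOST = "https://<your-workspace-host>"
DATABRICKS_TOKEN = "<your-token>"
JOB_ID = 123  # e.g. the job that runs the "020 Demand forecast" notebook

# Trigger a run of the existing job via the Jobs API
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json={"job_id": JOB_ID},
)
response.raise_for_status()
print(response.json())  # contains the run_id of the triggered run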
A few things that might be good to know.
In the above code the Python package Prophet is installed directly from within a Notebook. That's ok, but not ideal for ongoing use. If you repeatedly re-run all cells of the Notebook, the package will be re-installed every time. It's generally better to centrally specify the packages you know you will be using long term. This also ensures consistency across projects and developers. In SAP Databricks you can specify reusable / shareable lists of Python packages that are to be installed through Base environments.
Create a Base environment by creating a file called base_env_prophet.yaml with this content:
dependencies:
- prophet==1.1.7
With this base file in place, you can
To test that the package is correctly installed through the base file, clear the environment as shown in the screencam and run the Python code in the cells.
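A quick way to verify this is a small check in a Python cell, for example:
# If the base environment is applied correctly, Prophet imports without any %pip install in the notebook
import prophet
print(prophet.__version__)  # should print 1.1.7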
Now that the table "overnightstays" has been used a few times, check out the lineage, which shows where the table is used and the statistics on how heavily that table is used.
From a distance most Notebooks look very much alike. I heard some great feedback that a visual / image on top of the notebook would help to differentiate them. In the repository you find a few examples, in case you would like to use those to distinguish between Data Exploration, AI/ML Sandboxing and AI/ML Deployment.
Upload any images that you want to use into the Workspace. To keep things tidy I am creating a separate folder for them.
These images can then be shown in a Markdown cell.
SAP Databricks is adding a dedicated Data Science environment to the SAP landscape. After this hands-on experience you hopefully have a first feel for how a Data Scientist can enrich the data in SAP Business Data Cloud with Machine Learning.
For your own projects, you will want to check out the documentation:
I hope you enjoyed getting hands-on with SAP Databricks!