AndreasForster
Product and Topic Expert

000 logos white.png

With SAP Databricks we now have a dedicated environment for Data Scientists within the SAP Business Data Cloud. This blog gives a practical introduction to this bespoke Databricks edition by implementing a bare-bones demand forecast. SAP Databricks includes important functionality beyond what is covered in this entry-level tutorial, for example data sharing, experiment tracking or AutoML.

The integration of SAP Databricks is adding a new option to the SAP Business Technology Platform (BTP), giving customers more choice when creating custom extensions or applications. Existing components on the BTP, which you might already be familiar with, remain very relevant and strategic, such as:

  • SAP AI Core, which provides for example access to a long list of Large Language Models through its Generative AI Hub. Some of the models are even hosted on SAP's own physical infrastructure, giving increased security (currently Mistral and Aleph Alpha).
  • SAP Document AI, which extracts information from documents such as PDFs, Excel, images, ...
  • SAP HANA Cloud and SAP Datasphere, with their built-in multi-model capabilities, i.e. Machine Learning, embedding generation for text, a Vector engine, a Graph engine and Geospatial processing
  • SAP Build, Joule Studio for adding new skills to our digital assistant Joule, and the ability to create new AI agents on the roadmap. This blog (series) on SAP's AI Agent Architecture by our CTO and Chief AI Officer Philipp Herzig gives an excellent overview of the wider picture.
  • SAP Databricks can also be used as a development environment for the above components. This blog by @san_tran for instance shows an SAP Databricks project that integrates with Large Language Models on SAP AI Core. 

And remember to check on the AI/ML functionality that has already been built into SAP standard applications, or what is in the pipeline to be released. Both released and planned functionality is shown in the Roadmap Explorer.

Note: All data, code and images that are used in this blog can be downloaded from this repository: Hands-on Tutorial SAP Databricks. In that repository you find the individual files but also the whole project exported as Databricks Archive.

Before we get going, big thanks to Stojan Maleschlijski for all the great collaboration, including exploring the SAP Databricks capabilities together! @stojanm 

 


 

Use Case

This blog aims to give a first introduction to carrying out a Machine Learning project on SAP Databricks. To simulate a demand forecast, we will predict how many nights people spend in hotels in Switzerland, using granular data that is kindly shared by the Swiss statistics department.

A Data Scientist (you) creates a monthly forecast and sets up a schedule for the forecast to be updated every month. This data would then become part of a business process; for example, you might want to provide it to business users in a dashboard. This tutorial, however, focusses solely on creating and scheduling the forecast in SAP Databricks.

It's a simple time-series forecasting example, but the concept is relevant for many different business requirements. Instead of forecasting how many visitors are staying overnight, you might need to forecast your sales quantities, your cash flow, or even topics that don't come to mind immediately. One customer, for instance, is using time-series forecasting to improve the data quality of external data that is loaded into their system: if a new value falls outside the predicted range, there might be a data quality issue, and IT can follow up to check whether the value is actually correct.
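As a rough sketch of that data-quality idea (purely illustrative, not the customer's actual implementation; the column names follow the conventions of the Prophet library that we will use later in this tutorial), a newly loaded value could be compared against the model's prediction interval:

import pandas as pd

# forecast_df: a Prophet forecast with columns ds, yhat_lower, yhat_upper
# new_values_df: the freshly loaded external data with columns ds and y
def flag_suspicious_values(new_values_df: pd.DataFrame, forecast_df: pd.DataFrame) -> pd.DataFrame:
    merged = new_values_df.merge(
        forecast_df[["ds", "yhat_lower", "yhat_upper"]], on="ds", how="inner"
    )
    # Values outside the predicted range are flagged for a manual follow-up by IT
    outside = (merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])
    return merged[outside]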

 

Architecture

In a productive system a typical architecture and data flow would look like this.

  1. A Data Product is pushed into the Business Data Cloud Object Store
  2. This Data Product is shared with SAP Databricks
  3. SAP Databricks creates and saves a forecast as DeltaTable into the Object Store
  4. This table is registered as Custom Data Product in the Business Data Cloud
  5. SAP Datasphere installs the Custom Data Product, making it accessible for further data modelling and visualisation in SAP Analytics Cloud

100 architecture prod.png

 

The architecture of this hands-on exercise is focused purely on getting some familiarity with SAP Databricks. It is the only interface we will be using, both to upload the data and to go through the steps of a Machine Learning project.

114 hands-on architecture.png

 

Prerequisites

Getting started is pretty easy.

  • You will just need to have access to an instance of SAP Databricks. This could be the SAP Business Data Cloud trial.
  • Some familiarity with SQL and especially Python would be very useful. 
  • And maybe just some curiosity to try something new.

 

Upload data

Start by bringing the historic data, on which we want to train a model, into the system. Upload the dataset OVERNIGHTSTAYS.csv through the graphical user interface of SAP Databricks, following the steps shown in this screencam. If you are working with the trial, change the Catalog drop-down to "workspace" as shown.

HINT: In case the upload fails with the message "Table with same name already exists", another user has already created the table in this shared environment. Due to access rights you won't be able to see that other user's table. Just change the table name in your upload to something unique and remember to use this name when accessing the table further on.

010 data upload.gif

The data is now uploaded and shows in the Catalog as a (Delta)Table. Physically it is stored in the Object Store. The table's Overview tab lists the column names and their data types:

200 table.png

 

Move to the table's "Sample Data" tab to see some of the uploaded rows.

210 sample data.png

The first line shows that in January 2022 there were 392,805 nights spent by Swiss residents in the area of Graubünden. In case you are not that familiar with Swiss geography: Graubünden is the largest canton in Switzerland, it lies in the Alps and includes beautiful places like St. Moritz and Davos. 

 

Explore the data 

You have two options to explore the data in SAP Databricks.

  • Either query the DeltaTable directly with SQL
  • Or use a Notebook, in which you can use both SQL and Python

Data exploration with the SQL Editor

Let's look closer into the data, starting with some SQL statements. Go into the SQL Editor and type in your own syntax. The screencam shows two simple examples. We learn, for instance, that we have data from January 2022 to May 2025 to work with.

  • SELECT * from overnightstays
  • SELECT min(MONTH), max(month) from overnightstays

1 SQL Editor.gif

 

Or describe what you are looking for and have SAP Databricks write the SQL for you! Click the small red-ish star icon to toggle on the Databricks Assistant.

230 assistant toggle.png

 

Just try out what you are interested in. I was wondering for example:

  • which regions are in overnightstays?
  • use table overnightstays to determine how many overnight stays were they by region, sort the results
  • use table overnightstays to summarise overnightstays by countryofresidence

2 sql assistant.gif

 

It turns out that most overnight stays in Switzerland are by Swiss residents, followed by German residents. Maybe surprisingly residents from the United States are a close third.

240 overnight by countryofresidence.png

 

Data exploration with a Notebook

Now continue the data exploration in a Notebook. Databricks Notebooks are quite special in that they can contain more than one scripting language; SAP Databricks supports both SQL and Python. The Notebook created in this section can also be downloaded from the repository.

Begin by creating a folder in which we will save our Notebooks. Go into the "Workspace" section, which is where such files are kept together. Within your user's workspace create a folder called "Demand forecast".

250 create folder.gif

 

In that new folder create a first Notebook as shown in the next screencam. Rename it to "010 Data exploration" and add a short header in Markdown at the top. Markdown is a common way to add comments and context to the code in a Notebook. Adding a "#" character at the beginning of a line formats the text as a top-level heading, two "#" characters make it a second-level heading, and so on. Without any "#" character the text is formatted as plain text. 

To add the Markdown, you need to change the cell's language selector to "Markdown" as shown in the screencam. Enter your text. When done, click outside the cell and the formatting kicks in.

260 create notebook.gif

 

Since the cells can also execute SQL code, add the most recent SQL statement from the SQL Editor.

SELECT
  COUNTRYOFRESIDENCE,
  SUM(OVERNIGHTSTAYS) AS total_overnightstays
FROM
  workspace.default.overnightstays
GROUP BY
  COUNTRYOFRESIDENCE
ORDER BY
  total_overnightstays DESC

 

The screencam shows how a new cell can be added to the Notebook, by hovering with the mouse at the bottom of the cell above. Select "Code" as cell type. Change the cell's language selector to "SQL", paste the code and run it with the blue play button. You will see the same result as before in the "SQL Editor".

270 notebook with sql.gif

 

Let's switch to Python for the data exploration, so that you have tried all the options. Create a new Code cell; its language selector might already be set to "Python" by default. Begin by loading the data into a PySpark DataFrame using the table's fully qualified path.

overnightstays_sdf = spark.read.table("workspace.default.overnightstays") 

 

Aggregate the data by month and look at the results. The display command initially shows the data as a table, but it also comes with a graphical user interface to quickly create a plot. Follow the steps in the screencam to create a line chart and see how the numbers have evolved over time. There is a clear seasonal pattern: the numbers are at their highest in July and August.

import pyspark.sql.functions as F
display(overnightstays_sdf.groupBy("month").agg(F.sum("overnightstays").alias("overnightstays")))

280 display chart.gif

 

Now explore whether we can visually get a feel for an overall trend in the data, i.e. whether the numbers tend to go up or down over time. Have the Assistant do the work of writing the code with this request:

Create a new DataFrame that has the year in a new column. aggregate by year and show the result in a plotly chart

Plotly is a very common Python charting library. And indeed, when looking at the yearly totals, the numbers are increasing year over year. The year 2025 is not meaningful yet in this chart, as the dataset only contains January to May for that year.

290 yearly totals.gif
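The exact code the Assistant generates will vary. As a rough orientation, a sketch of what it might produce could look like this (the variable name and the date cast are assumptions; should Plotly not be available in your environment, it can be added just like Prophet below):

import pyspark.sql.functions as F
import plotly.express as px

# Derive the year (assuming the month column can be cast to a date) and aggregate
yearly_df = (
    overnightstays_sdf
    .withColumn("year", F.year(F.col("month").cast("date")))
    .groupBy("year")
    .agg(F.sum("overnightstays").alias("overnightstays"))
    .orderBy("year")
    .toPandas()
)

# Plot the yearly totals with Plotly
fig = px.bar(yearly_df, x="year", y="overnightstays", title="Overnight stays per year")
fig.show()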

 

Create a forecast with SAP Databricks

During the above data exploration we saw that the data has a trend (numbers are increasing over time) and some seasonality (i.e. numbers are largest in the summer). Let's train a Machine Learning model that picks up on such patterns and can estimate future values. The Notebook created in this section can also be downloaded from the repository.

In case you are unsure how to implement the following 4 code blocks in a new notebook called "020 Demand forecast", this screencam will guide you along. It shows, in a quick scroll-through, what the output should look like.

400 forecast prep.gif

 

Read the historic data again into a PySpark DataFrame.

overnightstays_sdf = spark.read.table("workspace.default.overnightstays") 

 

Since we want to create a monthly forecast on the total values, aggregate the history by month. As we are working with a PySpark DataFrame, the built-in Spark engine is doing the work.

import pyspark.sql.functions as F
overnightstaysmonthly_sdf = overnightstays_sdf.groupBy("month").agg(F.sum("overnightstays").alias("overnightstays"))
display(overnightstaysmonthly_sdf)

 

The data needs to be prepared further, so that the Python package called Prophet can train a time-series model on it.

# Convert Spark DataFrame to Pandas DataFrame (the time series algorithm requires a Pandas DataFrame)
overnightstaysmonthly_df = overnightstaysmonthly_sdf.toPandas()

# Sort the DataFrame by date
overnightstaysmonthly_df = overnightstaysmonthly_df.sort_values('month')

# Rename the columns, as required by the time-series algorithm Prophet
overnightstaysmonthly_df = overnightstaysmonthly_df.rename(columns={'month': 'ds', 'overnightstays': 'y'})

 

We want to use the Prophet package for the forecast, which still needs to be installed. Here we install it directly from the Notebook. However, in the "Good to know" section below you will see a more elegant way of adding Python packages, by creating your own customised Base environment.

%pip install prophet==1.1.7

 

Everything is now in place to finally train the time-series model. The following block of code contains multiple steps. It is useful to run these steps together under the "with mlflow.start_run():" section, as this allows logging information about the training run in SAP Databricks. The following code:

  • Trains the time series model
  • Creates a DataFrame with the future 12 months that are to be forecasted
  • Creates a forecast for the known past and the future 12 months
  • Calculates and logs the model's accuracy on the known history
  • Plots and logs the known history against the predicted values
  • Logs the number of records that were used during training

Yes, in a real project this code would be even more elaborate. For example, you may want to try different model configurations to get even better forecast accuracy. And the accuracy should ideally be calculated on a hold-out sample, i.e. on data the model has never seen before; currently the code uses the same data for training as well as for checking the model's accuracy. 
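If you want to try that yourself, here is a minimal sketch of such a hold-out evaluation (holding back the last 12 months is an arbitrary choice for this dataset, and the variable names are just suggestions):

from prophet import Prophet
from sklearn.metrics import mean_absolute_percentage_error

# Hold back the last 12 months as unseen test data
train_df = overnightstaysmonthly_df.iloc[:-12]
test_df = overnightstaysmonthly_df.iloc[-12:]

# Train only on the older history
holdout_model = Prophet()
holdout_model.fit(train_df)

# Predict the held-out months and compare against the actuals
holdout_forecast_df = holdout_model.predict(test_df[["ds"]])
holdout_mape = mean_absolute_percentage_error(test_df["y"], holdout_forecast_df["yhat"])
print(f"MAPE on hold-out data: {holdout_mape:.2%}")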

HINT: In case the following code gives the error "MLflow not available", please check in the Environment settings on the right that the "Environment version" is set to 2. These environments are explained in the "Good to know" section further below. For some people it defaults to version 1 (which doesn't include the mlflow library), for others it defaults to the required version 2.

import pandas as pd
from prophet import Prophet
from sklearn.metrics import mean_absolute_percentage_error
import matplotlib.pyplot as plt
import mlflow, mlflow.tracking._model_registry.utils

mlflow.tracking._model_registry.utils._get_registry_uri_from_spark_session = lambda: "databricks-uc"

with mlflow.start_run():

    # Initialize the Prophet model
    model = Prophet()

    # Fit the model
    model.fit(overnightstaysmonthly_df)

    # Create a DataFrame that contains all dates for which a prediction is required, the future 12 months but also the known past for comparison
    datestopredict_df = model.make_future_dataframe(periods=12, freq='MS')

    # Forecast the future and known past
    forecast_df = model.predict(datestopredict_df)

    # Plot the predictions together with known past
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(forecast_df['ds'], forecast_df['yhat'], label='Forecast', color='red')
    ax.plot(overnightstaysmonthly_df['ds'], overnightstaysmonthly_df['y'], label='Historical Data')
    ax.fill_between(forecast_df['ds'], forecast_df['yhat_lower'], forecast_df['yhat_upper'], color='red', alpha=0.3)
    ax.set_title('Demand forecast')
    ax.set_xlabel('Month')
    ax.set_ylabel('Total Overnightstays')
    ax.legend()
    ax.grid(True)
    plt.show()

    # Calculate and log model accuracy, here MAPE on training data
    overnightstaysmonthly_df[["ds"]] = overnightstaysmonthly_df[["ds"]].apply(pd.to_datetime)
    actualsandpredicted_df = overnightstaysmonthly_df[['ds', 'y']].merge(forecast_df[['ds', 'yhat']], on='ds', how='inner', suffixes=('_left', '_right'))
    prophet_mape = mean_absolute_percentage_error(actualsandpredicted_df['y'], actualsandpredicted_df['yhat'])
    mlflow.log_metric("mape", prophet_mape)

    # Log the chart in the Unity Catalog / Experiment
    mlflow.log_figure(fig, "Forecast.png") 

    # Log the size of the training dataset
    mlflow.log_metric("rowcount_training", overnightstaysmonthly_df.shape[0])

 

Run the above code and we see our prediction!

410 forecast.gif

 

To persist the forecast, save it as a DeltaTable. From here the data can be made available to SAP Datasphere and SAP Analytics Cloud. That part is outside the scope of this tutorial; please check the documentation.

HINT: In case the saving fails with the message "PERMISSION_DENIED", then another user has already created that table and you are not allowed to overwrite it. Just change the name of the table you are creating to proceed. 

forecast_sdf = spark.createDataFrame(forecast_df)
forecast_sdf.write.mode("overwrite").saveAsTable("workspace.default.overnightstays_forecast")

 

The forecast is saved as a DeltaTable, which means you can see it in the Catalog!

420 forecast in catalog.gif
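If you prefer to verify this from a notebook cell instead of the Catalog UI, you can simply read the table back (adjust the table name if you had to pick a unique one):

# Read the persisted forecast back from the Catalog and show a few rows
saved_forecast_sdf = spark.read.table("workspace.default.overnightstays_forecast")
display(saved_forecast_sdf.select("ds", "yhat", "yhat_lower", "yhat_upper").limit(10))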

 

The information that was logged during the training is now available in the Experiments section. The model's MAPE (Mean Absolute Percentage Error) is at 1.4%, 41 records were used during training, and we can see the chart that compares actuals to predictions. Especially when trying out different settings for the Machine Learning algorithm, this tracking becomes very useful, for instance to select which configuration to use for the prediction.

430 catalog.gif
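The logged runs can also be compared programmatically from a notebook cell, for instance to pick the configuration with the lowest error. A minimal sketch (by default the search resolves to the notebook's own experiment; adjust the metric names if yours differ):

import mlflow

# List the logged runs of this notebook's experiment, best MAPE first
runs_df = mlflow.search_runs(order_by=["metrics.mape ASC"])
display(runs_df[["run_id", "metrics.mape", "metrics.rowcount_training", "start_time"]])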

 

Schedule

Everything we need for our demand forecast is implemented and we can run it manually. That means we are also ready to have it automated with a schedule. Straight from the notebook you can define the scheduling logic, i.e. the recurring interval and whether you want any notifications. The notebooks can also be triggered through APIs if you prefer.

500 schedule.gif
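For the API route mentioned above, here is a minimal sketch using the standard Databricks Jobs REST API (the host, token and job ID are placeholders; whether this API is exposed in your SAP Databricks environment is something to check against the documentation):

import requests

# Placeholders: your workspace URL, a personal access token and the job ID of the scheduled notebook
DATABRICKS_HOST = "https://<your-workspace-host>"
DATABRICKS_TOKEN = "<your-token>"
JOB_ID = 123456789

# Trigger an ad-hoc run of the job that wraps the forecast notebook
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json={"job_id": JOB_ID},
)
response.raise_for_status()
print("Triggered run:", response.json()["run_id"])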

 

Good to know

A few things that might be good to know.

 

Base file

In the above code the Python package Prophet is installed directly from within a Notebook. That's ok, but not ideal for ongoing use: if you repeatedly re-run all cells of the Notebook, Python will try to re-install the package every time. It's generally better to centrally specify the packages you know you will be using long term, which also ensures consistency across projects and developers. In SAP Databricks you can specify such reusable / shareable lists of Python packages through Base environments.

Create a Base environment by creating a file called base_env_prophet.yaml with this content:

dependencies:
  - prophet==1.1.7

600 create base env file.gif 

With this base file in place, you can

  • Step 1: Remove the pip install command from the "020 Demand forecast" Notebook

660 1 Delete cells.gif

 

  • Step 2: Specify that the Notebook will use the base file instead to determine which packages to install. Here you can also select the Environment version. These environment versions are documented here. After selecting a different base file or after changing the environment version you need to hit "Apply" and confirm for the change to take effect.

660 2 Apply base file.gif

 

To test that the package is correctly installed through the base file, clear the environment as shown in the screencam and run the Python code in the cells.

620 run with base file.gif
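As an additional sanity check you can print the installed version in a Code cell; if the base environment was applied correctly, this should return 1.1.7 without any pip install in the Notebook (assuming the package exposes its version attribute):

import prophet

# Confirms that the package comes from the base environment, not from a notebook-level install
print(prophet.__version__)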

 

Lineage & Table usage insights

Now that the table "overnightstays" has been used a few times, check out its lineage, which shows where the table is used, together with statistics on how heavily it is used. 

550 lineage and insights.gif

 

Visuals

From a distance, most Notebooks look very much alike. I heard some great feedback that a visual / image at the top of a notebook helps to differentiate them. In the repository you find a few examples, in case you would like to use those to distinguish between Data Exploration, AI/ML Sandboxing and AI/ML Deployment.

Upload any images that you want to use into the Workspace. To keep things tidy I am creating a separate folder for them.

650 image upload.gif

 

These images can then be shown in a Markdown cell.

660 image display.gif

 

Summary

SAP Databricks is adding a dedicated Data Science environment to the SAP landscape. After this hands-on experience you hopefully have a first feel for how a Data Scientist can enrich the data in SAP Business Data Cloud with Machine Learning.

For your own projects, you will want to check out the SAP Databricks documentation.

 

I hope you enjoyed getting hands-on with SAP Databricks!

12 Comments
Vitaliy-R
Developer Advocate

Thank you, @AndreasForster and @stojanm! I was able to complete the main flow (without "Good to know" for now).

In a Basic Trial, the schema `workspace.default` seems to be shared, so I had to give a unique table name `overnightstays_witalij` to avoid authorization issues for the existing (but not visible) `overnightstays` table.

FYI, I tried cloning your git repo https://github.com/SAP-samples/mee-samples.git into the workspace, instead of downloading and uploading individual files. It worked fine with notebooks, but I did not manage to create a table from the CSV file. I still need to look at how to do that 🤓

For the environment, I decided to try using Dependencies -> Added -> Apply, and it worked well:
VitaliyR_0-1754423849038.png

But I had to include the `mlflow` Python package as well, not just `prophet`.

Once again, thank you for sharing this!!

andreapistore
Discoverer

Thanks a lot. Really helpful and clear!

AndreasForster
Product and Topic Expert

Many thanks @Vitaliy-R, good to know that the data might need to be uploaded with different table names in the trial. Did you receive a specific message when trying with "overnightstays"? I can then add a heads-up in the text.

When using "Environment version 2" the mlflow package is already pre-installed. I have added a hint to that step. This small update is currently in review with the editors of the AI community and should hopefully be released soon.

thorsten_fueg
Discoverer

Hi @AndreasForster,

nice hands-on. Thanks for sharing!

Can the hands-on be done with the following sizing? 😁

thorsten_fueg_0-1755867509330.png

Is there any information available on useful sizing recommendations for SAP Databricks?

Regards
Thorsten 

 

AndreasForster
Product and Topic Expert

Thank you @thorsten_fueg !
There is an SAP Databricks Sizing Calculator, but that tool still needs additional guidance. You could check with your SAP account team, for instance, to go through this together: https://www.databricks.com/product/pricing/sap-databricks-pricing-calculator 
Implementing the tutorial might require more Capacity Units than shown in your screenshot; the heavier your use of SAP Databricks, the more Capacity Units are needed. But you can also implement this example with the free trial: https://www.sap.com/products/data-cloud/trial.html 

VikasParmar055
Explorer

Thanks @AndreasForster  for the detailed walkthrough and code examples. This real-world use case really clarifies how Databricks and SAP can collaborate effectively in AI/ML scenarios. Super helpful for deepening our understanding!

Thanks

Vikas

NancyLu
Product and Topic Expert

Thanks for sharing the deep-dive hands on for DBX. Really helpful for beginners.

Great work as always! 🙂@AndreasForster @stojanm

Doug
Product and Topic Expert

Bravo Andreas & Stojan... 

RL09
Discoverer

hi @AndreasForster , 

getting this error when applying the environment settings

Error applying: Environment version from base environment is enforced but undefined/invalid: None

Any thoughts,

BR.. Raj

AndreasForster
Product and Topic Expert

Hello @RL09, please try with environment version 2. Databricks has introduced newer environment versions after the blog was created, which might affect the outcome.

AndreasForster_0-1756744823406.png

 

RL09
Discoverer

hi @AndreasForster , excellent, thank you, that worked 🙂 I was trying with version 4 before, but that didn't do the trick.
And thank you for the very organized training doc, easy to follow by anybody 🙂

BR Raj

datdd
Discoverer

hi @AndreasForster , 
thanks for your work,
I have an issue when training the ML model for the prediction chart:
'Prophet' object has no attribute 'stan_backend'
It fails at this line of code: 

model = Prophet()
Do you have any advice? Thank you so much,
DatDD