Introduction
If you are a data scientist or work in data analytics, you probably spend a lot of time in Jupyter notebooks. It is a convenient interactive tool that is widely adopted across the industry. You typically perform a wide variety of tasks there, from data exploration to model training. Data scientists often run a Jupyter environment on their local machines and access it via localhost. This works perfectly fine until you need to work as a team on a long-running project. In that case it is more efficient to have a shared, scalable environment that provides computational and storage capabilities along with preinstalled tools.
In this blog post I am going to share the experience our team has gained while working on diverse data science projects. In our team we use SAP Data Intelligence, a platform that provides multiple tools for data processing, including a pipeline modeler, an integrated Jupyter environment, a data lake and many more. Instead of diving deep into theory, I am going to present a practical example of building a simple price forecast model. So, prepare to get your hands dirty…
Avocados
Let’s suppose you are a trader interested in forecasting avocado prices. For example, you would like to predict them two weeks ahead. Fortunately, there is an open dataset, so we can analyze it and train a regression model. Let’s start by logging into SAP Data Intelligence and opening ML Scenario Manager.
Then create a new scenario for your project: just press the “+” button and enter a name.
I usually do the data exploration and preparation tasks in a Jupyter notebook, which provides a convenient, interactive Python environment. SAP Data Intelligence has it already in place. Just navigate to the Notebooks section and press the “+” button to start a new Jupyter environment. Enter a name and click Create.
To start with data exploration, you first need to bring the avocado dataset into the working environment. The original file “avocado.csv” is quite small. You can just download it
here and then upload it to JupyterLab by clicking the “Upload Files” button.
Now the CSV file is in the user environment. However, in a real project it is necessary to persist experiment results and share them with the rest of the team. The recommended way of doing this in SAP Data Intelligence is to create an artifact. You can easily do it from a notebook using the “sapdi” library.
import sapdi
from sapdi.artifact.artifact import ArtifactKind

# Register the dataset as an artifact attached to the current ML scenario
artifact = sapdi.create_artifact(description="Avocado prices dataset",
                                 artifact_name="Avocado",
                                 artifact_kind=ArtifactKind.DATASET)

# Copy the local CSV file into the artifact storage
with artifact.add_file(file_name='avocado.csv').get_writer() as dl_file:
    dl_file.write(open("avocado.csv").read())
Now the newly created artifact is attached to the scenario and can be accessed by its ID. You can switch back to the Scenario Manager and find the dataset there.
To start with data exploration, I suggest loading the data into a pandas DataFrame. The library is already preinstalled, so you can import it into the notebook and read the data directly from the artifact object.
import io
import pandas as pd

# Retrieve the artifact by its ID and read the CSV content into a DataFrame
artifact = sapdi.get_artifact(artifact_id=artifact.artifact_id)
file_handler = artifact.open_file(file_name='avocado.csv')
avocado_df = pd.read_csv(io.BytesIO(file_handler.read()))
The dataset is visualized in the picture above. In this example we aim to train a model that predicts the average price across the whole US. Thus, you can remove the redundant columns and aggregate the dataset by date. Additionally, you need to fix the data types and extract date components such as year, month, etc.
# Keep only conventional avocados and average the price across all regions per date
filter = avocado_df['type'] == 'conventional'
prices_df = avocado_df.where(filter).groupby('Date').agg({'AveragePrice': 'mean'}).reset_index()

# Fix the data type and extract date components as separate features
prices_df['Date'] = pd.to_datetime(prices_df['Date'], format='%Y-%m-%d')
prices_df['Weekday'] = prices_df.apply(lambda row: row['Date'].weekday(), axis=1)
prices_df['Month'] = prices_df.apply(lambda row: row['Date'].month, axis=1)
prices_df['Day'] = prices_df.apply(lambda row: row['Date'].day, axis=1)
prices_df['Timestamp'] = prices_df.apply(lambda row: row['Date'].timestamp(), axis=1)
prices_df['Year'] = prices_df.apply(lambda row: row['Date'].year, axis=1)
prices_df = prices_df.sort_values(by=['Date'])
The resulting data frame shows that all entries fall on Sundays, on a weekly basis. This means that the Weekday column is not informative and could be removed as well. Now let’s visualize the data to see the price trend using the “matplotlib” library, which is also preinstalled in JupyterLab.
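A minimal sketch of such a plot, assuming the prices_df prepared above (the exact chart styling is up to you), could look like this:

import matplotlib.pyplot as plt

# Plot the weekly average price over time
plt.figure(figsize=(12, 4))
plt.plot(prices_df['Date'], prices_df['AveragePrice'])
plt.xlabel('Date')
plt.ylabel('Average price, USD')
plt.title('Average price of conventional avocados in the US')
plt.show()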
There is not much data available, but looking at the chart we might assume that prices tend to increase towards the end of the year and decrease in early spring.
Training a model
The main goal of this project is to build a model that predicts the price two weeks ahead. This is a time series forecasting task, and a gradient boosting algorithm might be a good choice. First, let’s prepare the dataset. The model needs to know the previous price values. This can be achieved by introducing lag columns that contain the prices 2, 3 and 4 weeks back. The first four rows end up having null values, so you can just drop them.
# Add lag features: the average price 2, 3 and 4 weeks before the current date
shift2_df = prices_df.shift(2).loc[:, ['AveragePrice']].rename(columns={'AveragePrice': 'AvgPrice2'})
shift3_df = prices_df.shift(3).loc[:, ['AveragePrice']].rename(columns={'AveragePrice': 'AvgPrice3'})
shift4_df = prices_df.shift(4).loc[:, ['AveragePrice']].rename(columns={'AveragePrice': 'AvgPrice4'})
prices_shift_df = pd.concat([prices_df, shift2_df, shift3_df, shift4_df], axis=1, sort=False)

# Drop the first rows that contain null lag values
prices_shift_df = prices_shift_df.iloc[4:]
To train a model, the dataset needs to be split into train and test subsets. Let’s use the “numpy” library to do it. And yes, it’s also preinstalled, so just import it.
import numpy as np
# Fix the random seed and assign roughly 80% of the rows to the training set
np.random.seed(0)
split_mask = np.random.rand(len(prices_shift_df)) < 0.8
x_df = prices_shift_df.loc[:, ['Weekday', 'Month', 'Day',
                               'Timestamp', 'Year', 'AvgPrice2',
                               'AvgPrice3', 'AvgPrice4']]
y_df = prices_shift_df.loc[:, ['AveragePrice']]
x_train = x_df[split_mask]
x_test = x_df[~split_mask]
y_train = y_df[split_mask]
y_test = y_df[~split_mask]
There are many ways to train a gradient boosting model. In this example I suggest using “XGBRegressor” from the “xgboost” library. It provides an efficient training algorithm and exposes an “sklearn”-compatible API. Internally it depends on “sklearn”, so let’s install both libraries into the notebook.
!pip install --user scikit-learn==0.22.2.post1 xgboost==1.0.2
Finally, everything is ready for training.
from xgboost import XGBRegressor

# Train a gradient boosting regressor with early stopping on the evaluation sets
model = XGBRegressor(max_depth=10, n_estimators=1000, seed=0)
model.fit(x_train,
          y_train,
          eval_metric="rmse",
          eval_set=[(x_train, y_train), (x_test, y_test)],
          early_stopping_rounds=10)
The training process always involves some randomness. In my environment it ends up with 12 trees and a root mean square error of 0.07031. That looks like a low value, but it’s still hard to estimate how well the model performs. Let’s predict values for the whole existing timeline and, additionally, several months ahead. The plot below visualizes the forecast.
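The prediction over the existing timeline is essentially a single call to the trained model; a minimal sketch could look as follows. Extending the forecast beyond the dataset (the green line) additionally requires feeding the predicted prices back into the lag columns step by step.

import matplotlib.pyplot as plt

# Predict over the rows we already have lag features for
predicted = model.predict(x_df)

# Compare actual and predicted prices on the existing timeline
plt.figure(figsize=(12, 4))
plt.plot(prices_shift_df['Date'], prices_shift_df['AveragePrice'], label='Actual')
plt.plot(prices_shift_df['Date'], predicted, label='Predicted')
plt.legend()
plt.show()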
The original dataset is quite small, so it’s hard to get reliable results. The predictions made by the model (blue line) fit quite well, but this is the same time frame that was used for training, which means the model might be overfitting. The green line shows the forecast going a year ahead. It seems to follow the trend and looks quite promising. However, making a long-range forecast is a far more complicated task than the original two-week prediction requirement, so one should treat such results carefully.
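One quick way to check for overfitting is to compare the error on the training rows with the error on the held-out rows. A small sketch using the “sklearn” metric installed earlier (the exact numbers will vary between environments):

from sklearn.metrics import mean_squared_error

# Root mean square error on the training and test subsets
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))
print(f"Train RMSE: {train_rmse:.5f}, test RMSE: {test_rmse:.5f}")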
Generally, this model can be used as a baseline for further research. Let’s assume that this is a long-running team project that will have multiple iterations in the future. So let’s persist the baseline model and make it available to teammates. The recommended way of doing this in SAP Data Intelligence is to create a model artifact.
import pickle

# Register the trained model as a model artifact in the scenario
artifact = sapdi.create_artifact(description="Avocado forecast model",
                                 artifact_name="GBR_Model",
                                 artifact_kind=ArtifactKind.MODEL)

# Serialize the model and store it in the artifact
with artifact.add_file(file_name='model.pkl').get_writer() as dl_file:
    pickle.dump(model, dl_file)
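Later, a teammate could load the persisted model back from this artifact. Here is a sketch mirroring the artifact calls used above for the dataset; in practice the artifact ID would be looked up in ML Scenario Manager:

import pickle

# Retrieve the model artifact and restore the trained regressor
model_artifact = sapdi.get_artifact(artifact_id=artifact.artifact_id)
file_handler = model_artifact.open_file(file_name='model.pkl')
restored_model = pickle.loads(file_handler.read())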
Now the original dataset, the notebook and the model are persisted, and you can find them in Scenario Manager. Later, your colleagues can reproduce the results by accessing the dataset artifact and the notebook. Additionally, they can deploy the trained model using a model serving pipeline. To finish the experiment, just click the “Create Version” button in the top right corner and provide a description.
The provided example is just a small experiment for illustration purposes. Real life tends to bring much bigger tasks that require the collaborative work of multiple people. This article gives you some hints on how to use SAP Data Intelligence for working on large projects and solving complex challenges. I hope you find this blog post helpful. Please feel free to share your thoughts and ask questions.
P.S.:
Here you can find the original Jupyter notebook used in this article.