1. Introduction

Explainability in time series forecasting aims to provide insight into how time series models arrive at their predictions, which is crucial for building trust in AI models, especially in high-stakes fields like finance and healthcare. Following up on our previous blog post Exploring ML Explainability in SAP HANA PAL – Classification and Regression, in this post we dive into the explainability of SAP HANA Predictive Analysis Library (PAL) time series functions.

Here, you will gain key insights, including:

  • what explainability means for time series, from both the global and the local perspective
  • which explainability features are embedded in the PAL time series algorithms
  • a hands-on example showcasing global and local explainability with the Additive Model Time Series Analysis (AMTSA) algorithm

2. Explainability in Time Series

When discussing time series, we typically refer to both univariate and multivariate types. A univariate time series consists of a single sequence of data points observed over time, while a multivariate time series involves multiple interdependent sequences, including both endogenous and exogenous variables. In the field of machine learning, time series forecasting is often approached as a regression problem, using feature engineering techniques such as lag features and rolling-window statistics to capture the temporal dynamics of the data. By doing so, we can apply tools and frameworks from regression analysis to time series data, facilitating the analysis of exogenous-variable explainability.

Explainability in time series is a broad concept that encompasses a variety of methods and considerations aimed at making the behavior of time series data and the predictions made by time series models understandable and interpretable. In this article, we will describe explainability from the perspective of Global vs. Local. The global level refers to understanding the overall model behavior, while at the local level, the analysis focuses on the impact of specific features on individual predictions.

Global Explainability encompasses two main aspects: decomposition and the explainability of exogenous variables. Decomposition involves breaking down a time series into several fundamental components like trend, seasonality, and residual fluctuations, which helps in understanding the structure and behavior of the time series. Considering decomposition, some models are inherently more interpretable due to their structure. For example, Bayesian Structural Time Series (BSTS) models provide a decomposition of the time series into trend, seasonal, residual, and regression components, offering a clear understanding of the underlying patterns. Bayesian Change Point Detection (BCPD) also decomposes a time series into trend, season, and random components with change points within both trend and season parts. 

Furthermore, the Global Explainability of exogenous variables is crucial for gaining insights into their impact on model predictions. Some tools and frameworks used in regression tasks, such as permutation importance, can be adapted for time series analysis. Permutation importance assesses the importance of exogenous variables by measuring the decrease in the model's performance when their values are randomly permuted.
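To make the mechanism concrete, here is a minimal, generic sketch of permutation importance in plain Python/NumPy. It is not PAL's implementation; model, metric, and the arrays are placeholders:

import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    # Baseline error of the fitted forecaster on unmodified data.
    rng = np.random.default_rng(seed)
    base_error = metric(y, model.predict(X))
    importance = {}
    for col in range(X.shape[1]):
        errors = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link to the target.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            errors.append(metric(y, model.predict(X_perm)))
        # The increase in error measures how much the model relies on this column.
        importance[col] = float(np.mean(errors)) - base_error
    return importance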

Local Explainability of exogenous variables, on the other hand, pertains to the contribution of exogenous variables to each specific prediction. Local methods like Shapley Values offer granular explanations for individual predictions by quantifying the impact of each exogenous variable on the output.
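For linear models, Shapley values have a simple closed form, which underlies the LinearSHAP method that PAL applies later in this post. A minimal sketch (the names are illustrative):

import numpy as np

def linear_shap(weights, x, background_mean):
    # For a linear model f(x) = b + w.x, the Shapley value of feature i is
    # phi_i = w_i * (x_i - E[x_i]); the phi_i sum to f(x) - E[f(x)].
    return np.asarray(weights) * (np.asarray(x) - np.asarray(background_mean))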

However, there are many challenges inherent in time series explainability, such as the complexity of capturing nonlinear relationships and the trade-off between model accuracy and interpretability. Some models, particularly deep learning models like LSTM (Long Short-Term Memory networks), are often seen as "black boxes" due to their complexity, making it hard to understand how they arrive at predictions.

 

3. PAL Time Series Explainability

SAP HANA Predictive Analysis Library (PAL) offers a rich set of time series algorithms along with embedded explainability features. For instance, PAL encompasses ARIMA and its extensions (ARIMAX, SARIMA, Vector ARIMA), Unified Exponential Smoothing (single/double/triple/auto), Additive Model Time Series Analysis (AMTSA), and a spectrum of influential neural networks such as Long-Term Series Forecasting (LTSF, including DLinear, NLinear, XLinear), Attention, and Long Short-Term Memory (LSTM) networks, among others. In the subsequent paragraphs, we will explore the various time series algorithms in PAL through the lens of Global versus Local explainability.

Global Explainability first involves the global decomposition of components within the following algorithms:

  • AMTSA: decomposes trend, seasonality, and holidays; offers a detailed breakdown of seasonality (yearly, weekly, daily, or customized patterns) and holiday effects.
  • ARIMA: decomposes trend, seasonality, transitory, and irregular components; applies filtering methods from digital signal processing for the decomposition.
  • Bayesian Structural Time Series (BSTS): decomposes trend, seasonality, and residuals; utilizes Bayesian methods to estimate the parameters and model components.
  • Bayesian Change Point Detection (BCPD): decomposes trend, seasonality, and random components; identifies change points within trend and seasonality, adapting to shifts in the data.
  • LTSF: decomposes trend and seasonality; specifically designed for long-term forecasting.
  • Unified Exponential Smoothing: decomposes level, trend, and seasonality; identifies components based on the various exponential smoothing functions.

Global explainability of exogenous variables in PAL is further obtained through Permutation Importance for Time Series, a method applicable across a range of time series models including ARIMA/Auto ARIMA, BSTS, AMTSA, and LTSF (XLinear only). In addition, Random Decision Trees (RDT) serve as a model-free approach to obtaining the feature importance of exogenous variables.

Local Explainability of exogenous variables pertains to the analysis of individual predictions. Depending on the characteristics of the various algorithms, PAL offers a suite of appropriate methods to provide interpretability, assisting users in obtaining explainable forecasts. The methods used for the various algorithms are as follows.

  • ARIMA / Auto ARIMA / AMTSA: To interpret the regressor part, we have adopted the well-known LinearSHAP algorithm, which generates the contribution of each exogenous feature to the forecasted values.
  • LTSF (XLinear): The concept of reference data is used here to interpret the exogenous part. In the business domain, reference data often exists to denote the default behaviors or baseline values of exogenous variables. Specifically, the contribution of an exogenous variable is zero if its value equals its reference value, which provides an extra view of the relative impact of exogenous variables. Another prominent advantage of reference data is its better capability in explaining categorical features: unlike with continuous variables, there is no intermediate value of a categorical variable. With reference data, we can start from a certain base as the reference and calculate the contribution relative to it (see the sketch after this list).
  • BSTS: This algorithm is distinguished by its integration of Bayesian methods with time series analysis, allowing for the decomposition of the regressor part.
  • Attention: Two methods are provided for interpreting exogenous variables. The first leverages the Attention mechanism's inherent feature, using attention weights to assign contributions along the time dimension. The second employs BSTS as a surrogate model to evaluate feature-wise contributions.
  • LSTM: PAL utilizes BSTS as a surrogate model to obtain interpretability for the predictions made by LSTM models, thereby revealing how the model makes specific predictions based on the input data.
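The reference-data idea can be sketched in a few lines; the weights and the one-hot handling below are illustrative, not PAL's implementation:

import numpy as np

def contributions_vs_reference(weights, x, reference):
    # Each exogenous variable contributes w_i * (x_i - ref_i), which is
    # exactly zero when the observed value equals its reference value.
    return np.asarray(weights) * (np.asarray(x) - np.asarray(reference))

def one_hot(value, categories):
    # For a categorical variable, encode both the observation and the
    # reference category as one-hot vectors; the contribution then
    # vanishes exactly when the observed category matches the reference.
    return np.array([1.0 if value == c else 0.0 for c in categories])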

 

4. Explainability Example

4.1 Use case and data

In this section, the publicly accessible Beijing PM2.5 time series dataset, which records fine particulate matter air pollution, is used as a case study to explore HANA PAL ML explainability (PM2.5 consists of airborne particles with aerodynamic diameters of less than 2.5 μm). This hourly dataset, comprising 43,824 rows and 11 columns, includes PM2.5 measurements taken at the U.S. Embassy in Beijing. All source code examples use the Python Machine Learning Client for SAP HANA (hana-ml). Please note that the example code in this section is solely intended to aid the understanding and visualization of ML explainability in SAP HANA PAL; it is not for productive use.

In the following paragraphs, we will utilize the Additive Model Time Series Analysis (AMTSA) algorithm (comparable to the well-known “prophet” algorithm) to showcase the global and local methods PAL offers for model explainability.

First, we imported the dataset from a CSV file into a table in a HANA instance. During the preprocessing stage, the four columns representing year, month, day, and hour were combined into a single 'date' column, and rows with missing values were handled. The dataset was then restructured into the following 9 columns (a preprocessing sketch follows the list).

  • date: The timestamp of the record
  • pollution: PM2.5 concentration (μg/m³)
  • dew: Dew Point
  • temp: Temperature
  • press: Pressure (hPa)
  • wnd_dir: Combined wind direction
  • wnd_spd: Cumulative wind speed (m/s)
  • snow: Cumulative hours of snow
  • rain: Cumulative hours of rain
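A minimal sketch of this preprocessing is shown below; the file name, the connection details, and the raw UCI column names are assumptions:

>>> import pandas as pd
>>> from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
>>> pdf = pd.read_csv('PRSA_data.csv')   # hypothetical file name
>>> # Combine year/month/day/hour into a single timestamp column
>>> pdf['date'] = pd.to_datetime(pdf[['year', 'month', 'day', 'hour']])
>>> pdf = pdf.rename(columns={'pm2.5': 'pollution', 'DEWP': 'dew', 'TEMP': 'temp',
                              'PRES': 'press', 'cbwd': 'wnd_dir', 'Iws': 'wnd_spd',
                              'Is': 'snow', 'Ir': 'rain'})
>>> pdf = pdf[['date', 'pollution', 'dew', 'temp', 'press',
               'wnd_dir', 'wnd_spd', 'snow', 'rain']].dropna()
>>> # Persist the result as a HANA table and obtain a HANA dataframe
>>> cc = ConnectionContext(address='<host>', port=30015, user='<user>', password='<password>')
>>> df = create_dataframe_from_pandas(cc, pdf, table_name='PM25_DATA', force=True)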

To simplify the use case and make it more manageable for demonstration purposes, we selected the first 1000 instances and named this subset df_1000. This subset was then divided into a training HANA dataframe, df_train, which contains 990 instances, and a testing HANA dataframe, df_test, which contains the remaining 10 instances. The first five rows of df_train are displayed in Figure 1.

Figure 1. The first five rows of df_train
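The subsetting itself can be done with standard hana-ml DataFrame operations; a minimal sketch, reusing the df dataframe from the preprocessing step above:

>>> df_1000 = df.sort('date').head(1000)    # first 1,000 instances
>>> df_train = df_1000.head(990)            # first 990 rows for training
>>> df_test = df_1000.sort('date', desc=True).head(10).sort('date')   # remaining 10 rows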

 

Considering that air quality can be influenced by holidays, for example through changes in vehicle travel and industrial production, we created a separate DataFrame named df_holiday to indicate the dates of weekends and national holidays. Note that the granularity of the dates must align with that of df_train, and the holiday information must also cover the forecast horizon of df_test. Currently, it includes two types of holidays: 'New Year’s Day' and 'weekend'. The first five rows of df_holiday are displayed in Figure 2.

Figure 2. The first five rows of df_holiday
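Illustratively, such a holiday table can be assembled on the client and uploaded; the concrete date range, the table name, and the two-column layout (timestamp plus holiday name) are assumptions for this sketch:

>>> import pandas as pd
>>> from hana_ml.dataframe import create_dataframe_from_pandas
>>> # Hourly timestamps matching df_train's granularity (illustrative range)
>>> idx = pd.date_range('2010-01-01', periods=1010, freq='h')
>>> rows = [(t, "New Year's Day") for t in idx if t.month == 1 and t.day == 1]
>>> rows += [(t, 'weekend') for t in idx if t.dayofweek >= 5]
>>> holiday_pdf = pd.DataFrame(rows, columns=['date', 'NAME'])
>>> df_holiday = create_dataframe_from_pandas(cc, holiday_pdf,
                                              table_name='PM25_HOLIDAY', force=True)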

 

4.2 Fitting an AMTSA Model

Then, we instantiate an AMTSA object named 'amf' and train a model with df_train and df_holiday.

>>> from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
>>> amf = AdditiveModelForecast()
>>> amf.fit(data=df_train, key='date', endog='pollution',
                  exog=['dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain'],
                  categorical_variable=['wnd_dir'], holiday=df_holiday)

Next, the predictions are generated using the predict() function, which provides both the forecasted values and the 80% uncertainty interval.

>>> forecast = amf.predict(data=df_test, key='date')
>>> forecast.head(5).collect()

Figure 3. The Forecasted PM2.5 Concentrations with 80% Uncertainty Interval

We also provide a forecast_line_plot() function to visually represent the forecast alongside the actual data (as shown in Figure 4), providing a clear comparison of the model's predictive performance and the associated uncertainty.

>>> from hana_ml.visualizers.visualizer_base import forecast_line_plot
>>> forecast_line_plot(pred_data=forecast, actual_data=df_test.select(['date', 'pollution']),
                                    confidence=("YHAT_LOWER", "YHAT_UPPER"))

Figure 4. The Forecasted and Real Values

    

4.3 Global Explainability of Exogenous Variables

The importance of exogenous variables can easily be obtained through the get_permutation_importance() method, which employs the model associated with the object (in our case, the attribute 'amf.model_'). This method offers a suite of metrics, including MSE (the default), RMSE, MPE, and MAPE, to evaluate the impact of permuting the exogenous variables on the model's predictive accuracy. The resulting permutation importance values are displayed in Figure 5.

>>> res = amf.get_permutation_importance(data=df_test)
>>> res.collect()

Figure 5. The Importance of Exogenous Variables with AMTSA model

The result in Figure 5 reveals that temperature (temp), dew point (dew), and atmospheric pressure (press) are the three factors with the most significant influence on PM2.5 concentration levels within the context of the default model.

In addition to this, we explored a model-free approach by setting model='rdt' in the call. This approach leverages Random Decision Trees to obtain the importance of the exogenous variables, and the outcomes are presented in Figure 6. For this analysis, we used the df_1000 dataset, which was divided into distinct training and testing subsets during the permutation importance computation.

>>> res_rdt = amf.get_permutation_importance(data=df_1000, model='rdt')
>>> res_rdt.collect()

Figure 6. The Importance of Exogenous Variables with RDT model

Interestingly, this model-free method identified dew point (dew), wind speed (wnd_spd), and wind direction (wnd_dir) as the three key factors. When comparing the importance rankings of exogenous variables derived from these two models, it is crucial to recognize that the permutation importance method does not offer a direct measure of causality. Instead, it indicates the extent to which the model's predictions are sensitive to variations in the input variables, in our case the exogenous variables.

 

4.4 Global Decomposition

As AMTSA utilizes a decomposable time series model consisting of three principal components - trend, seasonality, and holidays - it is possible to provide a detailed breakdown of the contributions of each component and of the exogenous variables (shown in Section 4.5) to every forecasted value.

To enable this feature within the predict() function, you simply need to set the parameter show_explainer to True. This will populate an attribute named explainer_ containing the trend, seasonality, and holiday components as well as the individual contributions of any exogenous variables. Furthermore, if you wish to delve deeper into the decomposition of the seasonal and holiday components, the model offers two additional parameters: decompose_seasonality and decompose_holiday. By setting these parameters to True, you can obtain a more detailed analysis of how these components individually influence the forecast.

>>> res = amf.predict(data=df_test, key='date', show_explainer=True,
                                   decompose_seasonality=True, decompose_holiday=True)
>>> print(amf.explainer_.head(5).collect())

Figure 7. The amf.explainer_ Dataframe

In the 'amf.explainer_' DataFrame shown in Figure 7, a specific prediction is decomposed into TREND, SEASONAL, and HOLIDAY columns, plus an 'EXOGENOUS' column that presents the impact of each external variable in JSON format. The HOLIDAY contribution registers zero, which may be attributed to Beijing not being a heavily industrial city and to the strict traffic management measures in place during peak work hours.
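To work with these contributions programmatically, the JSON column can be flattened on the client; a sketch, where the 'attr' and 'val' keys are an assumption modelled on PAL's reason-code format, and the 'date' and 'EXOGENOUS' column names follow the text:

>>> import json
>>> expl = amf.explainer_.collect()   # fetch the explainer table to the client
>>> # Turn each JSON string into a {variable: contribution} dictionary
>>> # ('attr'/'val' keys are assumed)
>>> expl['contrib'] = expl['EXOGENOUS'].apply(
        lambda s: {d['attr']: d['val'] for d in json.loads(s)})
>>> print(expl[['date', 'contrib']].head())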

For a more user-friendly examination of 'amf.explainer_', we provide a Time Series Report that includes an array of graphs related to time series analysis and the model itself. The report includes an Explainability page that displays the associated graphs for the 'amf.explainer_' DataFrame. The following code demonstrates how to generate a Time Series Report; a plot of the seasonality component within the Explainability page is depicted in Figure 8.

>>> from hana_ml.visualizers.unified_report import UnifiedReport
>>> amf.build_report()
>>> UnifiedReport(amf).display()

Figure 8. The Seasonality Component with daily and weekly segments

 

4.5 Local Explainability of Exogenous Variables

AMTSA employs LinearSHAP as the method for calculating the contribution of exogenous variables, as is evident in the 'EXOGENOUS' column of Figure 7. On the Explainability page of the Time Series Report, a force plot of the exogenous variables offers a clear visualization of the influence of individual variables on a specific prediction. By clicking on the '+' sign in front of each row, you can expand the detailed force plot for that particular prediction, as shown in Figure 9.

Figure 9. Force Plot of Exogenous Variables

Taking the first row of the prediction data in Figure 9 as an example, it is evident that when the temperature is minus 9 degrees Celsius, it increases the PM2.5 index, with a contribution value of approximately 13.19. Similarly, a wind direction of 'NW' also contributes to an increase in the PM2.5 index. Conversely, dew point and air pressure have negative effects, reducing the PM2.5 index.

In addition to the force plot, the contributions of the exogenous variables to the predictions are also presented as a bar chart within a single figure, as shown in Figure 10. This graph is designed to provide users with a clearer and more immediate overview, and it is evident that ‘dew’ and ‘press’ have the most significant negative impact.

Figure 10. Bar Plot of Exogenous Variables

If we revisit the results from the global permutation importance method (top 3: temp, dew, press), we find that there can be significant differences even among the top-ranking variables. This is because the global permutation importance method offers a comprehensive overview of variable importance across all predictions. In contrast, the local LinearSHAP method provides a granular, prediction-specific analysis, revealing the intricate dynamics of external influences on individual forecasts. Therefore, PAL provides both global and local methods for its time series forecasting algorithms, assisting users in gaining a deeper understanding of their models' behavior and of the factors influencing their predictions.

 

5. Summary

In this post, we have explored the concept of machine learning explainability for time series, utilizing the SAP HANA Predictive Analysis Library (PAL). We also demonstrated a use case with the Python API (hana-ml), building an Additive Model Time Series Analysis model on a publicly available multivariate time series dataset. Our aim was to showcase the extensive capabilities of PAL in time series explainability, which extend beyond mere forecasting to encompass embedded explainability features designed to clarify model behavior and enable businesses to make more informed and confident decisions.

 

Other Useful Links:

Install the Python Machine Learning client from the PyPI public repository: hana-ml

HANA Predictive Analysis Library Documentation

For other blog posts on hana-ml: 

  1. Global Explanation Capabilities in SAP HANA Machine Learning
  2. Exploring ML Explainability in SAP HANA PAL – Classification and Regression
  3. Fairness in Machine Learning - A New Feature in SAP HANA Cloud PAL
  4. A Multivariate Time Series Modeling and Forecasting Guide
  5. Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA
  6. Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA
  7. Anomaly Detection in Time-Series using Seasonal Decomposition
  8. Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA
  9. Learning from Labeled Anomalies for Efficient Anomaly Detection
  10. Time-Series Modeling and Analysis using SAP HANA Predictive Analysis Library (PAL)
  11. Import multiple excel files into a single SAP HANA table
  12. COPD study, explanation and interpretability with Python machine learning client for SAP HANA
  13. Model Storage with Python Machine Learning Client for SAP HANA
  14. Identification of Seasonality in Time Series with Python Machine Learning Client for SAP HANA