
Explainability in time series forecasting aims to provide insight into the decision-making of time series models, which is crucial for building trust in AI systems, especially in high-stakes fields like finance and healthcare. Following up on our previous blog post Exploring ML Explainability in SAP HANA PAL – Classification and Regression, in this post we dive into the explainability of SAP HANA Predictive Analysis Library (PAL) time series functions.
Here, you will gain key insights into global and local explainability concepts for time series, the explainability features embedded in PAL time series algorithms, and a hands-on example using the Python Machine Learning Client for SAP HANA (hana-ml).
When discussing time series, we typically refer to both univariate and multivariate types. A univariate time series consists of a single sequence of data points observed over time, while a multivariate time series involves multiple interdependent sequences, including both endogenous and exogenous variables. In the field of machine learning, time series forecasting is often approached as a regression problem. This involves leveraging feature engineering techniques such as lag features and rolling window statistics to capture the temporal dynamics of the data. By doing so, we can leverage tools and frameworks from regression analysis on time series data, facilitating the analysis of exogenous variable explainability.
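As a plain-Python sketch (not PAL code), the reframing above can look like this; the function name and the window sizes are illustrative choices:

```python
# Illustrative sketch: turning a univariate series into a supervised
# regression problem using lag features plus a rolling-mean statistic.

def make_features(series, n_lags=3, window=3):
    """Return (X, y): each row in X holds the last n_lags values
    and a rolling mean over the preceding `window` values; y is the
    value to forecast at that step."""
    X, y = [], []
    for t in range(max(n_lags, window), len(series)):
        lags = series[t - n_lags:t]                     # lag_1 .. lag_n
        roll_mean = sum(series[t - window:t]) / window  # rolling statistic
        X.append(list(lags) + [roll_mean])
        y.append(series[t])
    return X, y

X, y = make_features([10, 12, 11, 13, 14, 13, 15], n_lags=3, window=3)
# first feature row: the three lags [10, 12, 11] plus their mean 11.0
```

Once the series is in this (X, y) shape, any regression tooling, including its explainability methods, can be applied to it.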
Explainability in time series is a broad concept that encompasses a variety of methods and considerations aimed at making the behavior of time series data and the predictions made by time series models understandable and interpretable. In this article, we will describe explainability from the perspective of Global vs. Local. The global level refers to understanding the overall model behavior, while at the local level, the analysis focuses on the impact of specific features on individual predictions.
Global Explainability encompasses two main aspects: decomposition and the explainability of exogenous variables. Decomposition involves breaking down a time series into several fundamental components like trend, seasonality, and residual fluctuations, which helps in understanding the structure and behavior of the time series. Considering decomposition, some models are inherently more interpretable due to their structure. For example, Bayesian Structural Time Series (BSTS) models provide a decomposition of the time series into trend, seasonal, residual, and regression components, offering a clear understanding of the underlying patterns. Bayesian Change Point Detection (BCPD) also decomposes a time series into trend, season, and random components with change points within both trend and season parts.
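To make the idea of decomposition concrete, here is a minimal additive decomposition sketch in plain Python; it is illustrative only and is not how BSTS or BCPD estimate their components:

```python
# Minimal additive decomposition sketch: trend via a centered moving
# average, seasonality via per-phase means of the detrended series,
# residual as what remains. The seasonal period is assumed known.

def decompose(series, period):
    n = len(series)
    half = period // 2
    # centered moving-average trend (None where the window does not fit)
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        trend[t] = sum(window) / len(window)
    # seasonal component: mean detrended value per phase of the cycle
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    seasonal = [sum(b) / len(b) if b else 0.0 for b in buckets]
    # residual: observation minus trend and seasonal parts
    residual = [series[t] - trend[t] - seasonal[t % period]
                if trend[t] is not None else None for t in range(n)]
    return trend, seasonal, residual

# a flat series decomposes into a constant trend with zero seasonality
trend, seasonal, residual = decompose([5.0] * 8, period=2)
```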
Furthermore, the Global Explainability of exogenous variables is crucial for gaining insights into their impact on model predictions. Some tools and frameworks used in regression tasks such as permutation importance, can be adapted for time series analysis. Permutation importance assesses the importance of exogenous variables by measuring the decrease in the model's performance when the values of exogenous variables are permuted.
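A sketch of the permutation idea (model-agnostic, not PAL's implementation); here `predict` stands in for any fitted model's prediction function:

```python
import random

# Permutation importance sketch: measure the mean increase in MSE after
# repeatedly shuffling one exogenous column while leaving the rest intact.

def permutation_importance(predict, X, y, col, n_repeats=5, seed=0):
    rng = random.Random(seed)
    mse = lambda preds: sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)
    base = mse(predict(X))
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[:] for row in X]           # copy rows
        perm = [row[col] for row in shuffled]
        rng.shuffle(perm)                          # break the column's link to y
        for row, v in zip(shuffled, perm):
            row[col] = v
        drops.append(mse(predict(shuffled)) - base)
    return sum(drops) / n_repeats                  # mean error increase

# toy model that only looks at column 0
predict = lambda rows: [r[0] for r in rows]
X = [[1, 9], [2, 8], [3, 7], [4, 6]]
y = [1, 2, 3, 4]
imp_used = permutation_importance(predict, X, y, col=0)
imp_unused = permutation_importance(predict, X, y, col=1)  # exactly 0.0
```

Shuffling a column the model ignores leaves the error unchanged, so its importance is zero; shuffling a column the model relies on degrades the error, yielding a positive score.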
Local Explainability of exogenous variables, on the other hand, pertains to the contribution of exogenous variables to each specific prediction. Local methods like Shapley Values offer granular explanations for individual predictions by quantifying the impact of each exogenous variable on the output.
However, there are many challenges inherent in time series explainability, such as the complexity of capturing nonlinear relationships and the trade-off between model accuracy and interpretability. Some models, particularly deep learning models like LSTM (Long Short-Term Memory networks), are often seen as "black boxes" due to their complexity, making it hard to understand how they arrive at predictions.
SAP HANA Predictive Analysis Library (PAL) offers a rich set of selected time series algorithms along with embedded explainability features. For instance, PAL encompasses ARIMA and its extensions (ARIMAX, SARIMA, Vector ARIMA), Unified Exponential Smoothing (single/double/triple/auto), Additive Model Time Series Analysis (AMTSA), and a spectrum of influential neural networks such as Long-Term Series Forecasting (LTSF, including DLinear, NLinear, XLinear), Attention, Long Short-Term Memory (LSTM) networks, among others. In the subsequent paragraphs, we will explore the various time series algorithms in PAL through the lens of Global versus Local explainability.
Global Explainability first involves the decomposition of components within the following algorithms:
| Algorithm Name | Components Decomposed | Distinctive Features |
|---|---|---|
| AMTSA | trend, seasonality, holidays | offers a detailed breakdown of seasonality (yearly, weekly, daily, or customized patterns) and holiday effects |
| ARIMA | trend, seasonality, transitory, irregular | applies filtering methods from Digital Signal Processing for the decomposition |
| Bayesian Structural Time Series (BSTS) | trend, seasonality, residuals | utilizes Bayesian methods to estimate parameters and model components |
| Bayesian Change Point Detection (BCPD) | trend, seasonality, random | identifies change points within trend and seasonality, adapting to shifts in data |
| LTSF | trend, seasonality | specifically designed for long-term forecasting |
| Unified Exponential Smoothing | level, trend, seasonality | identifies components based on various exponential smoothing functions |
Global explainability of exogenous variables in PAL is further obtained through Permutation Importance for Time Series, a method applicable across a range of time series models including ARIMA/Auto ARIMA, BSTS, AMTSA, and LTSF (only for XLinear). In addition, Random Decision Trees (RDT) serve as a model-free approach to obtain the feature importance of exogenous variables.
Local Explainability of exogenous variables pertains to the analysis of individual predictions. Depending on the characteristics of the various algorithms, PAL offers a suite of appropriate methods to provide interpretability, assisting users in obtaining explainable forecasts. The methods used for the various algorithms are as follows.
In this section, a publicly accessible fine particulate matter air pollution Beijing PM2.5 time series dataset will be used as a case study to explore HANA PAL ML explainability (PM2.5 consists of airborne particles with aerodynamic diameters of less than 2.5 μm). This hourly dataset, comprising 43,824 rows and 11 columns, includes PM2.5 data taken from the U.S. Embassy in Beijing. All source code examples make use of the Python Machine Learning Client for SAP HANA (hana-ml). Please note that the example code used in this section is solely for better understanding and visualizing ML explainability in SAP HANA PAL; it is not intended for productive use.
In the following paragraphs, we will utilize the Additive Model Time Series Analysis (AMTSA) algorithm (comparable to the well-known "Prophet" algorithm) to showcase the global and local methods PAL offers for model explainability.
First, we imported the dataset from a CSV file into a table in a HANA instance. During the preprocessing stage, the four columns representing year, month, day, and hour were combined into a single 'date' column, and the rows with missing values were addressed. The dataset was then restructured into the following 9 columns.
To simplify the use case and make it more manageable for demonstration purposes, we selected the first 1000 instances and named this subset df_1000. This subset was then divided into a training HANA dataframe, df_train, which contains 990 instances, and a testing HANA dataframe, df_test, which contains the remaining 10 instances. The first five rows of df_train are displayed in Figure 1.
Figure 1. The first five rows of df_train
Considering that air quality can be influenced by holidays, for example through changes in vehicle travel and industrial production, we created a separate DataFrame named df_holiday to indicate the dates of weekends and national holidays. Please ensure that the granularity of the dates aligns with that of df_train and that the holiday information also covers the forecast horizon of df_test. Currently, it includes two types of holidays: 'New Year's Day' and 'weekend'. The first five rows of df_holiday are displayed in Figure 2.
Figure 2. The first five rows of df_holiday
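One possible way to assemble such a holiday table locally with pandas before loading it into HANA (for example via hana_ml's dataframe upload utilities, not shown here); the column names and values below are illustrative assumptions, not a required schema:

```python
import pandas as pd

# Sketch: build an hourly holiday table matching df_train's granularity.
# Columns 'ts' and 'NAME' are illustrative; align them with your own
# df_holiday layout before uploading to HANA.
dates = pd.date_range('2010-01-02', periods=5, freq='60min')
holiday_pd = pd.DataFrame({
    'ts': dates,                            # hourly timestamps
    'NAME': ["New Year's Day"] * 5,         # holiday label per timestamp
})
```

The key point is that every holiday timestamp must exist at the same (here hourly) granularity as the training data, and must extend into the period you intend to forecast.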
Then, we instantiate an AMTSA object named "amf" and train a model with df_train and df_holiday.
>>> from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
>>> amf = AdditiveModelForecast()
>>> amf.fit(data=df_train, key='date', endog='pollution',
exog=['dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain'],
categorical_variable=['wnd_dir'], holiday=df_holiday)
Next, the predictions are generated using the predict() function, which provides both the forecasted values and the 80% uncertainty interval.
>>> forecast = amf.predict(data=df_test, key='date')
>>> forecast.head(5).collect()
Figure 3. The Forecasted PM2.5 Concentrations with 80% Uncertainty Interval
We also provide a forecast_line_plot() function to visually represent the forecast alongside the actual data (as shown in Figure 4), providing a clear comparison of the model's predictive performance and associated uncertainty.
>>> from hana_ml.visualizers.visualizer_base import forecast_line_plot
>>> forecast_line_plot(pred_data=forecast, actual_data=df_test.select(['date', 'pollution']),
confidence=("YHAT_LOWER", "YHAT_UPPER"))
Figure 4. The Forecasted and Real Values
The importance of exogenous variables can easily be obtained through the get_permutation_importance() method, which uses the model stored in the object (here, the attribute 'amf.model_'). This method offers a suite of metrics, including MSE (the default), RMSE, MPE, and MAPE, to evaluate the impact of permuting the exogenous variables on the model's predictive accuracy. The code and the resulting permutation importance values are displayed in Figure 5.
>>> res = amf.get_permutation_importance(data=df_test)
>>> res.collect()
Figure 5. The Importance of Exogenous Variables with AMTSA model
The result in Figure 5 reveals that temperature (temp), dew point (dew), and atmospheric pressure (press) are the three factors with the most significant influence on PM2.5 concentration levels for the default model.
In addition to this, we explored a model-free approach by setting the parameter model='rdt' in the get_permutation_importance() call on the 'amf' object. This approach leverages Random Decision Trees to obtain the importance of exogenous variables, and the outcomes are presented in Figure 6. For this analysis, we used the df_1000 dataset, which is divided into distinct training and testing subsets during the permutation importance computation.
>>> res_rdt = amf.get_permutation_importance(data=df_1000, model='rdt')
>>> res_rdt.collect()
Figure 6. The Importance of Exogenous Variables with RDT model
Interestingly, this model-free method identified dew point (dew), wind speed (wnd_spd), and wind direction (wnd_dir) as the three key factors. When comparing the importance rankings of exogenous variables derived from these two models, it is crucial to recognize that the permutation importance method does not offer a direct measure of causality. Instead, it indicates the extent to which the model's predictions are sensitive to variations in the input variables, in our case the exogenous variables.
As AMTSA utilizes a decomposable time series model that consists of three principal components - trend, seasonality, and holidays - it is possible to provide a detailed breakdown of the contributions from each component and the exogenous variables (shown in Section 4.5) to every forecasted value.
To enable this feature within the predict() function, you simply need to set the parameter show_explainer to True. This returns an attribute named explainer_ that includes the trend, seasonality, and holiday components as well as the individual contributions of the exogenous variables. Furthermore, if you wish to delve deeper into the decomposition of the seasonal and holiday components, the model offers two additional parameters: decompose_seasonality and decompose_holiday. By setting these parameters to True, you can obtain a more detailed analysis of how these components individually influence the forecast.
>>> res = amf.predict(data=df_test, key='date', show_explainer=True,
decompose_seasonality=True, decompose_holiday=True)
>>> print(amf.explainer_.head(5).collect())
Figure 7. The amf.explainer_ DataFrame
In the 'amf.explainer_' DataFrame shown in Figure 7, a specific prediction is decomposed into TREND, SEASONAL, and HOLIDAY columns and an 'EXOGENOUS' column that presents the impact of each external variable in JSON format. The HOLIDAY contribution registers zero, which may be attributed to Beijing not being an industrial city and the implementation of strict traffic management measures during peak work hours.
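To work with the EXOGENOUS column programmatically, one option is to flatten its JSON strings into a tidy table; the sample JSON string below is made up for illustration, and the real keys and values come from your model's output:

```python
import json
import pandas as pd

# Sketch: flatten one cell of a JSON-valued contributions column into a
# small DataFrame. The structure shown here is a hypothetical example.
row = '[{"attr": "temp", "val": 13.19}, {"attr": "dew", "val": -5.42}]'
contrib = pd.DataFrame(json.loads(row))
# contrib now has one row per exogenous variable with its contribution
```

From such a table, sorting by the absolute contribution quickly surfaces the variables that drive a given prediction.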
For a more user-friendly examination of 'amf.explainer_', we provide a Time Series Report that includes an array of graphs related to time series analysis and the model itself. The report includes an Explainability page that displays associated graphs for the 'amf.explainer_' DataFrame. The following code demonstrates how to generate a Time Series Report, and a plot of the seasonality component within the Explainability page is depicted in Figure 8.
>>> from hana_ml.visualizers.unified_report import UnifiedReport
>>> amf.build_report()
>>> UnifiedReport(amf).display()
Figure 8. The Seasonality Component with daily and weekly segments
AMTSA employs linearSHAP as the method for calculating the contribution of exogenous variables, as evident in the 'EXOGENOUS' column of Figure 7. On the Explainability page of the Time Series Report, a force plot of the exogenous variables offers a clear visualization of the influence of individual variables on a specific prediction. By clicking on the '+' sign in front of each row, you can expand the detailed force plot for that particular prediction, as shown in Figure 9.
Figure 9. Force Plot of Exogenous Variables
Taking the first row of the prediction data in Figure 9 as an example, it is evident that when the temperature is minus 9 degrees Celsius, it increases the PM2.5 index, with a contribution value of approximately 13.19. Similarly, a wind direction of 'NW' also contributes to an increase in the PM2.5 index. Conversely, dew point and air pressure have negative effects, reducing the PM2.5 index.
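For a linear model f(x) = b + Σ wᵢxᵢ with independent features, the SHAP value of feature i for one prediction reduces to the closed form wᵢ(xᵢ − E[xᵢ]). A minimal sketch of this (illustrative, not PAL's internal linearSHAP implementation):

```python
# Linear SHAP sketch: for a linear model with independent features, each
# feature's contribution is its weight times the feature's deviation from
# the background (training) mean.

def linear_shap(weights, x, background_mean):
    return [w * (xi - mi) for w, xi, mi in zip(weights, x, background_mean)]

w = [2.0, -1.0]                       # model coefficients
phi = linear_shap(w, x=[3.0, 4.0], background_mean=[1.0, 2.0])
# phi == [4.0, -2.0]: the contributions sum to the gap between this
# prediction and the average prediction over the background data.
```

This additivity is what the force plot visualizes: positive contributions push the forecast above the baseline, negative ones pull it below.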
In addition to the force plot, the contributions of exogenous variables to predictions are also presented in a bar chart within a single figure, as shown in Figure 10. This graph is designed to give users a clearer and more immediate overview; here it is evident that 'dew' and 'press' have the most significant negative impact.
Figure 10. Bar Plot of Exogenous Variables
If we revisit the results from the global permutation importance method (top 3: temp, dew, press), we find that there can be significant differences even among the top-ranking variables. This is because the global permutation importance method offers a comprehensive overview of variable importance across all predictions. In contrast, the local linearSHAP method provides a granular, prediction-specific analysis, revealing the intricate dynamics of external influences on individual forecasts. Therefore, PAL provides both global and local methods for time series forecasting algorithms, assisting users in gaining a deeper understanding of their models' behavior and the factors influencing their predictions.
In this post, we have explored the concept of machine learning explainability for time series, utilizing the SAP HANA Predictive Analysis Library (PAL). We also demonstrated a use case with the Python API (hana-ml) to build an Additive Model Time Series Analysis model on a publicly available multivariate time series dataset. Our aim was to showcase the extensive capabilities of PAL in time series explainability, which extend beyond mere forecasting to encompass embedded explainability features designed to clarify model behavior and enable businesses to make more informed and confident decisions.
Install the Python Machine Learning client from the pypi public repository: hana-ml
HANA Predictive Analysis Library Documentation
For other blog posts on hana-ml: