In a few related blog posts[1][2], we have shown how to analyze time-series with traditional approaches like seasonal decomposition, ARIMA and exponential smoothing. However, these traditional approaches have some drawbacks: they are typically ineffective for time-series with complicated trends, multiple seasonalities, or cyclic patterns (patterns that repeat with a non-fixed period, like holidays and event days). These drawbacks can be largely overcome by a relatively new method called additive model time-series analysis (shortened to additive model analysis in the rest of this post where no confusion arises). This method is provided in the SAP HANA Predictive Analysis Library (PAL), and wrapped up in the Python machine learning client for SAP HANA (hana_ml).
In this blog post, we show readers how to apply additive model analysis to analyze time-series with specific case studies.
The rest of this blog post is organized as follows: first we give a brief introduction to additive model analysis, then we present two case studies, each covering data description, modeling and forecasting, and finally we summarize the main content of this blog post.
Additive model analysis is a relatively new approach for time-series modeling. Unlike traditional approaches (like ARIMA and exponential smoothing) that explore time-based dependencies among observations, it treats time-series modeling as a curve-fitting problem and uses an additive model to fit and forecast the data. Under this setting, a given time-series is decomposed into four components: trend, seasonality, cyclic patterns, and a random component. The formula is as follows:
y(t) = g(t) + s(t) + h(t) + ε(t),

where g(t) models the trend, s(t) the seasonal components, h(t) the effect of holidays and events, and ε(t) the random error.
It can handle data with complicated trends (via piecewise-linear approximation) and multiple seasonalities (daily, weekly, yearly and beyond, even when the period is non-integer) with strong seasonal effects. Besides, the explicit modeling of cyclic patterns helps explain variations with non-fixed periods. As a consequence, additive model analysis often performs better than classical methods for time-series modeling and analysis.
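To make the decomposition concrete, here is a small self-contained sketch (plain numpy, not PAL) that builds a synthetic series from a piecewise-linear trend plus a weekly seasonal term, then recovers the seasonal profile by averaging the detrended values per weekday:

```python
import numpy as np

# Synthetic series following the additive decomposition
# y(t) = g(t) + s(t) + eps(t): a piecewise-linear trend with a
# change-point at t = 180, a weekly seasonal term, and small noise.
rng = np.random.default_rng(0)
t = np.arange(364)
trend = np.where(t < 180, 10 + 0.05 * t, 19 - 0.02 * (t - 180))
season = 2.0 * np.sin(2 * np.pi * t / 7)
y = trend + season + rng.normal(0, 0.1, size=t.size)

# Recover the seasonal component: average the detrended values per weekday.
weekly_profile = np.array([(y - trend)[t % 7 == d].mean() for d in range(7)])
print(np.round(weekly_profile, 2))  # close to 2*sin(2*pi*d/7) for d = 0..6
```

Real additive-model fitting estimates the trend, the change-points and the seasonal terms jointly; this sketch only illustrates how the components add up.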
The first dataset of interest in this blog post is the historical electricity system demand for every half-hour period (in MW) in Singapore, which is available in Singapore's open data portal[3]. The original data is stored in a .csv file[4]; it has been preprocessed and stored in a table named 'SG_SYSTEM_DEMAND_DATA_TBL' in SAP HANA. The dataset can then be accessed via a ConnectionContext to SAP HANA under hana_ml:
import hana_ml
from hana_ml.dataframe import ConnectionContext
cc = ConnectionContext('<hana-host>', 30315, '<user>', '<password>')  # replace with your own connection details
df = cc.table('SG_SYSTEM_DEMAND_DATA_TBL')
data = df.sort('datetime').collect()
data
| | datetime | system_demand_actual |
|---|---|---|
| 0 | 2012-02-06 00:30:00 | 4526.68 |
| 1 | 2012-02-06 01:00:00 | 4436.09 |
| 2 | 2012-02-06 01:30:00 | 4361.24 |
| 3 | 2012-02-06 02:00:00 | 4295.23 |
| 4 | 2012-02-06 02:30:00 | 4233.17 |
| ... | ... | ... |
| 75931 | 2016-06-05 22:00:00 | 5862.00 |
| 75932 | 2016-06-05 22:30:00 | 5756.00 |
| 75933 | 2016-06-05 23:00:00 | 5674.00 |
| 75934 | 2016-06-05 23:30:00 | 5566.00 |
| 75935 | 2016-06-06 00:00:00 | 5446.00 |
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 8))
plt.plot(data['datetime'], data['system_demand_actual'])
plt.title('Figure 1. Overview of the dataset')
plt.show()
plt.close()
df_5wk = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2012-03-12 00:00:00\''.format(df.select_statement))
data_5wk = df_5wk.sort('datetime').collect()
plt.figure(figsize=(16, 4))
plt.plot(data_5wk['datetime'], data_5wk['system_demand_actual'])
plt.title('Figure 2. Zoom-in view of the first 5 weeks of data')
plt.show()
plt.close()
df_train = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2016-02-18 00:00:00\''.format(df.select_statement))
df_train.sort('datetime').collect()
| | datetime | system_demand_actual |
|---|---|---|
| 0 | 2012-02-06 00:30:00 | 4526.68 |
| 1 | 2012-02-06 01:00:00 | 4436.09 |
| 2 | 2012-02-06 01:30:00 | 4361.24 |
| 3 | 2012-02-06 02:00:00 | 4295.23 |
| 4 | 2012-02-06 02:30:00 | 4233.17 |
| ... | ... | ... |
| 70699 | 2016-02-17 22:00:00 | 5903.00 |
| 70700 | 2016-02-17 22:30:00 | 5745.00 |
| 70701 | 2016-02-17 23:00:00 | 5597.00 |
| 70702 | 2016-02-17 23:30:00 | 5441.00 |
| 70703 | 2016-02-18 00:00:00 | 5321.00 |
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast  # the Python API for additive model analysis in hana_ml

amf = AdditiveModelForecast(seasonality_mode='additive',
                            daily_seasonality='true',
                            weekly_seasonality='true',
                            yearly_seasonality='true')
amf.fit(df_train)
# the test set is the remainder of the data, i.e. everything after the training cut-off
df_test = cc.sql('SELECT * FROM ({}) WHERE "datetime" > \'2016-02-18 00:00:00\''.format(df.select_statement))
pred_res = amf.predict(df_test)
pred_data = pred_res.sort('datetime').collect()
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(data['datetime'][50000:], data['system_demand_actual'][50000:])  # use partial data to better illustrate the prediction result
plt.plot(pred_data['datetime'], pred_data['YHAT'], color='red')
plt.legend(['Actual Demand', 'Predicted Demand'])
plt.title('Figure 3. Actual Demand VS Predicted Demand')
plt.show()
plt.close()

We can join the real demand values with predicted ones, and compute the prediction accuracy measures.
rename_pred = pred_res.rename_columns({'datetime': 'date_time'})
join_pred = df_test.join(rename_pred[['date_time', 'YHAT']], #YHAT is the column that stores the predicted values
'"datetime"="date_time"')
join_pred.collect()
| | datetime | system_demand_actual | date_time | YHAT |
|---|---|---|---|---|
| 0 | 2016-02-18 00:30:00 | 5164.0 | 2016-02-18 00:30:00 | 5218.050311 |
| 1 | 2016-02-18 01:00:00 | 5044.0 | 2016-02-18 01:00:00 | 5103.471702 |
| 2 | 2016-02-18 01:30:00 | 4943.0 | 2016-02-18 01:30:00 | 5010.291241 |
| 3 | 2016-02-18 02:00:00 | 4866.0 | 2016-02-18 02:00:00 | 4937.996240 |
| 4 | 2016-02-18 02:30:00 | 4812.0 | 2016-02-18 02:30:00 | 4882.707945 |
| ... | ... | ... | ... | ... |
| 5227 | 2016-06-05 22:00:00 | 5862.0 | 2016-06-05 22:00:00 | 5777.180270 |
| 5228 | 2016-06-05 22:30:00 | 5756.0 | 2016-06-05 22:30:00 | 5688.704385 |
| 5229 | 2016-06-05 23:00:00 | 5674.0 | 2016-06-05 23:00:00 | 5577.197404 |
| 5230 | 2016-06-05 23:30:00 | 5566.0 | 2016-06-05 23:30:00 | 5452.238527 |
| 5231 | 2016-06-06 00:00:00 | 5446.0 | 2016-06-06 00:00:00 | 5325.501001 |
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
res = accuracy_measure(data = join_pred[['system_demand_actual', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
res.collect()
| | STAT_NAME | STAT_VALUE |
|---|---|---|
| 0 | MAD | 195.786364 |
| 1 | MAPE | 0.033426 |
| 2 | RMSE | 244.762496 |
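The three measures are standard: MAD is the mean absolute error, MAPE the mean absolute percentage error, and RMSE the root mean squared error. They can be reproduced client-side from any joined actual/predicted frame; a minimal sketch with made-up numbers (not the demand data above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for join_pred.collect(): actual vs. predicted values.
df = pd.DataFrame({'actual': [100.0, 200.0, 400.0],
                   'yhat':   [110.0, 190.0, 420.0]})
err = df['yhat'] - df['actual']
mad = err.abs().mean()                          # mean absolute deviation
mape = (err.abs() / df['actual'].abs()).mean()  # mean absolute percentage error
rmse = np.sqrt((err ** 2).mean())               # root mean squared error
print(round(mad, 4), round(mape, 4), round(rmse, 4))  # → 13.3333 0.0667 14.1421
```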
from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing
aesm = AutoExponentialSmoothing(forecast_model_name='TESM',
forecast_num=df_test.count(),
seasonal='additive',
optimizer_random_seed=1)
df_train_wid = df_train.add_id('ID', 'datetime')
aesm.fit_predict(data=df_train_wid, key='ID',
                 endog='system_demand_actual')
aesm.forecast_.sort('TIMESTAMP').collect()
| | TIMESTAMP | VALUE | PI1_LOWER | PI1_UPPER | PI2_LOWER | PI2_UPPER |
|---|---|---|---|---|---|---|
| 0 | 1 | 4710.426403 | NaN | NaN | NaN | NaN |
| 1 | 2 | 4420.592040 | NaN | NaN | NaN | NaN |
| 2 | 3 | 4344.739633 | NaN | NaN | NaN | NaN |
| 3 | 4 | 4294.175059 | NaN | NaN | NaN | NaN |
| 4 | 5 | 4237.244328 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 75931 | 75932 | 5611.215324 | 3185.398049 | 8037.032600 | 1901.247953 | 9321.182696 |
| 75932 | 75933 | 5505.265185 | 3079.215918 | 7931.314453 | 1794.943013 | 9215.587358 |
| 75933 | 75934 | 5402.156812 | 2975.875576 | 7828.438048 | 1691.479873 | 9112.833751 |
| 75934 | 75935 | 5295.253762 | 2868.740578 | 7721.766945 | 1584.222091 | 9006.285433 |
| 75935 | 75936 | 5185.558722 | 2758.813614 | 7612.303831 | 1474.172353 | 8896.945092 |
df_wid = df.add_id('ID', 'datetime')
df_test_wid = cc.sql('SELECT * FROM ({}) WHERE ID >= {}'.format(df_wid.select_statement, df_train.count()))
aesm_join_res = df_test_wid.join(aesm.forecast_[['TIMESTAMP', 'VALUE']],  # the VALUE column stores fitted & predicted demand values
                                 'ID = TIMESTAMP')
aesm_join_res.collect()
| | ID | datetime | system_demand_actual | TIMESTAMP | VALUE |
|---|---|---|---|---|---|
| 0 | 70704 | 2016-02-18 00:00:00 | 5321.0 | 70704 | 5305.413617 |
| 1 | 70705 | 2016-02-18 00:30:00 | 5164.0 | 70705 | 5170.236697 |
| 2 | 70706 | 2016-02-18 01:00:00 | 5044.0 | 70706 | 5053.972515 |
| 3 | 70707 | 2016-02-18 01:30:00 | 4943.0 | 70707 | 4958.628810 |
| 4 | 70708 | 2016-02-18 02:00:00 | 4866.0 | 70708 | 4887.215553 |
| ... | ... | ... | ... | ... | ... |
| 5228 | 75932 | 2016-06-05 22:00:00 | 5862.0 | 75932 | 5611.215324 |
| 5229 | 75933 | 2016-06-05 22:30:00 | 5756.0 | 75933 | 5505.265185 |
| 5230 | 75934 | 2016-06-05 23:00:00 | 5674.0 | 75934 | 5402.156812 |
| 5231 | 75935 | 2016-06-05 23:30:00 | 5566.0 | 75935 | 5295.253762 |
| 5232 | 75936 | 2016-06-06 00:00:00 | 5446.0 | 75936 | 5185.558722 |
aesm_measure = accuracy_measure(data = aesm_join_res[['system_demand_actual', 'VALUE']],
evaluation_metric = ['rmse', 'mad', 'mape'])
aesm_measure.collect()
| | STAT_NAME | STAT_VALUE |
|---|---|---|
| 0 | MAD | 262.475689 |
| 1 | MAPE | 0.044364 |
| 2 | RMSE | 300.193255 |

On all three measures, the additive model (test MAPE ≈ 3.34%) outperforms auto exponential smoothing (test MAPE ≈ 4.44%) on this dataset.
fit_res = amf.predict(df_train) #obtain the fitting result for the training data
fit_res.sort('datetime').collect()
| | datetime | YHAT | YHAT_LOWER | YHAT_UPPER |
|---|---|---|---|---|
| 0 | 2012-02-06 00:30:00 | 4542.912928 | 4225.863575 | 4856.542969 |
| 1 | 2012-02-06 01:00:00 | 4442.691457 | 4116.797874 | 4751.370439 |
| 2 | 2012-02-06 01:30:00 | 4364.044821 | 4051.549657 | 4685.978732 |
| 3 | 2012-02-06 02:00:00 | 4306.432861 | 3994.744755 | 4599.067855 |
| 4 | 2012-02-06 02:30:00 | 4265.949011 | 3936.221074 | 4602.731532 |
| ... | ... | ... | ... | ... |
| 70699 | 2016-02-17 22:00:00 | 5857.695587 | 5552.015396 | 6172.163877 |
| 70700 | 2016-02-17 22:30:00 | 5755.662651 | 5469.078620 | 6074.457734 |
| 70701 | 2016-02-17 23:00:00 | 5630.292965 | 5292.285728 | 5938.692115 |
| 70702 | 2016-02-17 23:30:00 | 5491.190423 | 5176.592394 | 5808.430988 |
| 70703 | 2016-02-18 00:00:00 | 5350.053660 | 5030.360951 | 5667.936164 |
fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})
fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')
fit_join.collect()
| | datetime | system_demand_actual | date_time | YHAT | YHAT_LOWER | YHAT_UPPER |
|---|---|---|---|---|---|---|
| 0 | 2012-02-06 00:30:00 | 4526.68 | 2012-02-06 00:30:00 | 4542.912928 | 4225.863575 | 4856.542969 |
| 1 | 2012-02-06 01:00:00 | 4436.09 | 2012-02-06 01:00:00 | 4442.691457 | 4116.797874 | 4751.370439 |
| 2 | 2012-02-06 01:30:00 | 4361.24 | 2012-02-06 01:30:00 | 4364.044821 | 4051.549657 | 4685.978732 |
| 3 | 2012-02-06 02:00:00 | 4295.23 | 2012-02-06 02:00:00 | 4306.432861 | 3994.744755 | 4599.067855 |
| 4 | 2012-02-06 02:30:00 | 4233.17 | 2012-02-06 02:30:00 | 4265.949011 | 3936.221074 | 4602.731532 |
| ... | ... | ... | ... | ... | ... | ... |
| 70699 | 2016-02-17 22:00:00 | 5903.00 | 2016-02-17 22:00:00 | 5857.695587 | 5552.015396 | 6172.163877 |
| 70700 | 2016-02-17 22:30:00 | 5745.00 | 2016-02-17 22:30:00 | 5755.662651 | 5469.078620 | 6074.457734 |
| 70701 | 2016-02-17 23:00:00 | 5597.00 | 2016-02-17 23:00:00 | 5630.292965 | 5292.285728 | 5938.692115 |
| 70702 | 2016-02-17 23:30:00 | 5441.00 | 2016-02-17 23:30:00 | 5491.190423 | 5176.592394 | 5808.430988 |
| 70703 | 2016-02-18 00:00:00 | 5321.00 | 2016-02-18 00:00:00 | 5350.053660 | 5030.360951 | 5667.936164 |
residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(data['datetime'][0:df_train.count()], residue.sort('datetime').collect()['RESID'])
plt.title('Figure 4. Overview of the random component of training data')
plt.show()
plt.close()
abnormal = cc.sql('SELECT * FROM ({}) ORDER BY RESID DESC LIMIT 100'.format(residue.select_statement))
abnormal_dates = abnormal[['datetime']].cast('datetime', 'DATE').drop_duplicates()
abnormal_dates.sort('datetime').collect()
| | datetime |
|---|---|
| 0 | 2013-02-11 |
| 1 | 2013-02-12 |
| 2 | 2014-01-31 |
| 3 | 2014-10-22 |
| 4 | 2015-02-19 |
| 5 | 2015-02-20 |
| 6 | 2016-02-08 |
| 7 | 2016-02-09 |
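The screening above (largest residuals collapsed to distinct calendar dates) can be mimicked client-side with pandas; a tiny sketch with hypothetical residual values:

```python
import pandas as pd

# Hypothetical half-hourly residuals; in the real workflow this frame
# would come from residue.collect().
resid = pd.DataFrame({
    'datetime': pd.to_datetime(['2013-02-11 09:00', '2013-02-11 10:00',
                                '2014-01-31 09:30', '2015-06-01 12:00']),
    'RESID': [820.0, 790.0, 760.0, 15.0]})
# Keep the rows with the largest residuals, then collapse to distinct dates.
top = resid.nlargest(3, 'RESID')
dates = sorted(top['datetime'].dt.date.unique())
print(dates)  # → [datetime.date(2013, 2, 11), datetime.date(2014, 1, 31)]
```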
Most of the abnormal dates above coincide with public holidays in Singapore (notably Chinese New Year), so we bring in holiday information:
holidays = cc.table('SG_HOLIDAY_TBL').cast('Date', 'DATE')  # the Date column in the holiday table is of type 'TIMESTAMP'; convert it to 'DATE'
holidays.head(5).collect()
| | DATE | NAME |
|---|---|---|
| 0 | 2012-01-02 | NY |
| 1 | 2012-01-23 | CNY |
| 2 | 2012-01-24 | CNY |
| 3 | 2012-04-06 | GF |
| 4 | 2012-05-01 | LD |
amf.fit(data = df_train, holiday=holidays)
fit_res = amf.predict(df_train)
fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})
fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')
residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(data['datetime'][0:df_train.count()], residue.sort('datetime').collect()['RESID'])
plt.title('Figure 5. Overview of the random component of training data with holiday info.')
plt.show()
plt.close()
pred_res2 = amf.predict(df_test)
renm_pred2 = pred_res2.rename_columns({'datetime': 'date_time'})
join_pred2 = df_test.join(renm_pred2[['date_time', 'YHAT']], '"datetime"="date_time"')
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
res = accuracy_measure(data = join_pred2[['system_demand_actual', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
res.collect()
| | STAT_NAME | STAT_VALUE |
|---|---|---|
| 0 | MAD | 175.825074 |
| 1 | MAPE | 0.030518 |
| 2 | RMSE | 220.948932 |

With holiday information included, all three accuracy measures improve (e.g. the test MAPE drops from about 3.34% to 3.05%).
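Holiday effects enter the additive model as extra indicator-style regressors. The mechanism can be illustrated independently of PAL with a toy regression where a holiday dummy absorbs a level shift on specific points:

```python
import numpy as np

# Toy series: constant level 100, plus a +30 jump on two 'holiday' points.
t = np.arange(20.0)
holiday = np.zeros_like(t)
holiday[[5, 12]] = 1.0
y = 100.0 + 30.0 * holiday

# Regress on [1, holiday]: the coefficient on the dummy is the holiday effect.
X = np.column_stack([np.ones_like(t), holiday])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 3))  # → [100.  30.]
```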
The second dataset contains Singapore's monthly total international visitor arrivals. After preprocessing, it is stored in SAP HANA and accessed as a hana_ml DataFrame, referred to as psg below:
| | month | total_international_visitor_arrivals |
|---|---|---|
| 0 | 1978-01-01 | 167016 |
| 1 | 1978-02-01 | 147954 |
| 2 | 1978-03-01 | 163199 |
| 3 | 1978-04-01 | 162400 |
| 4 | 1978-05-01 | 162667 |
| ... | ... | ... |
| 450 | 2015-07-01 | 1519188 |
| 451 | 2015-08-01 | 1445052 |
| 452 | 2015-09-01 | 1131967 |
| 453 | 2015-10-01 | 1246706 |
| 454 | 2015-11-01 | 1193906 |
train_num = 408
test_num = psg.count() - train_num
psg_train = psg.head(train_num)
psg_test = psg.sort('month', desc=True).head(test_num).sort('month')
psg_train_data = psg_train.collect()
import matplotlib.pyplot as plt
plt.figure(figsize=(16*(408/455),5))
plt.plot(psg_train_data['month'],
psg_train_data['total_international_visitor_arrivals'])
plt.show()
plt.close()
event = cc.table('SG_SARS_MONTH_TBL').cast('Month', 'DATE')
event.collect()
| | Month | Name |
|---|---|---|
| 0 | 2003-04-01 | SARS |
| 1 | 2003-05-01 | SARS |
| 2 | 2003-06-01 | SARS |
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast

amf = AdditiveModelForecast(seasonality_mode='multiplicative')
amf.fit(data=psg_train, holiday=event)
psg_fit = amf.predict(psg_train)  # obtain fitted values for the training data
psg_fit_rename_month = psg_fit.rename_columns({'month': 'Month'})
join_true_and_fit = psg_train.join(psg_fit_rename_month, '"month" = "Month"')
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
fit_res = accuracy_measure(data = join_true_and_fit[['total_international_visitor_arrivals', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
fit_res.collect()
| | STAT_NAME | STAT_VALUE |
|---|---|---|
| 0 | MAD | 31945.872096 |
| 1 | MAPE | 0.067876 |
| 2 | RMSE | 51823.310879 |
psg_pred = amf.predict(psg_test)
psg_pred_rename_month = psg_pred.rename_columns({'month': 'Month'})
join_true_and_pred = psg_test.join(psg_pred_rename_month, '"month" = "Month"')
psg_pred_acc = accuracy_measure(data = join_true_and_pred[['total_international_visitor_arrivals', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
psg_pred_acc.collect()
| | STAT_NAME | STAT_VALUE |
|---|---|---|
| 0 | MAD | 119813.052030 |
| 1 | MAPE | 0.093767 |
| 2 | RMSE | 139003.141614 |
pred_data = psg_pred.collect()
psg_test_data = psg_test.collect()
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(psg_test_data['month'],
         psg_test_data['total_international_visitor_arrivals'])
plt.plot(pred_data['month'], pred_data['YHAT'], color='red')
plt.show()
plt.close()

import pandas as pd

psg_train_data = psg_train.collect()
changepoints = list(pd.to_datetime(pd.Series(['1982-11-01', '1986-01-01', '1991-05-01',
                                              '1995-09-01', '1997-10-01', '2000-07-01',
                                              '2004-02-01', '2007-12-01', '2009-03-01'])))
event_points = list(pd.to_datetime(pd.Series(['2003-04-01', '2003-05-01', '2003-06-01'])))
cp_data = psg_train_data.loc[[x in changepoints for x in psg_train_data['month']], :]
event_data = psg_train_data.loc[[x in event_points for x in psg_train_data['month']], :]
plt.figure(figsize=(int(16*(408/455)), 5))
plt.plot(psg_train_data['month'], psg_train_data['total_international_visitor_arrivals'])
plt.scatter(cp_data['month'], cp_data['total_international_visitor_arrivals'], color='red')
plt.scatter(event_data['month'], event_data['total_international_visitor_arrivals'], color='purple')
plt.legend(['Visitor Arrivals', 'Specified Change-points', 'Event Points'])
plt.show()
plt.close()
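Specifying change-points effectively lets the trend's slope change at those dates. The mechanism can be illustrated independently of PAL with a least-squares fit on a hinge basis (a sketch, with an assumed change-point at t = 50):

```python
import numpy as np

# Piecewise-linear trend: slope 0.5 before t = 50, slope -0.3 after.
t = np.arange(100.0)
y = 1.0 + 0.5 * t - 0.8 * np.maximum(t - 50, 0)

# Fit on the basis [1, t, max(t - 50, 0)]; the hinge term carries the
# slope change at the specified change-point.
X = np.column_stack([np.ones_like(t), t, np.maximum(t - 50, 0)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 3))  # → [ 1.   0.5 -0.8]
```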

amf = AdditiveModelForecast(seasonality_mode='multiplicative',
changepoints=['1982-11-01 00:00:00',
'1986-01-01 00:00:00',
'1991-05-01 00:00:00',
'1995-09-01 00:00:00',
'1997-10-01 00:00:00',
'2000-07-01 00:00:00',
'2004-02-01 00:00:00',
'2007-12-01 00:00:00',
'2009-03-01 00:00:00'])
amf.fit(data=psg_train, holiday=event)
psg_fit2 = amf.predict(psg_train)
psg_fit2_rename_month = psg_fit2.rename_columns({'month': 'Month'})
join_true_and_fit2 = psg_train.join(psg_fit2_rename_month, '"month" = "Month"')
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
fit_res2 = accuracy_measure(data = join_true_and_fit2[['total_international_visitor_arrivals', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
fit_res2.collect()
| | STAT_NAME | STAT_VALUE |
|---|---|---|
| 0 | MAD | 32760.561142 |
| 1 | MAPE | 0.070333 |
| 2 | RMSE | 52156.801417 |
psg_pred2 = amf.predict(psg_test)
psg_pred2_rename_month = psg_pred2.rename_columns({'month': 'Month'})
join_true_and_pred2 = psg_test.join(psg_pred2_rename_month, '"month" = "Month"')
psg_pred2_acc = accuracy_measure(data = join_true_and_pred2[['total_international_visitor_arrivals', 'YHAT']],
                                 evaluation_metric = ['rmse', 'mad', 'mape'])
psg_pred2_acc.collect()
| | STAT_NAME | STAT_VALUE |
|---|---|---|
| 0 | MAD | 93115.156652 |
| 1 | MAPE | 0.073317 |
| 2 | RMSE | 111212.533575 |

With user-specified change-points, prediction accuracy on the test set improves markedly (test MAPE drops from about 9.38% to 7.33%), even though the fitting accuracy on the training data is slightly worse.
pred2_data = psg_pred2.collect()
plt.figure(figsize=(16,5))
plt.plot(psg_test_data['month'],
psg_test_data['total_international_visitor_arrivals'])
plt.plot(pred2_data['month'], pred2_data['YHAT'], color='red')
plt.legend(['Actual Arrivals', 'Predicted Arrivals'])
plt.show()
plt.close()
To summarize: in this blog post we introduced additive model time-series analysis, available in SAP HANA PAL and wrapped up in hana_ml, and applied it to two Singapore datasets: half-hourly electricity system demand and monthly international visitor arrivals. The method handled the complicated trends and multiple seasonalities well and outperformed auto exponential smoothing on the demand data, while incorporating holiday information and user-specified change-points further improved forecast accuracy.