Additive Model Time-series Analysis using Python M...

likun_hou · ‎2021 Jan 18

In a few related blog posts[1][2], we have shown the analysis of time-series using traditional approaches like seasonal decomposition, ARIMA and exponential smoothing. However, those traditional approaches have some drawbacks for time-series analysis, typically ineffective in dealing time-series with complicated trend, multiple seasonality as well as cyclic patterns(repeated with non-fixed period, like holidays, event days, etc). The aforementioned drawbacks can be largely overcome by a relatively new method called additive model time-series analysis(additive model analysis is often used for short in later context without confusion). This method is provided in the SAP HANA Predictive Analysis Library(PAL), and wrapped up in the Python machine learning client for SAP HANA(hana_ml).

In this blog post, we show readers how to apply additive model analysis to analyze time-series with specific case studies.

The rest of this blog post is organized as follows: firstly we give a brief introduction to additive model analysis, then we go to two specific case studies, inclusive of data description, modeling and forecasting, and finally we summarize the main content of this blog post.

Introduction

Additive model analysis is a newly emerged approach for time-series modeling. Unlike traditional approaches(like ARIMA and exponential smoothing) that explore time-based dependencies among observations, it treats time-series modeling as a curve-fitting problem, and uses an additive model to fit/forecast time-series data. Under this setting, the given time-series would be decomposed into four components: trend, seasonality, cyclic patterns, and a random component. The formula is as follows:

𝑦(𝑡)=𝑔(𝑡)+𝑠(𝑡)+ℎ(𝑡)+ϵ(𝑡).

It can handle data with complicated trend(using piecewise linear approximation), multiple seasonality(daily, weekly, yearly and beyond, event if period being non-integer) with strong seasonal effects. Besides, the modeling of cyclic patterns also helps to explain variations with non-fixing period. As a consequence, additive model time-series analysis often has better performance when comparing with other classical methods for time-series modeling/analysis.

Case Study I : Additive Model Analysis for Electricity System Demand in Singapore

Dataset Description

The first dataset of our interest in this blog post is the historical electricity system demand for every half-hour period(MW) in Singapore, which is available in Singapore's open data portal[3]. The original data is stored in a .csv file[4], and it has been preprocessed and stored in a table with name 'SG_SYSTEM_DEMAND_DATA_TBL' in SAP HANA. Then, the dataset can be accessed via a ConnectionContext to SAP HANA under hana_ml.

import hana_ml

from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext('lsvxc0103.sjc.sap.corp', 30315, 'PAL_USER', 'Welcome1')

df = cc.table('SG_SYSTEM_DEMAND_DATA_TBL')

data = df.sort('datetime').collect()

data

	datetime	system_demand_actual
0	2012-02-06 00:30:00	4526.68
1	2012-02-06 01:00:00	4436.09
2	2012-02-06 01:30:00	4361.24
3	2012-02-06 02:00:00	4295.23
4	2012-02-06 02:30:00	4233.17
...	...	...
75931	2016-06-05 22:00:00	5862.00
75932	2016-06-05 22:30:00	5756.00
75933	2016-06-05 23:00:00	5674.00
75934	2016-06-05 23:30:00	5566.00
75935	2016-06-06 00:00:00	5446.00

75936 rows × 2 columns

In the following context we draw the run-sequence plot of the full time-series.

import matplotlib.pyplot as plt

plt.figure(figsize=(16, 8))

plt.plot(data['datetime'], data['system_demand_actual'])

plt.title('Figure 1. Overview of the dataset')

plt.show()

plt.close()

A coarse examination of the plot indicates that the data have approximately linear trend and yearly seasonality. Now let us zoom in to see if we can have other findings. In the following we take out data of the first 5 weeks for detailed examination.

df_5wk = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2012-03-12 00:00:00\''.format(df.select_statement))

data_5wk = df_5wk.sort('datetime').collect()

Now the 5-week data has been collected to the python client, and we can draw the run-sequence plot in python.

plt.figure(figsize=(16, 4))

plt.plot(data_5wk['datetime'], data_5wk['system_demand_actual'])

plt.title('Figure 2: Zoom in view of data of 1st 5 weeks')

plt.show()

plt.close()

The plot shows that the data also have strong weekly seasonality, and(weak) daily seasonality caused by the similarity of demand pattern between weekdays(the demand pattern in weekends are also simliar, yet with smaller magnitude).

In summary, the given time-series for analysis contains multiple seasonality, inclusive of yearly seasonality(with non-integer period 365.25). All these characteristics make the analysis of this time-series difficult when using traditional approaches like ARIMA, exponential smoothing as well as seasonal decompose.

Modeling and Forecast using Additive Model Analysis

Additive models analysis is a new method that treats time-series modeling as a curve-fitting problem with respect to time. In contrast, exponential smoothing and ARIMA try model the dependencies of the current data with the past(inclusive of expected values and errors). Besides, additive model analysis use (partial) Fourier series to model periodic patterns in the input data. By doing so, the problem of non-integer periods is easily resolved, and the problem of multiple periods also become non-critical. Moreover, additive model analysis also allows users to input events(usually cyclical, called holidays in the Python API) that could potentially affect the the shape of the curve for given time-series during a short period of time and break up the regular seasonal pattern. This often leads to better modeling for time-series.

Now let us apply additive model analysis to our dataset of electricity system demand. As a standard procedure in machine learning, we firstly split the dataset into two parts for training and testing purposes, respectively.

df_train = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2016-02-18 00:00:00\''.format(df.select_statement))

df_train.sort('datetime').collect()

	datetime	system_demand_actual
0	2012-02-06 00:30:00	4526.68
1	2012-02-06 01:00:00	4436.09
2	2012-02-06 01:30:00	4361.24
3	2012-02-06 02:00:00	4295.23
4	2012-02-06 02:30:00	4233.17
...	...	...
70699	2016-02-17 22:00:00	5903.00
70700	2016-02-17 22:30:00	5745.00
70701	2016-02-17 23:00:00	5597.00
70702	2016-02-17 23:30:00	5441.00
70703	2016-02-18 00:00:00	5321.00

70704 rows × 2 columns

Now let us use the training set to fit a model for additive model analysis.

from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast # import the python api for additive model analysis from hana_ml

amf = AdditiveModelForecast(seasonality_mode='additive', daily_seasonality='true', weekly_seasonality='true', yearly_seasonality='true')

amf.fit(df_train)

	datetime	system_demand_actual
0	2012-02-06 00:30:00	4526.68
1	2012-02-06 01:00:00	4436.09
2	2012-02-06 01:30:00	4361.24
3	2012-02-06 02:00:00	4295.23
4	2012-02-06 02:30:00	4233.17
...	...	...
70699	2016-02-17 22:00:00	5903.00
70700	2016-02-17 22:30:00	5745.00
70701	2016-02-17 23:00:00	5597.00
70702	2016-02-17 23:30:00	5441.00
70703	2016-02-18 00:00:00	5321.00

70704 rows × 2 columns

The model fitting procedure takes some time. After it is done, we may apply the trained additive model to the test data, in which case only time-stamps matter irrespective of the demand values.

pred_res = amf.predict(df_test)

pred_data = pred_res.sort('datetime').collect()

import matplotlib.pyplot as plt

plt.figure(figsize=(16,5))

plt.plot(data['datetime'][50000:], data['system_demand_actual'][50000:])# use partial data for better illustration of the prediction result

plt.plot(pred_data['datetime'], pred_data['YHAT'], color='red')

plt.legend(['Actual Demand', 'Predicted Demand'])

plt.title('Figure 3. Actual Demand VS Predicted Demand')

plt.show()

plt.close()

We can join the real demand values with predicted ones, and compute the prediction accuracy measures.

rename_pred = pred_res.rename_columns({'datetime': 'date_time'})

join_pred = df_test.join(rename_pred[['date_time', 'YHAT']], #YHAT is the column that stores the predicted values

                         '"datetime"="date_time"')

join_pred.collect()

	datetime	system_demand_actual	date_time	YHAT
0	2016-02-18 00:30:00	5164.0	2016-02-18 00:30:00	5218.050311
1	2016-02-18 01:00:00	5044.0	2016-02-18 01:00:00	5103.471702
2	2016-02-18 01:30:00	4943.0	2016-02-18 01:30:00	5010.291241
3	2016-02-18 02:00:00	4866.0	2016-02-18 02:00:00	4937.996240
4	2016-02-18 02:30:00	4812.0	2016-02-18 02:30:00	4882.707945
...	...	...	...	...
5227	2016-06-05 22:00:00	5862.0	2016-06-05 22:00:00	5777.180270
5228	2016-06-05 22:30:00	5756.0	2016-06-05 22:30:00	5688.704385
5229	2016-06-05 23:00:00	5674.0	2016-06-05 23:00:00	5577.197404
5230	2016-06-05 23:30:00	5566.0	2016-06-05 23:30:00	5452.238527
5231	2016-06-06 00:00:00	5446.0	2016-06-06 00:00:00	5325.501001

5232 rows × 4 columns

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

res = accuracy_measure(data = join_pred[['system_demand_actual', 'YHAT']],

                       evaluation_metric = ['rmse', 'mad', 'mape'])

res.collect()

	STAT_NAME	STAT_VALUE
0	MAD	195.786364
1	MAPE	0.033426
2	RMSE	244.762496

MAD and RMSE are both small compared with the demand values in the dataset (~200 vs ~5000), consistent with the mean absolute percent error(MAPE) which is around 3.3%.

For the sake of comparison, we apply AutoExponentialSmoothing on the same set of training data, and then evaluate the performance of the generated model on test data.

For saving computational time and improving model quality, we can pre-specify a few parameters when initializing the AutoExponentialSmoothing instance: since the data contains both trend and seasonality components, triple exponential smoothing should be chosen to handle them; seasonal components do not vary with level, which suggests additive seasonality, so the seasonal parameter should be set to 'additive'.

from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing

aesm = AutoExponentialSmoothing(forecast_model_name='TESM',

                                forecast_num=df_test.count(),

                                seasonal='additive',

                                optimizer_random_seed=1)

df_train_wid = df_train.add_id('ID', 'datetime')

aesm.fit_predict(data=df_train_wid, key='ID',

                 endog='system_demand_actual')

aesm.forecast_.sort('TIMESTAMP').collect()

	TIMESTAMP	VALUE	PI1_LOWER	PI1_UPPER	PI2_LOWER	PI2_UPPER
0	1	4710.426403	NaN	NaN	NaN	NaN
1	2	4420.592040	NaN	NaN	NaN	NaN
2	3	4344.739633	NaN	NaN	NaN	NaN
3	4	4294.175059	NaN	NaN	NaN	NaN
4	5	4237.244328	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...
75931	75932	5611.215324	3185.398049	8037.032600	1901.247953	9321.182696
75932	75933	5505.265185	3079.215918	7931.314453	1794.943013	9215.587358
75933	75934	5402.156812	2975.875576	7828.438048	1691.479873	9112.833751
75934	75935	5295.253762	2868.740578	7721.766945	1584.222091	9006.285433
75935	75936	5185.558722	2758.813614	7612.303831	1474.172353	8896.945092

75936 rows × 6 columns

Again the predicted demand values and the actual ones can be joined for prediction accuracy measure computation.

df_wid = df.add_id('ID', 'datetime')

df_test_wid = cc.sql('SELECT * FROM ({}) WHERE ID >= {}'.format(df_wid.select_statement, df_train.count()))

aesm_join_res = df_test_wid.join(aesm.forecast_[['TIMESTAMP', 'VALUE']],##the VALUE column stores fitted & predicted demand values

                                 'ID = TIMESTAMP')

aesm_join_res.collect()

	ID	datetime	system_demand_actual	TIMESTAMP	VALUE
0	70704	2016-02-18 00:00:00	5321.0	70704	5305.413617
1	70705	2016-02-18 00:30:00	5164.0	70705	5170.236697
2	70706	2016-02-18 01:00:00	5044.0	70706	5053.972515
3	70707	2016-02-18 01:30:00	4943.0	70707	4958.628810
4	70708	2016-02-18 02:00:00	4866.0	70708	4887.215553
...	...	...	...	...	...
5228	75932	2016-06-05 22:00:00	5862.0	75932	5611.215324
5229	75933	2016-06-05 22:30:00	5756.0	75933	5505.265185
5230	75934	2016-06-05 23:00:00	5674.0	75934	5402.156812
5231	75935	2016-06-05 23:30:00	5566.0	75935	5295.253762
5232	75936	2016-06-06 00:00:00	5446.0	75936	5185.558722

5233 rows × 5 columns

aesm_measure = accuracy_measure(data = aesm_join_res[['system_demand_actual', 'VALUE']], 

                                evaluation_metric = ['rmse', 'mad', 'mape'])

aesm_measure.collect()

	STAT_NAME	STAT_VALUE
0	MAD	262.475689
1	MAPE	0.044364
2	RMSE	300.193255

One can see from the values of selected accuracy measures that the performance of exponential smoothing is not as good as additive model analysis, which is consistent with our previous analysis on the dataset.

Adding Holidays for Improved Forecast

Up until now we haven't talked about the holiday effect in additive model time-series analysis. It is seen from Figure 2 that, in weekends the system demands are usually smallers compared to weekdays, we natually suspect that in holidays the systems demand will also deviate from the regular pattern in weekdays. We verify this by checking the residue of the training data(i.e. differences between fitted and actual values) when additive model analysis is applied.

fit_res = amf.predict(df_train) #obtain the fitting result for the training data

fit_res.sort('datetime').collect()

	datetime	YHAT	YHAT_LOWER	YHAT_UPPER
0	2012-02-06 00:30:00	4542.912928	4225.863575	4856.542969
1	2012-02-06 01:00:00	4442.691457	4116.797874	4751.370439
2	2012-02-06 01:30:00	4364.044821	4051.549657	4685.978732
3	2012-02-06 02:00:00	4306.432861	3994.744755	4599.067855
4	2012-02-06 02:30:00	4265.949011	3936.221074	4602.731532
...	...	...	...	...
70699	2016-02-17 22:00:00	5857.695587	5552.015396	6172.163877
70700	2016-02-17 22:30:00	5755.662651	5469.078620	6074.457734
70701	2016-02-17 23:00:00	5630.292965	5292.285728	5938.692115
70702	2016-02-17 23:30:00	5491.190423	5176.592394	5808.430988
70703	2016-02-18 00:00:00	5350.053660	5030.360951	5667.936164

70704 rows × 4 columns

We join the fitted values with actual ones for computing their differences.

fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})

fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')

fit_join.collect()

	datetime	system_demand_actual	date_time	YHAT	YHAT_LOWER	YHAT_UPPER
0	2012-02-06 00:30:00	4526.68	2012-02-06 00:30:00	4542.912928	4225.863575	4856.542969
1	2012-02-06 01:00:00	4436.09	2012-02-06 01:00:00	4442.691457	4116.797874	4751.370439
2	2012-02-06 01:30:00	4361.24	2012-02-06 01:30:00	4364.044821	4051.549657	4685.978732
3	2012-02-06 02:00:00	4295.23	2012-02-06 02:00:00	4306.432861	3994.744755	4599.067855
4	2012-02-06 02:30:00	4233.17	2012-02-06 02:30:00	4265.949011	3936.221074	4602.731532
...	...	...	...	...	...	...
707699	2016-02-17 22:00:00	5903.00	2016-02-17 22:00:00	5857.695587	5552.015396	6172.163877
70700	2016-02-17 22:30:00	5745.00	2016-02-17 22:30:00	5755.662651	5469.078620	6074.457734
70701	2016-02-17 23:00:00	5597.00	2016-02-17 23:00:00	5630.292965	5292.285728	5938.692115
70702	2016-02-17 23:30:00	5441.00	2016-02-17 23:30:00	5491.190423	5176.592394	5808.430988
70703	2016-02-18 00:00:00	5321.00	2016-02-18 00:00:00	5350.053660	5030.360951	5667.936164

70704 rows × 6 columns

residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))

import matplotlib.pyplot as plt

plt.figure(figsize=(16,5))

plt.plot(data['datetime'][0:df_train.count()], residue.collect()['RESID'])

plt.title('Figure 4. Overview of the random component of training data')

plt.show()

plt.close()

One can see that there are many(mainly upward w.r.t. zero value) large spiky values in the residual, which indicates that in certain days the actual demand values would fall below the predicted values by a large amount. For further examination, we use the deviation of the residual value from 0 as the measurement of how abnormality an data point is, then the date information of the top 100 result in the dataset is extracted.

abnormal = cc.sql('SELECT * FROM ({}) ORDER BY RESID DESC LIMIT 100'.format(residue.select_statement))

abnormal_dates = abnormal[['datetime']].cast('datetime', 'DATE').drop_duplicates()

abnormal_dates.sort('datetime').collect()

	datetime
0	2013-02-11
1	2013-02-12
2	2014-01-31
3	2014-10-22
4	2015-02-19
5	2015-02-20
6	2016-02-08
7	2016-02-09

The dates listed above are all public holidays in Singapore(2014-10-22 is Deepavali Day, others are for Lunar New Year), which indicates that system demand will fall below the regular level if the day for inspection happens to be a public holiday. This justifies our previous suspection.

However, since major public holidays of a year usually shall be announced even before the year actuall starts, it is okay for us to incorporate them in the model building and model prediction phase. Additive model analysis provides us this facility, so we can collect the set of holidays(exclusive of weekends) within the dates range of the input data and store it in SAP HANA.

holidays = cc.table('SG_HOLIDAY_TBL').cast('Date', 'DATE')#The Date column in holiday table is of type 'TIMESTAMP', convert to 'DATE'

holidays.head(5).collect()

	DATE	NAME
0	2012-01-02	NY
1	2012-01-23	CNY
2	2012-01-24	CNY
3	2012-04-06	GF
4	2012-05-01	LD

where NY = New Year, CNY = Chinese New Year, GF = Good Friday, LD = Labor Day.

Now let's re-fit the additive time-series model with the introduction of holidays.

amf.fit(data = df_train, holiday=holidays)

We re-plot the residue component.

fit_res = amf.predict(df_train)

fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})

fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')

residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))

import matplotlib.pyplot as plt

plt.figure(figsize=(16,5))

plt.plot(data['date'][0:df_train.count()], residue.collect()['RESID'])

plt.title('Figure 4. Overview of random component of training data with holiday info.')

plt.show()

plt.close()

We can see from the residue component of the training data that, after public holiday information is introduced, those original excessively large residual values become less frequent, indicating smaller fitting error. In general, the introduction of holidays helps the model in capturing the associated cyclic patterns.

Let us verify whether the prediction error could be reduced as well by apply the re-fitted additive model on the test data.

pred_res2 = amf.predict(df_test)

renm_pred2 = pred_res2.rename_columns({'datetime': 'date_time'})

join_pred2 = df_test.join(renm_pred2[['date_time', 'YHAT']], '"datetime"="date_time"')

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

res = accuracy_measure(data = join_pred2[['system_demand_actual', 'YHAT']],

                       evaluation_metric = ['rmse', 'mad', 'mape'])

res.collect()

	STAT_NAME	STAT_VALUE
0	MAD	175.825074
1	MAPE	0.030518
2	RMSE	220.948932

One can see that all selected accuracy measures improve by a non-negligible amount on the test data thanks to the addition of holiday information：

MAD : from ~195.8 to ~175.8，

MAPE : from ~0.0335 to ~0.0305，

RMSE : from ~224.8 to ~220.9

Case Study II ：Additive Model Analysis for International Air Arrivals in Singapore

Dataset Description

The data we use in this section is the monthly international visitor arrivals at Changi airport in Singapore[5], available also in Singapore's open data portal[3]. The time-series data have an overall upward trend, while the growth rate could be different within different time-intervals. The data has been preprocessed and stored in HANA database in a table with the name of 'SG_INTER_VISITORS_TBL'.

	month	total_international_visitor_arrivals
0	1978-01-01	167016
1	1978-02-01	147954
2	1978-03-01	163199
3	1978-04-01	162400
4	1978-05-01	162667
...	...	...
450	2015-07-01	1519188
451	2015-08-01	1445052
452	2015-09-01	1131967
453	2015-10-01	1246706
454	2015-11-01	1193906

455 rows × 2 columns

The dataset contains two columns, with the 1st column being month info(represented by the 1st day of that month) and 2nd column being to total number of international visitor arrivals. The range of time for this time-series data is from Jan. 1978 to Nov. 2015, with a total number of 455 records.

To proceed further, we divide dataset into training part and testing part, respectively. All records from year 1978 to year 2011, i.e. the heading 408 records, are used for training an additive time-series model, while the rest of records are used for model evaluation.

train_num = 408

test_num = psg.count() - train_num

psg_train = psg.head(train_num)

psg_test = psg.sort('month', desc=True).head(test_num).sort('month')

In the following we examine the training data by drawing its run-sequence plot.

psg_train_data = psg_train.collect()

import matplotlib.pyplot as plt

plt.figure(figsize=(16*(408/455),5))

plt.plot(psg_train_data['month'],

         psg_train_data['total_international_visitor_arrivals'])

plt.show()

plt.close()

One can see that in general the number of air arrivals keep increasing with time. However, the increasing rate is non-stationary, and there are also some intermittent drops that are significant, making the trend of this time-series difficult to model.

Additive model time-series analysis also gives us the flexibility in modeling the trend in a picewise manner. If not specified manually, it will automatically identifying a collection of change-points for the input time-series, where the model of trend is allowed to change when tranversing through a change-point. However, for time-series with complicated trend, the detected change-points produced by the default mechanism are often sub-optimal, leading to over-fitting or under-fitting for the trend of the training data. In contrast, manually labeled change-points(often could be obtained from outer sources) could sometimes alleviate this weakness and do a better job for time-series modeling and prediction.

In this section, we show the benefit of the flexibility provided by additive model analysis for trend modeling, using the above monthly air arrival dataset as an example.

Modeling and Forecast with Additive Model Analaysis

One can see from the run-sequence plot of the air arrival data that, in April, May and June 2003, the number of international arrivals undergoes a significant drop from its regular level. This is due to the short-term outbreak of SARS cronavirus in Singapore. From the discussion of our previous section, we may include this event when dealing with this time-series. So we store the corresponding values of the two months in a table named by 'SG_SARS_MONTH_TBL' in SAP HANA.

event = cc.table('SG_SARS_MONTH_TBL').cast('Month', 'DATE')

event.collect()

	Month	Name
0	2004-04-01	SARS
1	2004-05-01	SARS
2	2004-06-01	SARS

It is seen from the training data that, local variations increases with the growing number of international arrivals, which is typically regards a single a multiplicative seasonality(if there is any).

from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast

amf = AdditiveModelForecast(seasonality_mode='multiplicative')

amf.fit(data=psg_train, holiday=event)

Then let us check how well the model fits the training data.

psg_fit = amf.predict(psg_train)# obtain fitted value for training data

psg_fit_rename_month = psg_fit.rename_columns({'month': 'Month'})

join_true_and_fit = psg_train.join(psg_fit_rename_month, '"month" = "Month"')

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

fit_res = accuracy_measure(data = join_true_and_fit[['total_international_visitor_arrivals', 'YHAT']], 

                           evaluation_metric = ['rmse', 'mad', 'mape'])

fit_res.collect()

	STAT_NAME	STAT_VALUE
0	MAD	31945.872096
1	MAPE	0.067876
2	RMSE	51823.310879

Next we apply the trained model to the test data, and evaluate the prediction result using popular evaluation metrics.

psg_pred = amf.predict(psg_test)

psg_pred_rename_month = psg_pred.rename_columns({'month': 'Month'})

join_true_and_pred = psg_test.join(psg_pred_rename_month, '"month" = "Month"')

psg_pred_acc = accuracy_measure(data = join_true_and_pred[['total_international_visitor_arrivals', 'YHAT']],

                                evaluation_metric = ['rmse', 'mad', 'mape'])

psg_pred_acc.collect()

	STAT_NAME	STAT_VALUE
0	MAD	119813.052030
1	MAPE	0.093767
2	RMSE	139003.141614

Both MAD and RMSE are around 1e+5, one order smaller compared to the number of arrivals(~ 1e+6), consistent with MAPE(~ 0.09).

Since the time-series of interest in this section has larger variations when its level value goes higher, so it is natual that MSE and RMSE have larger values on the test data since the level is expected to be higher within the prediction period. However, mean absolute percentage error is scale-invariant w.r.t. level of time-series, so compared to that of training data, its higher value on the test data strongly suggests that the training data could be over-fitted by the additive time-series model.

For better illustrattion, the real and predicted values are plotted as follows:

pred_data = psg_pred.collect()

psg_test_data = psg_test.collect()

import matplotlib.pyplot as plt

plt.figure(figsize=(16,5))

plt.plot(psg_test_data['month'], psg_test_data['total_international_visitor_arrivals'])

plt.plot(pred_data['month'], pred_data['YHAT'], color='red')

plt.show()

plt.close()

As discussed in the beginning of this section, when building the above additive model, a set of change-points is detected implicitly using a defaulted mechanism, and the detected change-points may not be optimal. So for the input time-series, after a careful examination of training data, we suggest the following set of changepoints:

'1982-11-01', '1986-01-01', '1991-05-01', '1995-09-01', '1997-10-01', '2000-07-01', '2004-02-01', '2007-12-01', '2009-03-01'.

Let us first mark out the specified change-points in the training data via run-sequence plot.

import pandas as pd

psg_train_data = psg_train.collect()

changepoints = list(pd.to_datetime(pd.Series(['1982-11-01', '1986-01-01', '1991-05-01', '1995-09-01', '1997-10-01', '2000-07-01', '2004-02-01', '2007-12-01', '2009-03-01'])))

event_points = list(pd.to_datetime(pd.Series(['2003-04-01', '2003-05-01', '2003-06-01'])))

cp_data = psg_train_data.loc[[x in changepoints for x in psg_train_data['month']],:]

event_data = psg_train_data.loc[[x in event_points for x in psg_train_data['month']],:]

plt.figure(figsize=(int(16*(408/455)),5))

plt.plot(psg_train_data['month'], psg_train_data['total_international_visitor_arrivals'])

plt.scatter(cp_data['month'], cp_data['total_international_visitor_arrivals'], color='red')

plt.scatter(event_data['month'], event_data['total_international_visitor_arrivals'], color='purple')

plt.legend(['Visitor Arrivals', 'Specified Change-points', 'Event Points'])

plt.show()

plt.close()

Next, we manually specify these points as change-points when building up the additive model for time-series analysis.

amf = AdditiveModelForecast(seasonality_mode='multiplicative',

                            changepoints=['1982-11-01 00:00:00',

                                          '1986-01-01 00:00:00',

                                          '1991-05-01 00:00:00',

                                          '1995-09-01 00:00:00',

                                          '2000-07-01 00:00:00',

                                          '2004-02-01 00:00:00',

                                          '2007-12-01 00:00:00',

                                          '2009-03-01 00:00:00'])

amf.fit(data=psg_train, holiday=event)

Again let us check how well the model fits the training data.

psg_fit2 = amf.predict(psg_train)

psg_fit2_rename_month = psg_fit2.rename_columns({'month': 'Month'})

join_true_and_fit2 = psg_train.join(psg_fit2_rename_month, '"month" = "Month"')

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

fit_res2 = accuracy_measure(data = join_true_and_fit2[['total_international_visitor_arrivals', 'YHAT']], 

                            evaluation_metric = ['rmse', 'mad', 'mape'])

fit_res2.collect()

	STAT_NAME	STAT_VALUE
0	MAD	32760.561142
1	MAPE	0.070333
2	RMSE	52156.801417

So the metrics for model fitting on the training dataset is not as good as the case when using default mechanism for change-point detection, but the result are comparable.

Now we evaluate the re-fitted model on the test data.

	STAT_NAME	STAT_VALUE
0	MAD	93115.156652
1	MAPE	0.073317
2	RMSE	111212.533575

Thanks to the manually specified change-points, all evaluation metrics have improved on the test data. In particular, mean absolute percentage error(MAPE) has been dropped by nearly 2%, making it close the MAPE value on the training data.

We can check the consistency between real and predicted arrivals by drawing their run-sequence plots.

pred2_data = psg_pred2.collect()

plt.figure(figsize=(16,5))

plt.plot(psg_test_data['month'],

         psg_test_data['total_international_visitor_arrivals'])

plt.plot(pred2_data['month'], pred2_data['YHAT'], color='red')

plt.legend(['Actual Arrivals', 'Predicted Arrivals'])

plt.show()

plt.close()

Summary

In this blog post, we have shown readers how additive model analysis faciliates us in handling time-series with trend, seasonality and cyclic patterns. Compared with traditional approaches that rely on stationarity and lagged dependency, the curve-fitting nature of additive time-series models provides us more flexibilities for time-series modeling.

References

[1] Time-Series Modeling and Analysis using SAP HANA Predictive Analysis Library(PAL) through Python Mac...

[2] Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for ...

[3] Portal website: data.gov.sg

[4] https://storage.data.gov.sg/half-hourly-system-demand/resources/half-hourly-system-demand-data-from-...

[5] Dataset website: data.gov.sg/dataset/total-visitor-international-arrivals-to-singapore

By Category

Related Content

Activity Groups

Industry Groups

Influence and Feedback Groups

Interest Groups

Location Groups

Customer Only Groups

Forums

Related Resources

Products

Learning and Support

About

My SAP Profile

My SAP Profile