Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
cancel
Showing results for 
Search instead for 
Did you mean: 
likun_hou
Advisor
Advisor
6,747

In a few related blog posts[1][2], we have shown the analysis of time-series using traditional approaches like seasonal decomposition, ARIMA and exponential smoothing. However, those traditional approaches have some drawbacks for time-series analysis, typically ineffective in dealing time-series with complicated trend, multiple seasonality as well as cyclic patterns(repeated with non-fixed period, like holidays, event days, etc). The aforementioned drawbacks can be largely overcome by a relatively new method called additive model time-series analysis(additive model analysis is often used for short in later context without confusion). This method is provided in the SAP HANA Predictive Analysis Library(PAL), and wrapped up in the Python machine learning client for SAP HANA(hana_ml).

In this blog post, we show readers how to apply additive model analysis to analyze time-series with specific case studies.

The rest of this blog post is organized as follows: firstly we give a brief introduction to additive model analysis, then we go to two specific case studies, inclusive of data description, modeling and forecasting, and finally we summarize the main content of this blog post.

Introduction


Additive model analysis is a newly emerged approach for time-series modeling. Unlike traditional approaches(like ARIMA and exponential smoothing) that explore time-based dependencies among observations, it treats time-series modeling as a curve-fitting problem, and uses an additive model to fit/forecast time-series data. Under this setting, the given time-series would be decomposed into four components: trend, seasonality, cyclic patterns, and a random component. The formula is as follows:

                                   𝑦(𝑡)=𝑔(𝑡)+𝑠(𝑡)+(𝑡)+ϵ(𝑡).

It can handle data with complicated trend(using piecewise linear approximation), multiple seasonality(daily, weekly, yearly and beyond, event if period being non-integer) with strong seasonal effects. Besides, the modeling of cyclic patterns also helps to explain variations with non-fixing period. As a consequence, additive model time-series analysis often has better performance when comparing with other classical methods for time-series modeling/analysis.

Case Study I : Additive Model Analysis for Electricity System Demand in Singapore

 

Dataset Description


The first dataset of our interest in this blog post is the historical electricity system demand for every half-hour period(MW) in Singapore, which is available in Singapore's open data portal[3]. The original data is stored in a .csv file[4], and it has been preprocessed and stored in a table with name 'SG_SYSTEM_DEMAND_DATA_TBL' in SAP HANA. Then, the dataset can be accessed via a ConnectionContext to SAP HANA under hana_ml.

import hana_ml

from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext('lsvxc0103.sjc.sap.corp', 30315, 'PAL_USER', 'Welcome1')

df = cc.table('SG_SYSTEM_DEMAND_DATA_TBL')

data = df.sort('datetime').collect()

 

data

 

 
 datetimesystem_demand_actual
02012-02-06 00:30:004526.68
12012-02-06 01:00:004436.09
22012-02-06 01:30:004361.24
32012-02-06 02:00:004295.23
42012-02-06 02:30:004233.17
.........
759312016-06-05 22:00:005862.00
759322016-06-05 22:30:005756.00
759332016-06-05 23:00:005674.00
759342016-06-05 23:30:005566.00
759352016-06-06 00:00:005446.00

75936 rows × 2 columns


In the following context we draw the run-sequence plot of the full time-series.

import matplotlib.pyplot as plt

plt.figure(figsize=(16, 8))

plt.plot(data['datetime'], data['system_demand_actual'])

plt.title('Figure 1. Overview of the dataset')

plt.show()

plt.close()


A coarse examination of the plot indicates that the data have approximately linear trend and yearly seasonality. Now let us zoom in to see if we can have other findings. In the following we take out data of the first 5 weeks for detailed examination.


df_5wk = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2012-03-12 00:00:00\''.format(df.select_statement))

data_5wk = df_5wk.sort('datetime').collect()

Now the 5-week data has been collected to the python client, and we can draw the run-sequence plot in python.
plt.figure(figsize=(16, 4))

plt.plot(data_5wk['datetime'], data_5wk['system_demand_actual'])

plt.title('Figure 2: Zoom in view of data of 1st 5 weeks')

plt.show()

plt.close()



The plot shows that the data also have strong weekly seasonality, and(weak) daily seasonality caused by the similarity of demand pattern between weekdays(the demand pattern in weekends are also simliar, yet with smaller magnitude).

In summary, the given time-series for analysis contains multiple seasonality, inclusive of yearly seasonality(with non-integer period 365.25). All these characteristics make the analysis of this time-series difficult when using traditional approaches like ARIMA, exponential smoothing as well as seasonal decompose.

Modeling and Forecast using Additive Model Analysis


Additive models analysis is a new method that treats time-series modeling as a curve-fitting problem with respect to time. In contrast, exponential smoothing and ARIMA try model the dependencies of the current data with the past(inclusive of expected values and errors). Besides, additive model analysis use (partial) Fourier series to model periodic patterns in the input data. By doing so, the problem of non-integer periods is easily resolved, and the problem of multiple periods also become non-critical. Moreover, additive model analysis also allows users to input events(usually cyclical, called holidays in the Python API) that could potentially affect the the shape of the curve for given time-series during a short period of time and break up the regular seasonal pattern. This often leads to better modeling for time-series.

Now let us apply additive model analysis to our dataset of electricity system demand. As a standard procedure in machine learning, we firstly split the dataset into two parts for training and testing purposes, respectively.
df_train = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2016-02-18 00:00:00\''.format(df.select_statement))

df_train.sort('datetime').collect()

 datetimesystem_demand_actual
02012-02-06 00:30:004526.68
12012-02-06 01:00:004436.09
22012-02-06 01:30:004361.24
32012-02-06 02:00:004295.23
42012-02-06 02:30:004233.17
.........
706992016-02-17 22:00:005903.00
707002016-02-17 22:30:005745.00
707012016-02-17 23:00:005597.00
707022016-02-17 23:30:005441.00
707032016-02-18 00:00:005321.00

70704 rows × 2 columns



Now let us use the training set to fit a model for additive model analysis.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast # import the python api for additive model analysis from hana_ml

amf = AdditiveModelForecast(seasonality_mode='additive', daily_seasonality='true', weekly_seasonality='true', yearly_seasonality='true')

amf.fit(df_train)

 

 datetimesystem_demand_actual
02012-02-06 00:30:004526.68
12012-02-06 01:00:004436.09
22012-02-06 01:30:004361.24
32012-02-06 02:00:004295.23
42012-02-06 02:30:004233.17
.........
706992016-02-17 22:00:005903.00
707002016-02-17 22:30:005745.00
707012016-02-17 23:00:005597.00
707022016-02-17 23:30:005441.00
707032016-02-18 00:00:005321.00

70704 rows × 2 columns

 

The model fitting procedure takes some time. After it is done, we may apply the trained additive model to the test data, in which case only time-stamps matter irrespective of the demand values.

 

pred_res = amf.predict(df_test)

pred_data = pred_res.sort('datetime').collect()

 

import matplotlib.pyplot as plt

plt.figure(figsize=(16,5))

plt.plot(data['datetime'][50000:], data['system_demand_actual'][50000:])# use partial data for better illustration of the prediction result

plt.plot(pred_data['datetime'], pred_data['YHAT'], color='red')

plt.legend(['Actual Demand', 'Predicted Demand'])

plt.title('Figure 3. Actual Demand VS Predicted Demand')

plt.show()

plt.close()

 


We can join the real demand values with predicted ones, and compute the prediction accuracy measures.

rename_pred = pred_res.rename_columns({'datetime': 'date_time'})

join_pred = df_test.join(rename_pred[['date_time', 'YHAT']], #YHAT is the column that stores the predicted values

                         '"datetime"="date_time"')

join_pred.collect()
 

 datetimesystem_demand_actualdate_timeYHAT
02016-02-18 00:30:005164.02016-02-18 00:30:005218.050311
12016-02-18 01:00:005044.02016-02-18 01:00:005103.471702
22016-02-18 01:30:004943.02016-02-18 01:30:005010.291241
32016-02-18 02:00:004866.02016-02-18 02:00:004937.996240
42016-02-18 02:30:004812.02016-02-18 02:30:004882.707945
...............
52272016-06-05 22:00:005862.02016-06-05 22:00:005777.180270
52282016-06-05 22:30:005756.02016-06-05 22:30:005688.704385
52292016-06-05 23:00:005674.02016-06-05 23:00:005577.197404
52302016-06-05 23:30:005566.02016-06-05 23:30:005452.238527
52312016-06-06 00:00:005446.02016-06-06 00:00:005325.501001

5232 rows × 4 columns



from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

res = accuracy_measure(data = join_pred[['system_demand_actual', 'YHAT']],

                       evaluation_metric = ['rmse', 'mad', 'mape'])

res.collect()

 STAT_NAMESTAT_VALUE
0MAD195.786364
1MAPE0.033426
2RMSE244.762496


MAD and RMSE are both small compared with the demand values in the dataset (~200 vs ~5000), consistent with the mean absolute percent error(MAPE) which is around 3.3%.

For the sake of comparison, we apply AutoExponentialSmoothing on the same set of training data, and then evaluate the performance of the generated model on test data.

For saving computational time and improving model quality, we can pre-specify a few parameters when initializing the AutoExponentialSmoothing instance: since the data contains both trend and seasonality components, triple exponential smoothing should be chosen to handle them; seasonal components do not vary with level, which suggests additive seasonality, so the seasonal parameter should be set to 'additive'.
from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing

aesm = AutoExponentialSmoothing(forecast_model_name='TESM',

                                forecast_num=df_test.count(),

                                seasonal='additive',

                                optimizer_random_seed=1)

df_train_wid = df_train.add_id('ID', 'datetime')

aesm.fit_predict(data=df_train_wid, key='ID',

                 endog='system_demand_actual')

aesm.forecast_.sort('TIMESTAMP').collect()


 TIMESTAMPVALUEPI1_LOWERPI1_UPPERPI2_LOWERPI2_UPPER
014710.426403NaNNaNNaNNaN
124420.592040NaNNaNNaNNaN
234344.739633NaNNaNNaNNaN
344294.175059NaNNaNNaNNaN
454237.244328NaNNaNNaNNaN
.....................
75931759325611.2153243185.3980498037.0326001901.2479539321.182696
75932759335505.2651853079.2159187931.3144531794.9430139215.587358
75933759345402.1568122975.8755767828.4380481691.4798739112.833751
75934759355295.2537622868.7405787721.7669451584.2220919006.285433
75935759365185.5587222758.8136147612.3038311474.1723538896.945092

75936 rows × 6 columns


Again the predicted demand values and the actual ones can be joined for prediction accuracy measure computation.
df_wid = df.add_id('ID', 'datetime')

df_test_wid = cc.sql('SELECT * FROM ({}) WHERE ID >= {}'.format(df_wid.select_statement, df_train.count()))

aesm_join_res = df_test_wid.join(aesm.forecast_[['TIMESTAMP', 'VALUE']],##the VALUE column stores fitted & predicted demand values

                                 'ID = TIMESTAMP')

aesm_join_res.collect()

 IDdatetimesystem_demand_actualTIMESTAMPVALUE
0707042016-02-18 00:00:005321.0707045305.413617
1707052016-02-18 00:30:005164.0707055170.236697
2707062016-02-18 01:00:005044.0707065053.972515
3707072016-02-18 01:30:004943.0707074958.628810
4707082016-02-18 02:00:004866.0707084887.215553
..................
5228759322016-06-05 22:00:005862.0759325611.215324
5229759332016-06-05 22:30:005756.0759335505.265185
5230759342016-06-05 23:00:005674.0759345402.156812
5231759352016-06-05 23:30:005566.0759355295.253762
5232759362016-06-06 00:00:005446.0759365185.558722

5233 rows × 5 columns


aesm_measure = accuracy_measure(data = aesm_join_res[['system_demand_actual', 'VALUE']], 

                                evaluation_metric = ['rmse', 'mad', 'mape'])

aesm_measure.collect()

 
 STAT_NAMESTAT_VALUE
0MAD262.475689
1MAPE0.044364
2RMSE300.193255

One can see from the values of selected accuracy measures that the performance of exponential smoothing is not as good as additive model analysis, which is consistent with our previous analysis on the dataset.
 

Adding Holidays for Improved Forecast


Up until now we haven't talked about the holiday effect in additive model time-series analysis. It is seen from Figure 2 that, in weekends the system demands are usually smallers compared to weekdays, we natually suspect that in holidays the systems demand will also deviate from the regular pattern in weekdays. We verify this by checking the residue of the training data(i.e. differences between fitted and actual values) when additive model analysis is applied.
fit_res = amf.predict(df_train) #obtain the fitting result for the training data

fit_res.sort('datetime').collect()

 
 datetimeYHATYHAT_LOWERYHAT_UPPER
02012-02-06 00:30:004542.9129284225.8635754856.542969
12012-02-06 01:00:004442.6914574116.7978744751.370439
22012-02-06 01:30:004364.0448214051.5496574685.978732
32012-02-06 02:00:004306.4328613994.7447554599.067855
42012-02-06 02:30:004265.9490113936.2210744602.731532
...............
706992016-02-17 22:00:005857.6955875552.0153966172.163877
707002016-02-17 22:30:005755.6626515469.0786206074.457734
707012016-02-17 23:00:005630.2929655292.2857285938.692115
707022016-02-17 23:30:005491.1904235176.5923945808.430988
707032016-02-18 00:00:005350.0536605030.3609515667.936164

70704 rows × 4 columns

We join the fitted values with actual ones for computing their differences.


fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})

fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')

fit_join.collect()


 datetimesystem_demand_actualdate_timeYHATYHAT_LOWERYHAT_UPPER
02012-02-06 00:30:004526.682012-02-06 00:30:004542.9129284225.8635754856.542969
12012-02-06 01:00:004436.092012-02-06 01:00:004442.6914574116.7978744751.370439
22012-02-06 01:30:004361.242012-02-06 01:30:004364.0448214051.5496574685.978732
32012-02-06 02:00:004295.232012-02-06 02:00:004306.4328613994.7447554599.067855
42012-02-06 02:30:004233.172012-02-06 02:30:004265.9490113936.2210744602.731532
.....................
7076992016-02-17 22:00:005903.002016-02-17 22:00:005857.6955875552.0153966172.163877
707002016-02-17 22:30:005745.002016-02-17 22:30:005755.6626515469.0786206074.457734
707012016-02-17 23:00:005597.002016-02-17 23:00:005630.2929655292.2857285938.692115
707022016-02-17 23:30:005441.002016-02-17 23:30:005491.1904235176.5923945808.430988
707032016-02-18 00:00:005321.002016-02-18 00:00:005350.0536605030.3609515667.936164

70704 rows × 6 columns


residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))

import matplotlib.pyplot as plt

plt.figure(figsize=(16,5))

plt.plot(data['datetime'][0:df_train.count()], residue.collect()['RESID'])

plt.title('Figure 4. Overview of the random component of training data')

plt.show()

plt.close()


One can see that there are many(mainly upward w.r.t. zero value) large spiky values in the residual, which indicates that in certain days the actual demand values would fall below the predicted values by a large amount. For further examination, we use the deviation of the residual value from 0 as the measurement of how abnormality an data point is, then the date information of the top 100 result in the dataset is extracted.
abnormal = cc.sql('SELECT * FROM ({}) ORDER BY RESID DESC LIMIT 100'.format(residue.select_statement))

abnormal_dates = abnormal[['datetime']].cast('datetime', 'DATE').drop_duplicates()

abnormal_dates.sort('datetime').collect()

 
 datetime
02013-02-11
12013-02-12
22014-01-31
32014-10-22
42015-02-19
52015-02-20
62016-02-08
72016-02-09

The dates listed above are all public holidays in Singapore(2014-10-22 is Deepavali Day,  others are for Lunar New Year), which indicates that system demand will fall below the regular level if the day for inspection happens to be a public holiday. This justifies our previous suspection.


However, since major public holidays of a year usually shall be announced even before the year actuall starts, it is okay for us to incorporate them in the model building and model prediction phase. Additive model analysis provides us this facility, so we can collect the set of holidays(exclusive of weekends) within the dates range of the input data and store it in SAP HANA.


holidays = cc.table('SG_HOLIDAY_TBL').cast('Date', 'DATE')#The Date column in holiday table is of type 'TIMESTAMP', convert to 'DATE'

holidays.head(5).collect()
 
 DATENAME
02012-01-02NY
12012-01-23CNY
22012-01-24CNY
32012-04-06GF
42012-05-01LD

where NY = New Year, CNY = Chinese New Year, GF = Good Friday, LD = Labor Day.

Now let's re-fit the additive time-series model with the introduction of holidays.
amf.fit(data = df_train, holiday=holidays)

We re-plot the residue component.
fit_res = amf.predict(df_train)

fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})

fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')

residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))

import matplotlib.pyplot as plt

plt.figure(figsize=(16,5))

plt.plot(data['date'][0:df_train.count()], residue.collect()['RESID'])

plt.title('Figure 4. Overview of random component of training data with holiday info.')

plt.show()

plt.close()


We can see from the residue component of the training data that, after public holiday information is introduced, those original excessively large residual values become less frequent, indicating smaller fitting error. In general, the introduction of holidays helps the model in capturing the associated cyclic patterns.

Let us verify whether the prediction error could be reduced as well by apply the re-fitted additive model on the test data.
pred_res2 = amf.predict(df_test)

renm_pred2 = pred_res2.rename_columns({'datetime': 'date_time'})

join_pred2 = df_test.join(renm_pred2[['date_time', 'YHAT']], '"datetime"="date_time"')

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

res = accuracy_measure(data = join_pred2[['system_demand_actual', 'YHAT']],

                       evaluation_metric = ['rmse', 'mad', 'mape'])

res.collect()

 STAT_NAMESTAT_VALUE
0MAD175.825074
1MAPE0.030518
2RMSE220.948932

One can see that all selected accuracy measures improve by a non-negligible amount on the test data thanks to the addition of holiday information:
    • MAD : from ~195.8 to ~175.8,
    • MAPE : from ~0.0335 to ~0.0305,
    • RMSE : from ~224.8 to ~220.9

Case Study II :Additive Model Analysis for International Air Arrivals in Singapore

 

Dataset Description


The data we use in this section is the monthly international visitor arrivals at Changi airport in Singapore[5], available also in Singapore's open data portal[3].  The time-series data have an overall upward trend, while the growth rate could be different within different time-intervals. The data has been preprocessed and stored in HANA database in a table with the name of 'SG_INTER_VISITORS_TBL'.

 monthtotal_international_visitor_arrivals
01978-01-01167016
11978-02-01147954
21978-03-01163199
31978-04-01162400
41978-05-01162667
.........
4502015-07-011519188
4512015-08-011445052
4522015-09-011131967
4532015-10-011246706
4542015-11-011193906

455 rows × 2 columns


The dataset contains two columns, with the 1st column being month info(represented by the 1st day of that month) and 2nd column being to total number of international visitor arrivals. The range of time for this time-series data is from Jan. 1978 to Nov. 2015, with a total number of 455 records.

To proceed further, we divide dataset into training part and testing part, respectively. All records from year 1978 to year 2011, i.e. the heading 408 records, are used for training an additive time-series model, while the rest of records are used for model evaluation.
train_num = 408

test_num = psg.count() - train_num

psg_train = psg.head(train_num)

psg_test = psg.sort('month', desc=True).head(test_num).sort('month')

In the following we examine the training data by drawing its run-sequence plot.
psg_train_data = psg_train.collect()

import matplotlib.pyplot as plt

plt.figure(figsize=(16*(408/455),5))

plt.plot(psg_train_data['month'],

         psg_train_data['total_international_visitor_arrivals'])

plt.show()

plt.close()


One can see that in general the number of air arrivals keep increasing with time. However, the increasing rate is non-stationary, and there are also some intermittent drops that are significant, making the trend of this time-series difficult to model.

Additive model time-series analysis also gives us the flexibility in modeling the trend in a picewise manner. If not specified manually, it will automatically identifying a collection of change-points for the input time-series, where the model of trend is allowed to change when tranversing through a change-point. However, for time-series with complicated trend, the detected change-points produced by the default mechanism are often sub-optimal, leading to over-fitting or under-fitting for the trend of the training data. In contrast, manually labeled change-points(often could be obtained from outer sources) could sometimes alleviate this weakness and do a better job for time-series modeling and prediction.

In this section, we show the benefit of the flexibility provided by additive model analysis for trend modeling, using the above monthly air arrival dataset as an example.
 

Modeling and Forecast with Additive Model Analaysis


One can see from the  run-sequence plot of the air arrival data that,  in April, May and June 2003, the number of international arrivals undergoes a significant drop from its regular level. This is due to the short-term outbreak of SARS cronavirus in Singapore. From the discussion of our previous section, we may include this event when dealing with this time-series. So we store the corresponding values of the two months in a table named by 'SG_SARS_MONTH_TBL' in SAP HANA.
event = cc.table('SG_SARS_MONTH_TBL').cast('Month', 'DATE')

event.collect()

 MonthName
02004-04-01SARS
12004-05-01SARS
22004-06-01SARS

It is seen from the training data that,  local variations increases with the growing number of international arrivals, which is typically regards a single a multiplicative seasonality(if there is any).

from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast

amf = AdditiveModelForecast(seasonality_mode='multiplicative')

amf.fit(data=psg_train, holiday=event)

Then let us check how well the model fits the training data.
psg_fit = amf.predict(psg_train)# obtain fitted value for training data

psg_fit_rename_month = psg_fit.rename_columns({'month': 'Month'})

join_true_and_fit = psg_train.join(psg_fit_rename_month, '"month" = "Month"')

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

fit_res = accuracy_measure(data = join_true_and_fit[['total_international_visitor_arrivals', 'YHAT']], 

                           evaluation_metric = ['rmse', 'mad', 'mape'])

fit_res.collect()
 
 STAT_NAMESTAT_VALUE
0MAD31945.872096
1MAPE0.067876
2RMSE51823.310879

Next we apply the trained model to the test data, and evaluate the prediction result using popular evaluation metrics.
psg_pred = amf.predict(psg_test)

psg_pred_rename_month = psg_pred.rename_columns({'month': 'Month'})

join_true_and_pred = psg_test.join(psg_pred_rename_month, '"month" = "Month"')

psg_pred_acc = accuracy_measure(data = join_true_and_pred[['total_international_visitor_arrivals', 'YHAT']],

                                evaluation_metric = ['rmse', 'mad', 'mape'])

psg_pred_acc.collect()


 STAT_NAMESTAT_VALUE
0MAD119813.052030
1MAPE0.093767
2RMSE139003.141614
 
Both MAD and RMSE are around 1e+5, one order smaller compared to the number of arrivals(~ 1e+6), consistent with MAPE(~ 0.09).

Since the time-series of interest in this section has larger variations when its level value goes higher, so it is natual that MSE and RMSE have larger values on the test data since the level is expected to be higher within the prediction period. However, mean absolute percentage error is scale-invariant w.r.t. level of time-series, so compared to that of training data, its higher value on the test data strongly suggests that the training data could be over-fitted by the additive time-series model.

For better illustrattion, the real and predicted values are plotted as follows:
pred_data = psg_pred.collect()

psg_test_data = psg_test.collect()

import matplotlib.pyplot as plt

plt.figure(figsize=(16,5))

plt.plot(psg_test_data['month'], psg_test_data['total_international_visitor_arrivals'])

plt.plot(pred_data['month'], pred_data['YHAT'], color='red')

plt.show()

plt.close()


As discussed in the beginning of this section, when building the above additive model, a set of change-points is detected implicitly using a defaulted mechanism, and the detected change-points may not be optimal. So for the input time-series, after a careful examination of training data, we suggest the following set of changepoints:

'1982-11-01', '1986-01-01', '1991-05-01', '1995-09-01', '1997-10-01', '2000-07-01', '2004-02-01', '2007-12-01', '2009-03-01'.

Let us first mark out the specified change-points in the training data via run-sequence plot.
import pandas as pd

psg_train_data = psg_train.collect()

changepoints = list(pd.to_datetime(pd.Series(['1982-11-01', '1986-01-01', '1991-05-01', '1995-09-01', '1997-10-01', '2000-07-01', '2004-02-01', '2007-12-01', '2009-03-01'])))

event_points = list(pd.to_datetime(pd.Series(['2003-04-01', '2003-05-01', '2003-06-01'])))

cp_data = psg_train_data.loc[[x in changepoints for x in psg_train_data['month']],:]

event_data = psg_train_data.loc[[x in event_points for x in psg_train_data['month']],:]

plt.figure(figsize=(int(16*(408/455)),5))

plt.plot(psg_train_data['month'], psg_train_data['total_international_visitor_arrivals'])

plt.scatter(cp_data['month'], cp_data['total_international_visitor_arrivals'], color='red')

plt.scatter(event_data['month'], event_data['total_international_visitor_arrivals'], color='purple')

plt.legend(['Visitor Arrivals', 'Specified Change-points', 'Event Points'])

plt.show()

plt.close()


Next, we manually specify these points as change-points when building up the additive model for time-series analysis.
amf = AdditiveModelForecast(seasonality_mode='multiplicative',

                            changepoints=['1982-11-01 00:00:00',

                                          '1986-01-01 00:00:00',

                                          '1991-05-01 00:00:00',

                                          '1995-09-01 00:00:00',

                                          '2000-07-01 00:00:00',

                                          '2004-02-01 00:00:00',

                                          '2007-12-01 00:00:00',

                                          '2009-03-01 00:00:00'])

amf.fit(data=psg_train, holiday=event)

Again let us check how well the model fits the training data.
psg_fit2 = amf.predict(psg_train)

psg_fit2_rename_month = psg_fit2.rename_columns({'month': 'Month'})

join_true_and_fit2 = psg_train.join(psg_fit2_rename_month, '"month" = "Month"')

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

fit_res2 = accuracy_measure(data = join_true_and_fit2[['total_international_visitor_arrivals', 'YHAT']], 

                            evaluation_metric = ['rmse', 'mad', 'mape'])

fit_res2.collect()

 STAT_NAMESTAT_VALUE
0MAD32760.561142
1MAPE0.070333
2RMSE52156.801417


So the metrics for model fitting on the training dataset is not as good as the case when using default mechanism for change-point detection, but the result are comparable.

Now we evaluate the re-fitted model on the test data.

 STAT_NAMESTAT_VALUE
0MAD93115.156652
1MAPE0.073317
2RMSE111212.533575

Thanks to the manually specified change-points, all evaluation metrics have improved on the test data. In particular, mean absolute percentage error(MAPE) has been dropped by nearly 2%, making it close the MAPE value on the training data.

We can check the consistency between real and predicted arrivals by drawing their run-sequence plots.
pred2_data = psg_pred2.collect()

plt.figure(figsize=(16,5))

plt.plot(psg_test_data['month'],

         psg_test_data['total_international_visitor_arrivals'])

plt.plot(pred2_data['month'], pred2_data['YHAT'], color='red')

plt.legend(['Actual Arrivals', 'Predicted Arrivals'])

plt.show()

plt.close()


 

Summary


In this blog post, we have shown readers how additive model analysis faciliates us in handling time-series with trend, seasonality and cyclic patterns. Compared with traditional approaches that rely on stationarity and lagged dependency, the curve-fitting nature of additive time-series models provides us more flexibilities for time-series modeling.