Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
cancel
Showing results for 
Search instead for 
Did you mean: 
yangzhi
Advisor
Advisor
3,429
In time series, an outlier is a data point that is different from the general behavior of remaining data points. In Predictive Analysis Library (PAL) of SAP HANA, we have automatic outlier detection for time series. You can find more details in Outlier Detection for Time Series in PAL.

In PAL, the outlier detection procedure is divided into two steps. In step 1, we get the residual from the original series. In step 2, we detect the outliers from the residual. In step 1, we have an automatic method.

In this blog post, you will learn:

  • introduction of outlier in time series and automatic detection method in PAL

  • some use cases of automatic outlier detection


To make it easy to test the performance of automatic outlier detection, we use Z1 score method (the default method) in step 2 and set the parameter THRESHOLD = 3 (the default value). This parameter value is based on 3 sigma rule.

To make it easy to read and show the results, we call the PAL procedure in Jupyter Notebook. For calling the PAL procedure and plotting the results, we need some functions. We put the functions in the Appendix.

Introduction


In time series, outliers can have many causes, such as data entry error, experimental error, sampling error and natural outlier. Outliers have a huge impact on the result of data analysis, such as the seasonality test. Outlier detection is an important data preprocessing for time series analysis.

In this algorithm, the outlier detection procedure is divided into two steps. In step 1, we get the residual from the original series. In step 2, we detect the outliers from the residual. We focus on the automatic method of step 1 in this blog.

In step 1, we have an automatic method. For the automatic method, we combine seasonal decomposition, linear regression, median filter and super smoother. The processes in the automatic method is shown in the picture below.


Processes of Automatic Method


In the output of this algorithm, we have a result table and a statistic table. In the result table, the residual, outlier score and outlier label are included. In the statistic table, some information of the time series and outlier detection method is included. For automatic method, the final smoothing method in step 1 is shown in the statistic table.

Test Cases


To call the PAL procedure with python, we need to import some python packages.
import numpy as np
from datetime import datetime
from pandas import read_table
import matplotlib.pyplot as plt
import pandas as pd

case 1: smooth data without seasonality


data


The data is from spikey_v.dat.  It is smooth, but without seasonality. The plot of the data is shown with the code below.
str_path = 'your data path'
cols = ['j', 'u', 'v', 'temp', 'sal', 'y', 'mn', 'd', 'h', 'mi']

df = read_table(str_path+'spikey_v.dat' , delim_whitespace=True, names=cols)

df.index = [datetime(*x) for x in zip(df['y'], df['mn'], df['d'], df['h'], df['mi'])]
df = df.drop(['y', 'mn', 'd', 'h', 'mi'], axis=1)
data = np.full(len(df), np.nan)
for i in range(len(df)):
data[i] = df['u'][i]
plt.plot(data)
plt.grid()



automatic outlier detection


We call the procedure _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES to detect outliers automatically. The python code and results are as follows.
Outlier_parameters = {
"AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID":list(range(len(data)))}),pd.DataFrame(data)],axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df,Outlier_parameters,cc)
dfOutlierResults



dfStats


The results are plotted as below.
outlier_result_plot(dfOutlierResults)



outlier_plot(dfOutlierResults)


From the statistic table, we can see that the final smoothing method is median filter, as the time series is quite smooth. From the above results, we find that PAL miss detecting an outlier. This is because there are two big outliers here and the standard deviation becomes large. From the plots of residual and outlier score, we can find four outliers very clearly. We can adjust the threshold or choose other methods in step 2 to find all outliers.



case 2: non-smooth data without seasonality


data


The data is from R package "fpp2". It is the gold daily price data from January 1st 1985 to March 31th 1989. The data is neither smooth nor seasonal. There are some missing values. The missing values are imputed by linear interpolation. The plot of the data is shown with the code below.
str_path = 'your data path'
df = pd.read_csv(str_path+'daily_csv_no_missing_value.csv')
num = len(df)
data = np.full(num,np.nan)
for i in range(num):
data[i] = df['Price'][i]
plt.plot(data)
plt.grid()



automatic outlier detection


We call the procedure _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES to detect outliers automatically. The python code and results are as follows.
Outlier_parameters = {
"AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID":list(range(len(data)))}),pd.DataFrame(data)],axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df,Outlier_parameters,cc)
dfOutlierResults



dfStats


The results are plotted as below.
outlier_result_plot(dfOutlierResults)



outlier_plot(dfOutlierResults)


From the statistic table, we can see that the final smoothing method is super smoother, as the time series is not so smooth. From the above results, we find that PAL detect the outlier at t = 769 successfully and also consider some other points as outliers. From the plots of residual and outlier score, we can find that the outlier at t = 769 is very obvious. We can adjust the threshold or choose other methods in step 2 to only detect the outlier at t = 769.

case 3: smooth data with seasonality


data


The data is synthetic. It is seasonal with period = 40. There are four obvious outliers in the time series. The time series is as below.
import math
random_seed = 3
np.random.seed(random_seed)
cols = 200 # length of time series
cycle = 40
outlier_idx = [30, 45, 73, 126, 159, 173]
timestamp = np.full(cols,np.nan,dtype = int)
for i in range(cols):
timestamp[i] = i
seasonal = np.full(cols,np.nan,dtype = float)
for i in range(cols):
seasonal[i] = math.sin(2*math.pi/cycle*timestamp[i])
const = np.full(cols,2,dtype = float)
noise = np.full(cols,0,dtype = float)
for i in range(cols):
noise[i] = 0.2*(np.random.rand()-0.5)
trend = 0.01*timestamp
outlier = np.full(cols,0,dtype = float)
for i in outlier_idx:
outlier[i] = 4*(np.random.rand()-0.5)
data = seasonal + const + noise + trend + outlier
plt.plot(data)
plt.grid()



automatic outlier detection


We call the procedure _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES to detect outliers automatically. The python code and results are as follows.
Outlier_parameters = {
"AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID":list(range(len(data)))}),pd.DataFrame(data)],axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df,Outlier_parameters,cc)
dfOutlierResults



dfStats


The results are plotted as below.
outlier_result_plot(dfOutlierResults)



outlier_plot(dfOutlierResults)


From the statistic table, we can see that the final smoothing method is median filter and seasonal decomposition, followed by super smoother, as the time series is quite smooth and seasonal. From the above results, we can see that the four obvious outliers are detected by PAL.

case 4: non-smooth data with seasonality


data


The data is monthly ice cream data. The period is 12. You can find the data in ice_cream_interest.csv. There are two obvious outliers in the time series. The plot of the time series is as below.
str_path = 'your data path'
df = pd.read_csv(str_path+'ice_cream_interest.csv')
data = np.full(len(df),np.nan,dtype = float)
for i in range(len(df)):
data[i] = df['interest'][i]
plt.plot(data)
plt.grid()



automatic outlier detection


We call the procedure _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES to detect outliers automatically. The python code and results are as follows.
Outlier_parameters = {
"AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID":list(range(len(data)))}),pd.DataFrame(data)],axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df,Outlier_parameters,cc)
dfOutlierResults



dfStats


The results are plotted as below.
outlier_result_plot(dfOutlierResults)



outlier_plot(dfOutlierResults)


From the statistic table, we can see that the final smoothing method is seasonal decomposition, followed by super smoother, as the time series is seasonal, but not so smooth. From the above results, we can see that the two obvious outliers are detected by PAL.

Conclusions


In this blog post, we describe what is outlier in time series and provide an automatic outlier detection method for time series in PAL. We also provide some examples to show how to call the automatic outlier detection procedure and show the detection results. From the above results, we can see that the automatic method can detect outliers in different time series. Hope you enjoy reading this blog!

The method will also be included in hana-ml. If you want to learn more about the automatic outlier detection method for time series in SAP HANA Predictive Analysis Library (PAL) and hana-ml, please refer to the following links:

Outlier Detection for Time Series in PAL

Outlier Detection for Time Series in hana-ml (automatic method will be included after 2023 Q3)

 

Other Useful Links:


Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA

Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA

Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for ...

Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA


Appendix


SAP HANA Connection


import hana_ml
from hana_ml import dataframe
conn = dataframe.ConnectionContext('host', 'port', 'username', 'password')

Functions for Table


def createEmptyTable(table_name, proto, cc):
with cc.connection.cursor() as cur:
try:
joint = []
for key in proto:
joint.append(" ".join(['"{:s}"'.format(key), proto[key]]))
cur.execute('CREATE COLUMN TABLE %s (%s);' %
(table_name, ",".join(joint)))
except:
print(
f"\"CREATE TABLE {table_name}\" was unsuccessful. Maybe the table has existed.")

def dropTable(table_name, cc):
with cc.connection.cursor() as cur:
try:
cur.execute(f"DROP TABLE {table_name}")
except:
print(f"\"DROP TABLE {table_name}\" was unsuccessful. Maybe the table does not exist yet.")


def createTableFromDataFrame(df, table_name, cc):
dropTable(table_name, cc)
dt_ml = dataframe.create_dataframe_from_pandas(cc, df, table_name=table_name, table_structure={"MODEL_CONTENT":"NCLOB"})
# dt_ml = dataframe.create_dataframe_from_pandas(cc, df, table_name=table_name, table_structure={"COL1":"CLOB"})
return dt_ml

Function of Calling the PAL Procedure of Outlier Detection for Time Series


def OutlierDetectionForTimeSeries(df, parameters, cc,
data_table='ZPD_PAL_DATA_TBL',
parameter_table='ZPD_PAL_PARAMETERS_TBL',
result_table='ZPD_PAL_RESULT_TBL',
stats_table='ZPD_PAL_STATS_TBL',
metri_table='ZPD_PAL_METRI_TBL'):

# Input table
createTableFromDataFrame(df, data_table, cc)

# Result table
dropTable(result_table, cc)
createEmptyTable(result_table, {
"TIMESTAMP": "INTEGER","RAW_DATA":"DOUBLE","RESIDUAL":"DOUBLE","OUTLIER_SCORE":"DOUBLE","IS_OUTLIER":"INTEGER"}, cc)

# Metri table
dropTable(metri_table, cc)
createEmptyTable(metri_table, {
"STAT_NAME": "NVARCHAR(1000)","VALUE":"DOUBLE"}, cc)

# Stats table
dropTable(stats_table, cc)
createEmptyTable(stats_table, {
"STAT_NAME": "NVARCHAR(1000)", "STAT_VALUE": "NVARCHAR(1000)"}, cc)

# Parameter table
dropTable(parameter_table, cc)
createEmptyTable(parameter_table, {"PARAM_NAME": "nvarchar(256)", "INT_VALUE": "integer",
"DOUBLE_VALUE": "double", "STRING_VALUE": "nvarchar(1000)"}, cc)

if parameters:
with cc.connection.cursor() as cur:
for parName, parValue in parameters.items():

if isinstance(parValue, str):
parValue = f"'{parValue}'"
parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, { parValue if isinstance(parValue,str) else 'NULL'}"
cur.execute(
f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

elif isinstance(parValue,list):
for x in parValue:
if isinstance(x, str):
x = f"'{x}'"
parametersSQL = f"{x if isinstance(x,int) else 'NULL'}, {x if isinstance(x,float) else 'NULL'}, { x if isinstance(x,str) else 'NULL'}"
cur.execute(
f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
else:
parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, { parValue if isinstance(parValue,str) else 'NULL'}"
cur.execute(
f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

else:
print("No parameters given using default values.")

sql_str = f"\
do begin \
lt_data = select * from {data_table}; \
lt_control = select * from {parameter_table};\
CALL _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES(:lt_data, :lt_control, lt_res, lt_stats, lt_metri); \
INSERT INTO {result_table} SELECT * FROM :lt_res; \
INSERT INTO {stats_table} SELECT * FROM :lt_stats;\
INSERT INTO {metri_table} SELECT * FROM :lt_metri; \
end;"

with cc.connection.cursor() as cur:
cur.execute(sql_str)

return cc.table(result_table).collect(), cc.table(stats_table).collect(), cc.table(metri_table).collect()

Functions of Plotting Results


def outlier_result_plot(dResults):
dResults.sort_values(by = list(dResults)[0], inplace = True, ascending = True)
raw_data = np.array(dResults['RAW_DATA'])
residual = np.array(dResults['RESIDUAL'])
outlier_score = np.array(dResults['OUTLIER_SCORE'])
is_outlier = np.array(dResults['IS_OUTLIER'])
plt.figure(figsize = (24,4.5))
plt.subplot(1,4,1)
plt.plot(raw_data)
plt.grid()
plt.title('RAW_DATA')
plt.subplot(1,4,2)
plt.plot(residual)
plt.grid()
plt.title('RESIDUAL')
plt.subplot(1,4,3)
plt.plot(outlier_score)
plt.grid()
plt.title('OUTLIER_SCORE')
plt.subplot(1,4,4)
plt.plot(is_outlier)
plt.grid()
plt.title('IS_OUTLIER')

def outlier_plot(dResults):
dResults.sort_values(by = list(dResults)[0], inplace = True, ascending = True)
raw_data = np.array(dResults['RAW_DATA'])
is_outlier = np.array(dResults['IS_OUTLIER'])
outlier_idx = np.array([],dtype = int)
for i in range(len(is_outlier)):
if is_outlier[i] == 1:
outlier_idx = np.append(outlier_idx,i)
plt.plot(raw_data)
plt.scatter(outlier_idx,raw_data[outlier_idx],color = 'red')
plt.grid()
plt.title("series and outlier")
2 Comments