In datasets with multiple features, one typical type of outlier is a record with an extreme value in a numerical feature. In this blog post, we will show how to use statistical tests in the Python machine learning client for SAP HANA (hana_ml) to detect such outliers. Note that the statistical test algorithms are implemented in the SAP HANA Predictive Analysis Library (PAL), while hana_ml provides a friendly Python API for them.
In this blog post, you will learn:
- how to detect outliers in numerical features with variance test
- how to detect outliers with the inter-quartile-range (IQR) test, which is more robust to extreme values
import hana_ml
from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext(address='xx.xx.xx.xx', port=30x15, user='XXX', password='Xxxxxxx')  # detailed account info is hidden
df = cc.table('PAL_GAUSSIAN_2D_DATA_TBL')
df.collect()

data = df.collect()

import matplotlib.pyplot as plt
plt.scatter(data['X'], data['Y'])
plt.show()

Variance test is categorized as a preprocessing algorithm in hana_ml. We import it from hana_ml and apply it to the two feature columns X and Y, respectively.
from hana_ml.algorithms.pal.preprocessing import variance_test

resX = variance_test(df[['ID', 'X']], sigma_num=3)
resY = variance_test(df[['ID', 'Y']], sigma_num=3)
Variance test returns a tuple of two hana_ml DataFrames: the first is the outlier detection result, and the second contains related statistics of the data involved in the detection. The detection result keeps the ID column and adds a new column named 'IS_OUT_OF_RANGE', containing 0s and 1s, where 0 marks an inlier and 1 an outlier. We can inspect the detected outliers in the X values via a SQL query as follows:
x_outlier = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(resX[0].select_statement))
x_outlier.collect()
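For intuition, variance test flags a value as an outlier when it lies more than sigma_num standard deviations from the sample mean. The following is a minimal NumPy sketch of that rule (an illustration only, not the PAL implementation):

```python
import numpy as np

def variance_test_sketch(x, sigma_num=3):
    """Flag values lying more than sigma_num standard deviations from the mean.

    Returns 0/1 flags, mirroring the IS_OUT_OF_RANGE column."""
    x = np.asarray(x, dtype=float)
    mean, std = x.mean(), x.std()
    return (np.abs(x - mean) > sigma_num * std).astype(int)

# 99 points at 0 plus one extreme value: only the extreme value is flagged
flags = variance_test_sketch(np.append(np.zeros(99), 100.0))
print(flags.sum())   # 1
print(flags[-1])     # 1
```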
The detection of outliers in the Y column can be done in a similar way. The two detection results can then be unified into the overall outlier detection result using the union() function for hana_ml DataFrames. However, some data points may be recognized as outliers in both their X and Y values, which creates duplicates in the unified result; we apply the drop_duplicates() function to remove them. The entire procedure is illustrated as follows:
y_outlier = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(resY[0].select_statement))
outlier_id = x_outlier.union(y_outlier)
outlier_id = outlier_id.drop_duplicates()
df_outlier = cc.sql('SELECT * FROM ({}) WHERE ID IN (SELECT ID FROM ({}))'.format(df.select_statement,
outlier_id.select_statement))
Finally, we draw the scatter plot of the detected outliers as follows:
df_inlier = cc.sql('SELECT * FROM ({}) WHERE ID NOT IN (SELECT ID FROM ({}))'.format(df.select_statement,
outlier_id.select_statement))
data_outlier = df_outlier.collect()
data_inlier = df_inlier.collect()
plt.scatter(data_inlier['X'], data_inlier['Y'], label='Inlier')
plt.scatter(data_outlier['X'], data_outlier['Y'], color='red', label='Outlier')
plt.legend()
plt.show()

However, it is known that the effectiveness of variance test is easily affected by the presence of extreme outliers. For example, we add one additional point to the 2D Gaussian dataset used above, as follows:
new_point = cc.sql('SELECT ID+1 AS ID, X+4000 AS X, Y FROM ({}) WHERE ID = 3999'.format(df.select_statement))
new_df = df.union(new_point)
new_df.collect()
The augmented dataset contains a new data point with ID 4000, whose X value is very large (~4000, while the other data points have X values centered around 0, commonly no greater than 5 in absolute value). If we now apply variance test to the X values of the augmented dataset, we obtain the following result:
resX = variance_test(new_df[['ID', 'X']], sigma_num=3)
outlier_X = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(resX[0].select_statement))
outlier_X.collect()
So the variance test on X now detects only the newly added extreme outlier. Intuitively, this happens because the extreme outlier makes the originally detected outliers look much more 'normal'; numerically, variance test depends on the sample mean and variance, both of which are very sensitive to extreme values in the dataset.
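A quick numeric check (plain NumPy, independent of hana_ml, with synthetic data standing in for the X column) makes this sensitivity concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4000)           # stand-in for the original X column
x_aug = np.append(x, 4000.0)        # augmented data with one extreme outlier

print(x.mean(), x.std())            # close to 0 and 1
print(x_aug.mean(), x_aug.std())    # mean and std are pulled up sharply
print(np.percentile(x, [25, 75]))   # quartiles of the original data
print(np.percentile(x_aug, [25, 75]))  # almost unchanged by the outlier
```

The single extreme point inflates the standard deviation by well over an order of magnitude, so the 3-sigma band swallows the moderate outliers, while the quartiles barely move.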
In the following section we introduce an outlier detection approach called the inter-quartile-range (IQR) test, which is much more robust to the presence of extreme outliers.
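Concretely, the IQR test computes the first and third quartiles Q1 and Q3, and flags any value outside [Q1 - multiplier*IQR, Q3 + multiplier*IQR], where IQR = Q3 - Q1. A hedged NumPy sketch of this rule (the PAL implementation may differ in details such as quantile interpolation):

```python
import numpy as np

def iqr_test_sketch(x, multiplier=1.8):
    """Flag values outside [Q1 - multiplier*IQR, Q3 + multiplier*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr_val = q3 - q1
    lower, upper = q1 - multiplier * iqr_val, q3 + multiplier * iqr_val
    return ((x < lower) | (x > upper)).astype(int)

# a ramp of ordinary values plus a moderate and an extreme outlier
x = np.concatenate([np.linspace(-1, 1, 100), [5.0, 4000.0]])
flags = iqr_test_sketch(x)
print(flags.sum())  # 2: the extreme outlier does not mask the moderate one
```

Because quartiles are order statistics, one extreme value shifts them only marginally, which is why the IQR bounds stay put while the mean and variance do not.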
from hana_ml.algorithms.pal.stats import iqr
iqr_resX1 = iqr(df, key='ID', col='X', multiplier=1.8)
iqr_outlier_X1 = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(iqr_resX1[0].select_statement))
iqr_outlier_X1.collect()
iqr_resX = iqr(new_df, key='ID', col='X', multiplier=1.8)
iqr_outlier_X = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(iqr_resX[0].select_statement))
iqr_outlier_X.collect()
iqr_resY = iqr(new_df, key='ID', col='Y', multiplier=1.8)
iqr_outlier_Y = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(iqr_resY[0].select_statement))
iqr_outlier_id = iqr_outlier_X.union(iqr_outlier_Y).drop_duplicates() # merge the result
df_outlier = cc.sql('SELECT * FROM ({}) WHERE ID IN (SELECT ID FROM ({}))'.format(df.select_statement,
iqr_outlier_id.select_statement))
df_inlier = cc.sql('SELECT * FROM ({}) WHERE ID NOT IN (SELECT ID FROM ({}))'.format(df.select_statement,
iqr_outlier_id.select_statement))
data_outlier = df_outlier.collect()
data_inlier = df_inlier.collect()
plt.scatter(data_inlier['X'], data_inlier['Y'], label='Inlier')
plt.scatter(data_outlier['X'], data_outlier['Y'], color='red', label='Outlier')
plt.legend()
plt.show()

The detection result is quite similar to that of variance test on the original dataset, and it is not at all affected by the introduction of the extreme outlier in the X values.