For a long time now we have been waiting to get pivoting functionality with SAP HANA. This was not possible and there have been several attempts to get same functionality with custom sql codes. In this blog post I will describe how pivoting can be achieved via
SAP HANA Python Client API for Machine Learning Algorithms (hana_ml package) without the moving data out of SAP HANA.
With hana_ml version 1.0.7 we can now pivot the data easily and intuitively. This functionality is only supported via hana_ml and not in SAP HANA SQL directly, as the hana_ml wrapper generates the required sql code on the fly.
Lets say I have a SAP HANA table which has sample data as below and we access it via hana_ml dataframe
df_data:
The number of different SKUs is 26
Now we want to run a correlation analysis between the different SKUs to check if their sales are potentially cannibalizing or positively impacting each other.
To do this we need to pivot the data so we have a column for each of the 26 SKUs containing the SALES_VALUE.
#Create the pivoted data as a hana_ml dataframe with pivot_table
dp = df_data.pivot_table(values='SALES_VALUE',index = 'DATE_MEASUREMENT', columns='SKU',aggfunc='sum')
#Format the data to fillin zero and have the right datatype for HANA PAL correlation function
dp = dp.fillna(0,SKUS).cast(SKUS, 'DOUBLE')
This generates a hana_ml dataframe which has SALES_VALUE for each of the 26 SKUs in a column format:
Now we can easily get a correlation matrix between all SKUs as follows, here I am choosing to show only the top 10 SKUs based on their SALES_VALUE but you can run the same analysis for all SKUs very efficiently:
#Focus on only top 10 SKUs
SKUS = list(set(top_10_skus.select('SKU').collect()['SKU']))
from hana_ml.algorithms.pal import stats
corr = stats.pearsonr_matrix(conn_context=conn, data=dp, cols = SKUS)
corr.collect()
This provides the following hana_ml dataframe with correlation coefficient between every pair of SKUs.
We can also easily plot this as a matrix using
hana_ml.visualizers package
Many other plotting capabilities with hana_ml.visualizers is described very nicely in
Arun Godwin Patel's blog post
from hana_ml import visualizers
from hana_ml.visualizers.eda import EDAVisualizer
import matplotlib.pyplot as plt
get_ipython().magic('matplotlib inline')
f = plt.figure(figsize=(10,10))
ax = f.add_subplot()
eda = EDAVisualizer(ax)
ax, corr_data = eda.correlation_plot(data=dp, corr_cols =SKUS, label=True, cmap='RdYlGn')
plt.show()
This produces the following correlation matrix.
Finally we now have a way of pushing down pivoting functionality to SAP HANA without moving data outside or writing customized SQL.