Updates for the Data Scientist, building SAP HANA ...

ChristophMorgen · ‎02-24-2020

In April 2019, the first version of the R and Python client APIs was released, exposing SAP HANA's embedded Machine Learning capabilities to Data Scientists in the languages they know best (see this blog).

Letting Data Scientists stay in their familiar working environment, and so apply their expertise and productivity to SAP HANA embedded applications, has proven a very popular and successful approach. It has led to numerous customer interactions, blogs published by expert Data Scientists, and the integration of the Python client API for HANA-ML into SAP Data Intelligence to serve HANA embedded use cases.

Last year's updates of the HANA-ML Python package included new dataframe methods, for example for data pivoting and for converting pandas dataframes to HANA dataframes. In addition, the coverage of supported PAL and APL algorithms broadened significantly, and exploratory data analysis visualizations were introduced.

Updated HANA-ML packages

With SAP HANA 2 SPS04 revision 46, a new update to the HANA-ML R (1.0.7) and Python (1.0.8) APIs has been released. The Python API is now also available from PyPI as the hana-ml package, which makes it easy to install and consume.
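After installing the package from PyPI (for example with pip install hana-ml), it can be handy to confirm which version ended up in the active environment. The following is a minimal sketch using only the Python standard library (Python 3.8 or later); the helper name installed_version is made up for this illustration and is not part of HANA-ML.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "hana-ml" is the distribution name on PyPI; the result is None if it is not installed.
print(installed_version("hana-ml"))
```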

Python API HANA-ML 1.0.8 enhancements

The Predictive Analysis Library (PAL) algorithm coverage is now complete. Newly added are preprocessing functions (discretize, scaling, sampling, ...), statistics functions (factor analysis, ...), time series analysis functions (forecast accuracy measures, ...) and other previously missing functions (ABC analysis, TSNE, ...). The exploratory data analysis visualizations have been enhanced with a data profiler function.

As a key productivity enhancement for Data Scientists, cross validation and optimal parameter selection are now supported via the Python interface for PAL functions such as the Random Forest and Hybrid Gradient Boosting classifiers. Code example:

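# Assumes conn is an open hana_ml ConnectionContext and df is a HANA dataframe.
from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier

# fold_num sets the number of cross validation folds;
# param_range lists the parameter values to search over.
hgbc = HybridGradientBoostingClassifier(conn_context=conn,
           n_estimators=4, split_threshold=0,
           learning_rate=0.5, fold_num=5, max_depth=6,
           evaluation_metric='error_rate', ref_metric=['auc'],
           param_range=[('learning_rate', [0.1, 0.45, 1.0]),
                        ('n_estimators', [4, 3, 10]),
                        ('split_threshold', [0.1, 0.45, 1.0])])

# Train on the feature columns, with LABEL as the target column.
hgbc.fit(df, features=['ATT1', 'ATT2', 'ATT3', 'ATT4'], label='LABEL')

# Fetch the resulting statistics to the client.
hgbc.stats_.collect()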
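To get a feeling for how much work such a parameter search can imply, the plain Python sketch below expands the param_range triples from the example into candidate values and counts the model fits. It rests on two assumptions that the text above does not state: that each triple is read as [start, step, end], and that every combination is evaluated exhaustively in each fold. The helper expand_range is hypothetical and not part of HANA-ML; nothing here talks to SAP HANA.

```python
from itertools import product
from math import floor


def expand_range(start, step, end):
    """Return start, start + step, ... up to and including end.

    Assumption: a triple means [start, step, end]. A small tolerance
    guards against floating point rounding at the upper bound.
    """
    count = floor((end - start) / step + 1e-9) + 1
    return [round(start + i * step, 10) for i in range(count)]


# Same triples and fold count as in the fit example above.
param_range = [('learning_rate', [0.1, 0.45, 1.0]),
               ('n_estimators', [4, 3, 10]),
               ('split_threshold', [0.1, 0.45, 1.0])]
fold_num = 5

names = [name for name, _ in param_range]
values = [expand_range(*triple) for _, triple in param_range]

# One dict per candidate parameter combination.
grid = [dict(zip(names, combo)) for combo in product(*values)]

# Each candidate is trained once per fold in an exhaustive search.
total_fits = len(grid) * fold_num
print(len(grid), total_fits)  # prints: 27 135
```

Under these assumptions, three values per parameter give 27 candidates and 135 model fits, all of which run inside SAP HANA rather than on the client.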

The dataframe methods have been further improved, with data manipulation options when saving to dataframes and when pivoting data. PAL and APL models can now be saved with a unified model storage function, which writes your predictive models to your sandbox model repository tables.

Since version 1.0.7, the APL support in HANA-ML covers gradient boosting-based classification (binary and multinomial) and regression, time series forecasting and clustering.

You can find the Python API for HANA-ML documentation here. Sample scripts and notebooks for both PAL and APL can be found in the hana-ml-samples repository.
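A minimal sketch of how saving and reloading a model through the model storage might look is shown below. It follows the ModelStorage interface as documented for the hana_ml package (hana_ml.model_storage); parameter names and defaults may differ between versions, so treat the details as assumptions and check the reference documentation for your release. The wrapper function save_and_reload is hypothetical, and conn and model are assumed to be an open connection context and an already fitted PAL or APL model.

```python
def save_and_reload(conn, model, name):
    """Save a fitted model under `name` and load it back from the model storage.

    Assumes `conn` is an open hana_ml ConnectionContext and `model` is a
    fitted PAL or APL model object.
    """
    # Imported lazily so the sketch can be read without hana_ml installed.
    from hana_ml.model_storage import ModelStorage

    storage = ModelStorage(connection_context=conn)

    model.name = name                 # the storage identifies models by name
    storage.save_model(model=model)   # writes to the model repository tables

    print(storage.list_models())      # overview of the stored models

    # Without an explicit version, the latest stored version is expected
    # to be returned (an assumption based on the documented default).
    return storage.load_model(name=name)
```

The reloaded object can then be used for prediction in a new session without retraining.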

R API HANA-ML 1.0.7 enhancements

The new R client API package is a major update in terms of PAL function coverage. It now includes clustering functions (e.g. DBSCAN), more classifiers (e.g. Hybrid Gradient Boosting), more regressors, association analysis, more time series forecasting functions (e.g. Auto-ARIMA) and preprocessing functions (discretize, scaling, sampling, ...). PAL predictive models can be saved to model storage tables in your SAP HANA environment.

As a key enhancement, the R client API now supports a JDBC connector, providing more stable and improved HANA connectivity. The HANA dataframe in R also includes many new functions, such as the describe function, which returns a set of descriptive univariate statistics for the columns of the underlying data, calculated in SAP HANA.

The complete overview and details of the functions supported in the R API can be found in the reference documentation here.

Tooling support for Data Scientists and ML scenario operationalization

While you can certainly install and consume the HANA-ML packages in your local or custom environment, SAP Data Intelligence provides a JupyterLab Python environment that is prepared out of the box to leverage SAP HANA's embedded Machine Learning capabilities. See the following blog for details.

The R and Python client APIs connect the Data Scientist's workbench with SAP HANA, which in this case is the target platform for operationalizing an ML scenario. This is a major asset, because it allows the productivity of Data Scientists to be applied directly in SAP HANA.

Additional recommended reads

Python client API blogs