Application Development Discussions
Join the discussions or start your own on all things application development, including tools and APIs, programming models, and keeping your skills sharp.
cancel
Showing results for 
Search instead for 
Did you mean: 

Community challenge alert! The SAP HANA Cloud Machine Learning Challenge is about to start!

susenpoppe
Product and Topic Expert
Product and Topic Expert
15,221

Prepare yourself (Prerequisites blogpost) and watch our kick-off call or read our kick-off blogpost.

During the challenge you will get the opportunity working hands-on with SAP HANA Cloud’s machine learning capabilities 🤖🎯.

Besides our open office hours, this thread is the place to ask questions and share your experiences with other machine learning enthusiasts. The challenge team is happy to answer any question you may have in the comments below.

Will you accept the challenge?

55 REPLIES 55

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,962

Hi,

You’ve signed up for ML Community challenge "I quit! - How to prevent employee churn” kick-off on November 28th, that’s fantastic!

Review this blog for preparation details: Prerequisites for the SAP HANA Cloud Machine Learning Challenge “I quit!” – How to prevent employee .... As you will need an SAP HANA Cloud instance (Free tier or trial are not compatible with the challenge contents), and if you don’t have an adequate set-up yourselves, please reach out to the team SAPHANACloud@sap.com to get access to a suitable instance.

Looking forward to see you joining our challenge!

Best regards,

Christoph

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,917

Join our open office hour calls, starting today

During the challenge you can join our experts to get your open questions answered and address any issues you are dealing with regarding the challenge. Please find the slots offered below:

Or simply ask you questions here in the forum!

WenjingLiu
Participant
0 Kudos
12,872

Hi,
I would need to check the data uploaded onto the HANA Cloud database. How can I access SAP HANA Cloud Cockpit or SQL console (SAP HANA Data Explorer) for the following instance? Thanks!

8c7eed2e-f460-4a82-abfa-e3ede36923d7.hna1.prod-eu10.hanacloud.ondemand.com

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,810

Hi, we haven't set up full BTP applications like HANA Cloud Database Explorer available on our side for the participants. However, if you have a local HANA Cockpit / Database Explorer, you can simply establish a SQL Connection to the HANA Cloud instance from there. Alternatively, you can use a tools like DBeaver locally installed and connect (see blog). Certainly from the python side, once the data is uploaded, and you have the dataframe created, you can use dataframe methods to access and check the data, e.g. mydataframe.head(10).collect() would issue a Select top 10 * from <table>. you can see some method examples in Yannick's blog at "How can we explore our data and react to data quality issues early".

Christoph

12,796

If you have a BTP account, e.g. in BTP Trial or a Free Tier, then could you try if generic DBX opens for you at https://hana-cockpit.cfapps.us10.hana.ondemand.com/hrtt/sap/hana/cst/catalog/index.html?

Alternatively, you can use DBeaver -- an open-source IDE for databases: https://blogs.sap.com/2020/04/20/secure-connection-from-dbeaver-to-sap-hana-cloud/

Regards,
-Witalij

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,749

As another helper to get everyone started, find another employee churn data upload and exploration using HANA dataframes notebook in our challenge repository

Sergiu
Contributor
0 Kudos
12,731

Hi.
In pandas dataframe data can be modified with inplace = True or by assigning results to the same dataframe, example df_data = df_data.replace(to_replace = np.nan, value = 'MSV').
In hana dataframe COLLECT allows to show data, but the inplace = True is no supported and assigning  modified data back to same dataframe is not allowed, example data_hdf = data_hdf.fillna('MSV').collect() is not supported.
Does that mean that data has to be modified in pandas and updated to hana wtih dataframe.create_dataframe_from_pandas?
COLLECT shows data. What are the methods that allow to modify hana dataframe? Please recommend notebooks with examples on how to modify hana dataframe.
Thanks. Sergiu.

Vitaliy-R
Developer Advocate
Developer Advocate
12,728

HANA dataframe does not store data, as it is with the Pandas or R dataframe. I had it explained for example here: https://blogs.sap.com/2020/11/10/multi-model-in-hana_ml-2.6-for-python-dataframes/

HANA dataframe in Python represents a SELECT statement to be executed in SAP HANA database and it is executed only once a method like `collect()` is called.

There is a method `save()` that creates a table or a db view for data returned by a HANA dataframe to the database :

https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.06/en-US/hana_ml.dataframe.html#hana_m...

0 Kudos
12,701

Pandas dataframe:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df['C] = 1
df['C'] = df['C] + 2

Sergiu_0-1669906283168.png

hana-ml Tutorial - Dataframe:
d_train_new = d_train.add_constant('C',1) 

I couldn't find in tutorial how to modify a column, tried as in pandas but I got an error.
d_train_new['C'] = d_train_new['C'] + 1
TypeError: __getitem__ argument not understood: 'C'

Please help with examples on how to modify hana dataframe column. 

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,682

you can simply use the select dataframe method to create new columns

df.select('*', ('B * 4', 'D')).collect()

Sergiu
Contributor
0 Kudos
12,738

In presented notebooks in hana model.fit the parameter key is mandatory. 
In Python machine libraries key parameter is not necessary.
What is the role of the key hana libraries?
MODEL_SEARCH.fit(data=d_train, key= ID,
label=target,
partition_method='stratified',
partition_random_state=1,
stratified_column=target,
build_report=False)

Thanks. Sergiu.

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,711

Hi Sergiu, key is actually not mandatory with fitting a model. However if the input dataframe to a fit-call contains and Id column, you either have to drop it from the dataframe beforehand, or identify the column as key / id column so it's not treated as a feature column. With the predict / score methods e,g. of unified classification, an id-column is required. 

Hope this explains.

Christoph

Sergiu
Contributor
0 Kudos
12,455

Estimate car price - Load data to SAP HANA 

Pandas:
# df - pandas dataframe
df_data
.insert(0, 'CAR_ID', df_data.reset_index().index)
OR
df_data[COL_CAR_ID] = df_data.index

Please help with examples on how to copy in hana_ml.dataframe index to column.

Thanks. Sergiu.

12,356
# df - hana_ml.dataframe
df
.add_id(id_col='ID')

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,316

Great you found the solution in between!!!  😊

YannickSchaper
Product and Topic Expert
Product and Topic Expert
12,319

Today we share additional data, which was discovered from our data engineer. The additional data for example includes if an employee recently updated their LinkedIn Profile or took part in an HR training. During the open office calls this week on Tuesday/Thursday (see this post for timings), we will also demo how to use Unified Classification with PAL Hybrid Gradient Boosting (see example) and the new PAL AutoML-Classification feature (see example, blog, community call introduction, more automl examples), as two additional approaches to tackle the employ churn challenge and achieve best results …

Have fun working on the challenge!

susenpoppe
Product and Topic Expert
Product and Topic Expert
0 Kudos
12,239

Join us in 5 minutes for a short demo and to address your open questions in our open office hour.

0 Kudos
12,234

Sergiu_0-1670313891841.png

 

susenpoppe
Product and Topic Expert
Product and Topic Expert
0 Kudos
12,211

12,014

Today we have our second open office session (see this post for timings), have you already include the additonal  data to improve your model? Maybe HRTRAINING, SICKDAYS, ... adds additional insights, you should try it out! 

Sergiu
Contributor
0 Kudos
12,151

hana_ml.dataframe - replace allows to replace values in the same column from a dictionary. 

Examples
>>> df.collect()
Aa Bb Cb
0 0 5 a
1 10 0 b
2 2 7 c
3 3 8 d
4 4 9 e

>>> df.replace(to_replace={'Aa': {0: 100, 4: 400}}).collect()
Aa Bb Cb
0 100 5 a
1 10 0 b
2 2 7 c
3 3 8 d
4 400 9 e
Pandas
>>> df = pd.DataFrame([10,20,30,40,50], columns =['speed'])
>>> df.loc[(df["speed"] <= 20), "gear"] = 'd1'
>>> df.loc[(df["speed"] > 20) & (df["speed"] <= 40), "gear"] = 'd2'
>>> df.loc[(df["speed"] > 40), "gear"] = 'd3'
>>> df

  speed gear
0 10 d1
1 20 d1
2 30 d2
3 40 d2
4 50 d3

Please help with examples for ml_hana.dataframe on how to create an additional column with values based on conditions of other columns as in pandas.

Thanks. Sergiu.

 

Sergiu
Contributor
0 Kudos
12,241

hana_ml.dataframe.DataFrame - replace allows to replace values from same column from a dictionary.

Examples
--------
>>> df.collect()
    Aa  Bb  Cb
    0   0   5   a
    1   10  0   b
    2   2   7   c
    3   3   8   d
    4   4   9   e
>>> df.replace(to_replace={'Aa': {0: 100, 4: 400}}).collect()
    Aa  Bb  Cb
    0   100 5   a
    1   10  0   b
    2   2   7   c
    3   3   8   d
    4   400 9   e
Pandas
>>>df = pd.DataFrame([10,20,30,40,50], columns =['speed'])
>>>df.loc[(df["speed"] <= 20), "gear"] = 'd1'
>>>df.loc[(df["speed"] > 20) & (df["speed"] <= 40), "gear"] = 'd2'
>>>df.loc[(df["speed"] > 40), "gear"] = 'd3'
>>>df
speed gear
0 10 d1
1 20 d1
2 30 d2
3 40 d2
4 50 d3

 Please provide examples for hana_ml.dataframe.DataFrame on how to create conditional column based on values of another column.

Thanks. Sergiu.

0 Kudos
12,213
df = pd.DataFrame([10,20,30,40,50], columns =['SPEED'])
df.loc[(df["SPEED"] <= 20), "GEAR"] = 'd1'
df.loc[(df["SPEED"] > 20) & (df["SPEED"] <= 40), "GEAR"] = 'd2'
df.loc[(df["SPEED"] > 40), "GEAR"] = 'd3'
df
SPEED GEAR
0 10 d1
1 20 d1
2 30 d2
3 40 d2
4 50 d3

hana_tab = "T_GEAR"
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, pandas_df = df,
table_name = hana_tab,
force = True,
replace = False)

df_remote_new = df_remote.select (['SPEED','GEAR',("""CASE WHEN SPEED <=20 THEN 'g1'
WHEN SPEED >20 AND SPEED <=40 THEN 'g2'
WHEN SPEED >40 THEN 'g3'
END""",'GEAR_HANA')])

df_remote_new.collect()

SPEED GEAR GEAR_HANA
0 10 d1 g1
1 20 d1 g1
2 30 d2 g2
3 40 d2 g2
4 50 d3 g3

 

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,205

Great to see pivoting from pandas dataframe code towards the HANA dataframe approach! 🙂 Excellent!

AndreasForster
Product and Topic Expert
Product and Topic Expert
12,204

In one of the open office hours a question came up on creating new features out of the existing columns, without data extraction.

Here is an example, that bundles Finland, Sweden and Norway for example into Scandinavia in a new column

df_remote = conn.table('EMPLOYEE_CHURN_ORG')
df_rem_prepped = df_remote.select(['EMPLOYEE_ID', 'AGE', \
("""CASE WHEN CURRENT_COUNTRY IN ('Sweden', 'Norway', 'Finland') THEN 'Scandinavia'
WHEN CURRENT_COUNTRY = 'UK' THEN 'United Kingdom'
WHEN CURRENT_COUNTRY IS NULL THEN 'Unknown'
ELSE CURRENT_COUNTRY END""", 'REGION_LOCAL')])
df_rem_prepped.select('REGION_LOCAL').distinct().collect()

0 Kudos
12,171

How to include in result columns from list ls_col =['EMPLOYEE_ID', 'AGE']  instead of writing directly in select?

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,168

You can make use of Python function strings like str.format() or f-strings  (python background here) to apply parameters with hana-ml / dataframe methods.

See a hana-ml example here (not exact match to ls_col)

id='####'
hdf_new=hdf_remote.filter('"uuid" = \'{}\''.format(id)).sort('date', desc=False)

 

12,165
ls_col =['SPEED', 'GEAR']
#remove [ and ] from statement
df_remote_new = df_remote.select (LS_COL,("""CASE WHEN SPEED <=20 THEN 'g1'
WHEN SPEED >20 AND SPEED <=40 THEN 'g2'
WHEN SPEED >40 THEN 'g3'
END""",'GEAR_HANA'))

# include all fields
df_remote_new = df_remote.select ('*',("""CASE WHEN SPEED <=20 THEN 'g1'
WHEN SPEED >20 AND SPEED <=40 THEN 'g2'
WHEN SPEED >40 THEN 'g3'
END""",'GEAR_HANA'))

 

Sergiu
Contributor
0 Kudos
12,097

HANA Cloud - Predicitive Analysis Library Hands On

from hana_ml.visualizers.dataset_report import DatasetReportBuilder
datasetReportBuilder = DatasetReportBuilder()
datasetReportBuilder.build(df_remote, key="PRODUCT_ID")

ERROR:
File ~\Anaconda3\envs\cln_ml_hana\lib\site-packages\hana_ml\visualizers\dataset_report.py:480, in DataAnalyzer.get_numeric_variable_distribution_data(self, column, bins) 478 if bins > 1: 479 bins = bins - 1 --> 480 _, _, bin_data = distribution_plot(data=data_, 481 column=column, 482 bins=bins) 483 x_data = list(bin_data['BANDING']) 484 y_data = list(bin_data['COUNT']) File ~\Anaconda3\envs\cln_ml_hana\lib\site-packages\hana_ml\visualizers\eda_plotly.py:138, in distribution_plot(data, column, bins, title, x_axis_label, y_axis_label, x_axis_fontsize, x_axis_rotation, debrief, rounding_precision, replacena, fig, subplot_pos, **kwargs) 136 bin_data = bin_data.append({"BANDING": item, "COUNT": 0}, ignore_index=True) 137 bin_data.sort_values(by="BANDING", inplace=True) --> 138 trace = go.Bar(x=bin_data['BANDING'], y=bin_data['COUNT']) 139 if fig: 140 fig.add_trace(trace, row=subplot_pos[0], col=subplot_pos[1]) NameError: name 'go' is not defined

Please help with a solution for the above error.

Thanks. Sergiu.

 

fazam
Product and Topic Expert
Product and Topic Expert
12,071

My suggestion would be to make sure plotly is working properly. Try adding the following code line

import plotly.graph_objs as go

Also, check to make sure that ipywidgets is installed.

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,011

Hi Sergiu, we've just released a new hana-ml fix version on pypi to address the plotly dependency issue observed.

Christoph

0 Kudos
11,999

#| top_k_attributions : int, optional
#| Specifies the number of features with highest attributions to output
>>> result = uc_rdt.predict(df_predict, key = "ID", top_k_attributions=10)

Please help to interpret when top_k_attributions = 1 OR top_k_attributions = 10 calculates same CONFIDENCE and outputs same number of features in shapley_explainer.

 

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
11,994

The confidence is the overall predicted classification decision confidence (over all features). As we may have a large number of input features (e.g. >100), we want to restrict the output the feature values and their relative contribution to the local predicted classification decision to the most important ones, default the top 10. if you top_k_attributions=1 you would only output the single top most important feature and its relative importance.

Example Predict with reason_code and top_k_attribution for the PredictiveQuality Tutorial example


# Explore test data-subset predictions applying the trained model
features = hdf_test.columns
features.remove('QUALITY')
features.remove('PRODUCT_ID')

# Using the predict-method with our model object hgbc
pred_res = hgbc.predict(hdf_test.head(1000), key='PRODUCT_ID', features=features, impute=True, top_k_attributions=3 )

# Review the predicted results
pred_res.select('PRODUCT_ID', 'SCORE', 'CONFIDENCE', 'REASON_CODE' ).head(3).collect()

 

0 Kudos
11,994

I suppose Confusion Matrix from UnifiedReport(model).build().display()  is build from d_train (seen data) and only for part of d_train. Confusion Matrix total  = 20% * d_train.count().  How to include all d_train or increase the %?

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
11,987

Referring to our tutorial code example about predictive quality

print('Size of val subset: ' + str(hdf_val.count()))
Size of val subset: 12500

 If I now print the confusion matrix after the fit

display(hgbc.confusion_matrix_.collect())
ACTUAL_CLASS | PREDICTED_CLASS| COUNT
0
0
9298
0
1
704
1
0
887
1
1
1611

That's exactly the 12500 we passed in to the fit as the validation subset, along with the 100k rows passed in to train the model.

0 Kudos
11,739

PAL Tutorial - Unified Classification Hybrid Gradient Boosting - PredictiveQuality Example.ipynb 

I couldn't find fit.hgbc cell with with  validation set hdf_val.

Please update the notebook.

 

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
11,457

If you download the notebook, in the hgbc.fit call, we input the union of hdf_val and hdf_train and with partition_method specify we've predefined the val-partition used during fit-validation. The fit method in unified classification, validates on pre-defined validation subset or can use part of the training-hdf.

hgbc.fit(data=df_trainval, 

         key='PRODUCT_ID',

         ...

       partition_method='user_defined', purpose='TRAIN_VAL_INDICATOR')

Sergiu
Contributor
0 Kudos
12,087

Please help with examples regarding how to extract features contribution from UnifiedClassification('HybridGradientBoostingTree') .

In other models set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'}) can be used.

Thanks. Sergiu.

ChristophMorgen
Product and Topic Expert
Product and Topic Expert
12,082

The UnifiedClassification predict method, supports attribution_method (Saabas or Tree-SHAP) and top_k_attributions to output with the predict-call. Here you can find an example

# Using the predict-method with our model object hgbc
pred_res = hgbc.predict(df_test.head(1000), key='uuid', features=features, impute=True )

# Review the predicted results
pred_res.select('uuid', 'SCORE', 'CONFIDENCE', 'REASON_CODE', 
                ('json_query("REASON_CODE", \'$[0].attr\')', 'Top1'), 
                ('json_query("REASON_CODE", \'$[0].pct\')', 'PCT_1') ).head(3).collect()