2022 Nov 17 7:57 AM - edited 2022 Nov 30 7:53 AM
Prepare yourself (Prerequisites blogpost) and watch our kick-off call or read our kick-off blogpost.
During the challenge you will get the opportunity to work hands-on with SAP HANA Cloud’s machine learning capabilities 🤖🎯.
Besides our open office hours, this thread is the place to ask questions and share your experiences with other machine learning enthusiasts. The challenge team is happy to answer any question you may have in the comments below.
Will you accept the challenge?
2022 Nov 28 3:34 PM
Hi,
You’ve signed up for the ML Community challenge “I quit! - How to prevent employee churn” kick-off on November 28th, that’s fantastic!
Review this blog for preparation details: Prerequisites for the SAP HANA Cloud Machine Learning Challenge “I quit!” – How to prevent employee .... You will need an SAP HANA Cloud instance (free tier or trial are not compatible with the challenge contents), so if you don’t have an adequate set-up yourselves, please reach out to the team at SAPHANACloud@sap.com to get access to a suitable instance.
Looking forward to seeing you join our challenge!
Best regards,
Christoph
2022 Nov 29 7:29 AM
Join our open office hour calls, starting today
During the challenge you can join our experts to get your open questions answered and to address any issues you are dealing with. Please find the slots offered below:
Or simply ask your questions here in the forum!
2022 Nov 30 2:35 AM - edited 2022 Nov 30 2:36 AM
Hi,
I would need to check the data uploaded onto the HANA Cloud database. How can I access SAP HANA Cloud Cockpit or SQL console (SAP HANA Data Explorer) for the following instance? Thanks!
8c7eed2e-f460-4a82-abfa-e3ede36923d7.hna1.prod-eu10.hanacloud.ondemand.com
2022 Nov 30 7:19 AM - edited 2022 Nov 30 7:22 AM
Hi, we haven't set up full BTP applications like the HANA Cloud Database Explorer on our side for the participants. However, if you have a local HANA Cockpit / Database Explorer, you can simply establish a SQL connection to the HANA Cloud instance from there. Alternatively, you can use a tool like DBeaver installed locally and connect (see blog). Certainly, from the Python side, once the data is uploaded and you have the dataframe created, you can use dataframe methods to access and check the data; e.g. mydataframe.head(10).collect() would issue a SELECT TOP 10 * FROM <table>. You can see some method examples in Yannick's blog at "How can we explore our data and react to data quality issues early".
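A minimal sketch of that Python check (host, credentials and the table name are placeholders, not the actual challenge instance details):
from hana_ml.dataframe import ConnectionContext

# connect to the SAP HANA Cloud instance (placeholder connection details)
conn = ConnectionContext(address='<instance>.hanacloud.ondemand.com', port=443,
                         user='<user>', password='<password>', encrypt='true')

# reference an existing table as a HANA dataframe and peek at the uploaded data
mydataframe = conn.table('EMPLOYEE_CHURN_ORG')
print(mydataframe.head(10).collect())  # issues SELECT TOP 10 * FROM the table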
Christoph
2022 Nov 30 10:25 AM
If you have a BTP account, e.g. in BTP Trial or Free Tier, could you try whether the generic DBX opens for you at https://hana-cockpit.cfapps.us10.hana.ondemand.com/hrtt/sap/hana/cst/catalog/index.html?
Alternatively, you can use DBeaver -- an open-source IDE for databases: https://blogs.sap.com/2020/04/20/secure-connection-from-dbeaver-to-sap-hana-cloud/
Regards,
-Witalij
2022 Dec 01 7:50 AM - edited 2022 Dec 01 7:50 AM
As another helper to get everyone started, you can find a further notebook on employee churn data upload and exploration using HANA dataframes in our challenge repository.
2022 Dec 01 11:54 AM - edited 2022 Dec 01 11:56 AM
Hi.
In a pandas dataframe, data can be modified with inplace=True or by assigning the result back to the same dataframe, for example df_data = df_data.replace(to_replace=np.nan, value='MSV').
In a HANA dataframe, collect() allows showing the data, but inplace=True is not supported, and assigning modified data back to the same dataframe is not allowed; for example, data_hdf = data_hdf.fillna('MSV').collect() is not supported.
Does that mean that data has to be modified in pandas and uploaded to HANA again with dataframe.create_dataframe_from_pandas?
collect() shows data. Which methods allow modifying a HANA dataframe? Please recommend notebooks with examples of how to modify a HANA dataframe.
Thanks. Sergiu.
2022 Dec 01 12:21 PM
A HANA dataframe does not store data, unlike a Pandas or R dataframe. I explained this, for example, here: https://blogs.sap.com/2020/11/10/multi-model-in-hana_ml-2.6-for-python-dataframes/
A HANA dataframe in Python represents a SELECT statement to be executed in the SAP HANA database, and it is executed only once a method like `collect()` is called.
There is a method `save()` that creates a table or a db view in the database for the data returned by a HANA dataframe:
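A minimal sketch (the target table name is illustrative):
# fillna only extends the SELECT statement; nothing is executed yet
hdf_filled = data_hdf.fillna('MSV')

# save() materializes the dataframe's result set as a table in the database
hdf_filled.save('EMPLOYEE_CHURN_PREPPED', force=True)

# a fresh HANA dataframe over the persisted table
data_hdf_new = conn.table('EMPLOYEE_CHURN_PREPPED')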
2022 Dec 01 2:06 PM - edited 2022 Dec 01 3:10 PM
Pandas dataframe:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df['C'] = 1
df['C'] = df['C'] + 2
hana-ml Tutorial - Dataframe:
d_train_new = d_train.add_constant('C',1)
I couldn't find in the tutorial how to modify a column; I tried it as in pandas but got an error.
d_train_new['C'] = d_train_new['C'] + 1
TypeError: __getitem__ argument not understood: 'C'
Please help with examples of how to modify a HANA dataframe column.
2022 Dec 01 3:59 PM
You can simply use the select dataframe method to create new columns:
df.select('*', ('B * 4', 'D')).collect()
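Since a HANA dataframe is immutable, "modifying" a column can be expressed the same way: select an expression and reuse the column name as the alias. A sketch (columns A and B stand in for the remaining columns):
# re-derive column C from itself; this creates a new dataframe, the underlying table is unchanged
d_train_mod = d_train_new.select('A', 'B', ('"C" + 1', 'C'))
d_train_mod.head(5).collect()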
2022 Dec 01 12:08 PM
In the presented notebooks, the key parameter of the HANA model.fit call appears to be mandatory.
In Python machine learning libraries, a key parameter is not necessary.
What is the role of the key in the HANA libraries?
MODEL_SEARCH.fit(data=d_train, key=ID,
label=target,
partition_method='stratified',
partition_random_state=1,
stratified_column=target,
build_report=False)
Thanks. Sergiu.
2022 Dec 01 12:53 PM
Hi Sergiu, key is actually not mandatory when fitting a model. However, if the input dataframe to a fit call contains an ID column, you either have to drop it from the dataframe beforehand, or identify the column as the key / ID column so it's not treated as a feature column. With the predict / score methods, e.g. of unified classification, an ID column is required.
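A sketch of the two options (model, EMPLOYEE_ID and FLIGHT_RISK are illustrative names):
# either drop the ID column before fitting ...
model.fit(data=d_train.drop(['EMPLOYEE_ID']), label='FLIGHT_RISK')
# ... or declare it as the key so it is not treated as a feature
model.fit(data=d_train, key='EMPLOYEE_ID', label='FLIGHT_RISK')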
Hope this explains.
Christoph
2022 Dec 02 5:20 PM - edited 2022 Dec 04 4:38 PM
Estimate car price - Load data to SAP HANA
Pandas:
# df - pandas dataframe
df_data.insert(0, 'CAR_ID', df_data.reset_index().index)
OR
df_data[COL_CAR_ID] = df_data.index
Please help with examples of how to copy the index to a column in hana_ml.dataframe.
Thanks. Sergiu.
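One possible approach, sketched under the assumption that hana-ml's DataFrame.add_id method is available; it appends a generated sequential integer key column (table name is illustrative):
df_remote = conn.table('USED_CARS')
# add_id() returns a new dataframe with a sequential ID column prepended (starting at 1, not 0)
df_with_id = df_remote.add_id(id_col='CAR_ID')
df_with_id.head(5).collect()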
2022 Dec 04 4:36 PM - edited 2022 Dec 04 4:39 PM
2022 Dec 05 7:23 AM
2022 Dec 05 10:52 AM
Today we share additional data, which was discovered by our data engineer. The additional data includes, for example, whether an employee recently updated their LinkedIn profile or took part in an HR training. During the open office calls this week on Tuesday/Thursday (see this post for timings), we will also demo how to use Unified Classification with PAL Hybrid Gradient Boosting (see example) and the new PAL AutoML-Classification feature (see example, blog, community call introduction, more automl examples), as two additional approaches to tackle the employee churn challenge and achieve the best results …
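As a teaser, a hedged sketch of the AutoML approach (parameter values are illustrative; df_train, df_test, EMPLOYEE_ID and FLIGHT_RISK are assumptions about the dataset's dataframes, key and label):
from hana_ml.algorithms.pal.auto_ml import AutomaticClassification

# small evolutionary search budget, just for illustration
auto_c = AutomaticClassification(generations=2, population_size=5, offspring_size=5,
                                 progress_indicator_id='automl_churn_demo')
auto_c.fit(data=df_train, key='EMPLOYEE_ID', label='FLIGHT_RISK')
res = auto_c.predict(df_test.drop(['FLIGHT_RISK']), key='EMPLOYEE_ID')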
Have fun working on the challenge!
2022 Dec 06 7:55 AM
Join us in 5 minutes for a short demo and to address your open questions in our open office hour.
2022 Dec 06 8:06 AM
2022 Dec 06 8:08 AM
Hi Sergiu, Please try this link: https://sap-se.zoom.us/j/96194777489?pwd=NmVIRHlFcUEvT3RtS3VhdTgySU5pZz09
2022 Dec 08 9:06 AM
Today we have our second open office session (see this post for timings). Have you already included the additional data to improve your model? Maybe HRTRAINING, SICKDAYS, ... add additional insights; you should try it out!
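If the additional data lands in a separate table, a join sketch like the following could combine it with the original employee data (table and column names are assumptions for illustration):
df_base = conn.table('EMPLOYEE_CHURN_ORG')
df_add = conn.table('EMPLOYEE_CHURN_ADD')
# join the additional attributes onto the base data via the employee key
df_full = df_base.alias('L').join(df_add.alias('R'), 'L.EMPLOYEE_ID = R.EMPLOYEE_ID',
                                  select=['L.EMPLOYEE_ID', 'L.AGE', 'R.HRTRAINING', 'R.SICKDAYS'])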
2022 Dec 05 3:23 PM - edited 2022 Dec 05 3:29 PM
hana_ml.dataframe - replace allows replacing values in the same column based on a dictionary.
Examples
>>> df.collect()
   Aa  Bb Cb
0   0   5  a
1  10   0  b
2   2   7  c
3   3   8  d
4   4   9  e
>>> df.replace(to_replace={'Aa': {0: 100, 4: 400}}).collect()
    Aa  Bb Cb
0  100   5  a
1   10   0  b
2    2   7  c
3    3   8  d
4  400   9  e
Pandas
>>> df = pd.DataFrame([10,20,30,40,50], columns =['speed'])
>>> df.loc[(df["speed"] <= 20), "gear"] = 'd1'
>>> df.loc[(df["speed"] > 20) & (df["speed"] <= 40), "gear"] = 'd2'
>>> df.loc[(df["speed"] > 40), "gear"] = 'd3'
>>> df
   speed gear
0     10   d1
1     20   d1
2     30   d2
3     40   d2
4     50   d3
Please help with examples for hana_ml.dataframe on how to create an additional column with values based on conditions over other columns, as in pandas.
Thanks. Sergiu.
2022 Dec 06 11:06 AM - edited 2022 Dec 06 12:26 PM
df = pd.DataFrame([10,20,30,40,50], columns =['SPEED'])
df.loc[(df["SPEED"] <= 20), "GEAR"] = 'd1'
df.loc[(df["SPEED"] > 20) & (df["SPEED"] <= 40), "GEAR"] = 'd2'
df.loc[(df["SPEED"] > 40), "GEAR"] = 'd3'
df
   SPEED GEAR
0     10   d1
1     20   d1
2     30   d2
3     40   d2
4     50   d3
hana_tab = "T_GEAR"
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, pandas_df = df,
table_name = hana_tab,
force = True,
replace = False)
df_remote_new = df_remote.select (['SPEED','GEAR',("""CASE WHEN SPEED <=20 THEN 'g1'
WHEN SPEED >20 AND SPEED <=40 THEN 'g2'
WHEN SPEED >40 THEN 'g3'
END""",'GEAR_HANA')])
df_remote_new.collect()
   SPEED GEAR GEAR_HANA
0     10   d1        g1
1     20   d1        g1
2     30   d2        g2
3     40   d2        g2
4     50   d3        g3
2022 Dec 06 11:14 AM
Great to see pivoting from pandas dataframe code towards the HANA dataframe approach! 🙂 Excellent!
2022 Dec 06 11:18 AM
In one of the open office hours a question came up on creating new features out of the existing columns, without data extraction.
Here is an example that bundles Finland, Sweden and Norway into Scandinavia in a new column:
df_remote = conn.table('EMPLOYEE_CHURN_ORG')
df_rem_prepped = df_remote.select(['EMPLOYEE_ID', 'AGE', \
("""CASE WHEN CURRENT_COUNTRY IN ('Sweden', 'Norway', 'Finland') THEN 'Scandinavia'
WHEN CURRENT_COUNTRY = 'UK' THEN 'United Kingdom'
WHEN CURRENT_COUNTRY IS NULL THEN 'Unknown'
ELSE CURRENT_COUNTRY END""", 'REGION_LOCAL')])
df_rem_prepped.select('REGION_LOCAL').distinct().collect()
2022 Dec 06 11:41 AM - edited 2022 Dec 06 11:41 AM
How can I include the result columns from a list, ls_col = ['EMPLOYEE_ID', 'AGE'], instead of writing them directly in select?
2022 Dec 06 12:00 PM
You can make use of Python string formatting like str.format() or f-strings (Python background here) to apply parameters with hana-ml / dataframe methods.
See a hana-ml example here (not an exact match to the ls_col case):
id = '####'
hdf_new = hdf_remote.filter('"uuid" = \'{}\''.format(id)).sort('date', desc=False)
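A sketch closer to the ls_col case: unpack the list so each column name becomes its own select argument, and parameterize the expression with an f-string (the AGE threshold and the new column are illustrative):
ls_col = ['EMPLOYEE_ID', 'AGE']
age_limit = 40
df_new = df_remote.select(*ls_col,
                          (f"CASE WHEN AGE <= {age_limit} THEN 'younger' ELSE 'older' END",
                           'AGE_GROUP'))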
2022 Dec 06 12:04 PM - edited 2022 Dec 06 12:24 PM
ls_col = ['SPEED', 'GEAR']
# unpack the list so each column name is passed as its own argument (removes the [ and ])
df_remote_new = df_remote.select(*ls_col, ("""CASE WHEN SPEED <=20 THEN 'g1'
WHEN SPEED >20 AND SPEED <=40 THEN 'g2'
WHEN SPEED >40 THEN 'g3'
END""", 'GEAR_HANA'))
# include all fields
df_remote_new = df_remote.select('*', ("""CASE WHEN SPEED <=20 THEN 'g1'
WHEN SPEED >20 AND SPEED <=40 THEN 'g2'
WHEN SPEED >40 THEN 'g3'
END""", 'GEAR_HANA'))
2022 Dec 07 1:03 PM - edited 2022 Dec 07 2:45 PM
HANA Cloud - Predictive Analysis Library Hands-On
from hana_ml.visualizers.dataset_report import DatasetReportBuilder
datasetReportBuilder = DatasetReportBuilder()
datasetReportBuilder.build(df_remote, key="PRODUCT_ID")
ERROR:
File ~\Anaconda3\envs\cln_ml_hana\lib\site-packages\hana_ml\visualizers\dataset_report.py:480, in DataAnalyzer.get_numeric_variable_distribution_data(self, column, bins)
    478 if bins > 1:
    479     bins = bins - 1
--> 480 _, _, bin_data = distribution_plot(data=data_,
    481                                    column=column,
    482                                    bins=bins)
    483 x_data = list(bin_data['BANDING'])
    484 y_data = list(bin_data['COUNT'])
File ~\Anaconda3\envs\cln_ml_hana\lib\site-packages\hana_ml\visualizers\eda_plotly.py:138, in distribution_plot(data, column, bins, title, x_axis_label, y_axis_label, x_axis_fontsize, x_axis_rotation, debrief, rounding_precision, replacena, fig, subplot_pos, **kwargs)
    136     bin_data = bin_data.append({"BANDING": item, "COUNT": 0}, ignore_index=True)
    137 bin_data.sort_values(by="BANDING", inplace=True)
--> 138 trace = go.Bar(x=bin_data['BANDING'], y=bin_data['COUNT'])
    139 if fig:
    140     fig.add_trace(trace, row=subplot_pos[0], col=subplot_pos[1])
NameError: name 'go' is not defined
Please help with a solution for the above error.
Thanks. Sergiu.
2022 Dec 07 3:47 PM
My suggestion would be to make sure plotly is working properly. Try adding the following code line
import plotly.graph_objs as go
Also, check to make sure that ipywidgets is installed.
2022 Dec 08 9:22 AM
Hi Sergiu, we've just released a new hana-ml fix version on PyPI to address the observed plotly dependency issue.
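In a notebook environment, upgrading should pick up the fix:
!pip install --upgrade hana-ml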
Christoph
2022 Dec 08 10:53 AM - edited 2022 Dec 08 10:54 AM
#| top_k_attributions : int, optional
#| Specifies the number of features with highest attributions to output
>>> result = uc_rdt.predict(df_predict, key = "ID", top_k_attributions=10)
Please help me interpret why top_k_attributions=1 and top_k_attributions=10 calculate the same CONFIDENCE and output the same number of features in shapley_explainer.
2022 Dec 08 11:29 AM
The confidence is the overall predicted classification decision confidence (over all features). As we may have a large number of input features (e.g. >100), we want to restrict the output of the feature values and their relative contributions to the local predicted classification decision to the most important ones, by default the top 10. With top_k_attributions=1 you would only output the single most important feature and its relative importance.
Example: predict with REASON_CODE and top_k_attributions for the Predictive Quality tutorial example
# Explore test data-subset predictions applying the trained model
features = hdf_test.columns
features.remove('QUALITY')
features.remove('PRODUCT_ID')

# Using the predict-method with our model object hgbc
pred_res = hgbc.predict(hdf_test.head(1000), key='PRODUCT_ID', features=features,
                        impute=True, top_k_attributions=3)

# Review the predicted results
pred_res.select('PRODUCT_ID', 'SCORE', 'CONFIDENCE', 'REASON_CODE').head(3).collect()
2022 Dec 08 11:29 AM - edited 2022 Dec 08 11:33 AM
I suppose the confusion matrix from UnifiedReport(model).build().display() is built from d_train (seen data), and only from part of d_train: confusion matrix total = 20% * d_train.count(). How can all of d_train be included, or the percentage increased?
2022 Dec 08 12:44 PM - edited 2022 Dec 08 12:44 PM
Referring to our tutorial code example about predictive quality
print('Size of val subset: ' + str(hdf_val.count()))
Size of val subset: 12500
If I now print the confusion matrix after the fit
display(hgbc.confusion_matrix_.collect())
ACTUAL_CLASS | PREDICTED_CLASS | COUNT
0            | 0               |  9298
0            | 1               |   704
1            | 0               |   887
1            | 1               |  1611
That's exactly the 12500 we passed in to the fit as the validation subset, along with the 100k rows passed in to train the model.
2022 Dec 10 9:31 AM
PAL Tutorial - Unified Classification Hybrid Gradient Boosting - PredictiveQuality Example.ipynb
I couldn't find the hgbc.fit cell with the validation set hdf_val.
Please update the notebook.
2022 Dec 12 8:35 AM
If you download the notebook: in the hgbc.fit call, we input the union of hdf_val and hdf_train, and with partition_method we specify that we've predefined the validation partition used during fit-validation. The fit method in unified classification validates on a predefined validation subset, or it can use part of the training dataframe.
hgbc.fit(data=df_trainval,
key='PRODUCT_ID',
...
partition_method='user_defined', purpose='TRAIN_VAL_INDICATOR')
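As an alternative sketch, fit can also split a validation subset from the training data itself (stratified random split; the training_percent value is illustrative):
hgbc.fit(data=hdf_train,
         key='PRODUCT_ID',
         label='QUALITY',
         partition_method='stratified',
         stratified_column='QUALITY',
         training_percent=0.8)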
2022 Dec 07 2:24 PM - edited 2022 Dec 07 2:46 PM
Please help with examples of how to extract feature contributions from UnifiedClassification('HybridGradientBoostingTree').
In other models set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'}) can be used.
Thanks. Sergiu.
2022 Dec 07 3:09 PM - edited 2022 Dec 07 3:11 PM
The UnifiedClassification predict method supports attribution_method (Saabas or Tree-SHAP) and top_k_attributions to output with the predict call. Here you can find an example:
# Using the predict-method with our model object hgbc
pred_res = hgbc.predict(df_test.head(1000), key='uuid', features=features, impute=True)

# Review the predicted results
pred_res.select('uuid', 'SCORE', 'CONFIDENCE', 'REASON_CODE',
                ('json_query("REASON_CODE", \'$[0].attr\')', 'Top1'),
                ('json_query("REASON_CODE", \'$[0].pct\')', 'PCT_1')).head(3).collect()