Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
shivamshukla12
Contributor
A small write-up on the SAP HANA ML DataFrame. Writing about what you learn alongside your day-to-day job is a rewarding way to share knowledge, so, holding my passion for technology in both hands, here comes my first post of 2020. The topic is an interesting one, and everyone's favorite: SAP HANA Machine Learning, and this time it's about the DataFrame.

DataFrame methods, what they mean, and their Python implementation.

Let’s go step by step.

HANA ML Dataframe - A Skeleton for data

  • Represents a frame that is backed by a database SQL statement; it can also be created from a table.

  • The SAP HANA DataFrame provides a set of methods for analyzing the data in SAP HANA without bringing the data to the client.


This module represents a database query as a DataFrame. Most operations are designed not to bring data back from the database unless explicitly requested.



DataFrame Methods - 

Import dataframe library from hana_ml package.
import hana_ml
from hana_ml import dataframe ##import dataframe from HANA ML

Before diving deep into DataFrames, let's first connect to the SAP HANA system and load some data into a DataFrame for manipulation, filtering, slicing, etc.
from data_load_utils import DataSets , Settings
url , port , usr , pwd = Settings.load_config(r"C:\Users\abc\config\e2edata.ini")

 

ConnectionContext - This represents a connection to the SAP HANA system. Let's code this:
connection_context = dataframe.ConnectionContext(url, port, usr, pwd) ##pass host, port, user ID, and password for the connection

connection_context.connection.isconnected() ##if connected, you will see True in the console as output

 

Load data - Create a DataFrame from a table; if you choose the table method, pass the table name of the corresponding schema.
dataset1 = connection_context.table("ADULT_DATA_FULL_TBL") ##Table Name of Schema 

##Another way of achieving it

dataset1 = connection_context.sql("SELECT * FROM ADULT_DATA_FULL_TBL") ##SQL Query for dataframe

This returns a DataFrame which is just a skeleton; it doesn't hold any data yet. Most operations on DataFrames are optimized to execute inside SAP HANA on the fly, which results in a huge performance benefit.
dataset1.select_statement ##This is the select statement used to create the DataFrame; its meaning is:
Output
'SELECT * FROM "ADULT_DATA_FULL_TBL"'

 

Execute the dataset1 Jupyter cell and the output is the hana_ml DataFrame. It doesn't contain data as of now; we need to call the .collect() method, which executes the select statement and gives the result back to the client (in our case, a Jupyter notebook or Python environment).



count() - Returns the number of records in the DataFrame.
dataset1.count()

Output
48822

collect() - This triggers the select statement on the DataFrame and fetches the data from SAP HANA. Use it when you actually need the data in the Python environment to display, validate, or process it further; the design goal is to perform calculations on the fly in the database and hand only the result back to Python.

example:
dataset1.head(10).collect() ##Use the collect method to load data from SAP HANA into the client; this fetches the first 10 records

head(10) means the top 10 records.

Result: the first 10 records are displayed.



columns - List the columns in the dataset.
dataset1.columns

Output
['ID',
'AGE',
'WORKCLASS',
'FNLWGT',
'EDUCATION',
'EDUCATIONNUM',
'MARITALSTATUS',
'OCCUPATION',
'RELATIONSHIP',
'RACE',
'CAPITALGAIN',
'CAPITALLOSS',
'HOURSPERWEEK',
'NATIVECOUNTRY',
'INCOME']

 

Limit rows during display - just pass the lower and upper limit in square brackets (on the collected pandas result); this is useful to get a quick overview of the data, like sampling.
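The original screenshot is not reproduced here, so here is a minimal self-contained sketch of this slicing with hypothetical values; the real rows would come from dataset1.head(100).collect() against the ADULT table in SAP HANA.

```python
import pandas as pd

# Hypothetical sample standing in for dataset1.head(100).collect();
# real values would come from the ADULT table in SAP HANA.
dataset3 = pd.DataFrame({"ID": [28, 75, 196, 264, 303],
                         "AGE": [27, 57, 48, 52, 35]})

# Pass the lower and upper limit in square brackets; the upper
# bound is exclusive, so this displays rows 1 and 2.
print(dataset3[1:3])
```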



Copy of a DataFrame into a new DataFrame - for the case when you don't want to change/alter the original DataFrame (note that collect() returns a pandas DataFrame at the client).
dataset3 = dataset1.head(100).collect()

dataset3 ##it will display the 100 records copied from dataset1

 

Drop duplicates (data cleansing) - drop duplicates from the dataset; this is required when you feed the data into algorithms that need unique entries only.
dataset4 = dataset1.drop_duplicates() 

##Check the select statement for dataset4

dataset4.select_statement
'SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'

When removing duplicates, the on-the-fly select statement selects only the unique records from the SAP HANA database table, so your original table in the SAP HANA database remains intact.

Drop columns from the dataset - drop the columns which are not required; this is used to remove useless columns so that algorithms perform better and regression/classification results are more accurate.
dataset5 = dataset1.drop(["AGE"])

dataset5.select_statement ##There is no AGE column here; this is also called slicing of data

'SELECT "ID", "WORKCLASS", "FNLWGT", "EDUCATION", "EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK", "NATIVECOUNTRY", "INCOME" FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'

 

So every time you perform an operation on a dataset using any method, it prepares the corresponding select statement, executes it on the fly, and gives the result back to the client.

Select the TOP 10 records from the dataset, but for two columns only - 

The select statement on the DataFrame contains only two columns, and only 10 records are selected once the collect method is called.
dataset6 = dataset1.select('ID', 'WORKCLASS').head(10) ##add more columns to select() as needed, e.g. 'AGE', 'HOURSPERWEEK', 'RACE'

dataset6.select_statement ##Dynamic select statement
'SELECT TOP 10 * FROM (SELECT "ID", "WORKCLASS" FROM (SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6") dt'

dataset6.collect()

ID WORKCLASS
0 28 Private
1 75 Private
2 196 Private
3 264 Private
4 303 Private
5 324 Private
6 390 Private
7 442 Private
8 447 Private
9 472 Private

 

NAs in the dataset - check for NULL values in the dataset; this is required before feeding the data into any algorithm: drop the nulls or fill in an average value.
dataset1.hasna('CAPITALGAIN') ##Check whether there is any NULL value in the CAPITALGAIN column; it returns True or False

Output
True

Fill NAs with a chosen value:

dataset1 = dataset1.fillna(10 , ["CAPITALGAIN"])
dataset1.hasna('CAPITALGAIN')

Output
False

 

Operate on the DataFrame and do some manipulation on fields - 

A new column TWICE_AGE has been added, holding the original age multiplied by 2.
dsp = dataset1.select('ID', 'AGE', ('"AGE"*2', "TWICE_AGE")) ##the computed column is added via an (expression, alias) tuple

dsp.head(5).collect()

Output -
ID AGE TWICE_AGE
0 1 71 142
1 2 44 88
2 3 39 78
3 4 27 54
4 5 44 88

 

Check out the select statement of the above DataFrame (dsp):
dsp.select_statement

Output

'SELECT "ID", "AGE", "AGE"*2 AS "TWICE_AGE" FROM
(SELECT "ID", "AGE", "WORKCLASS", "FNLWGT", "EDUCATION",
"EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION",
"RELATIONSHIP", "RACE", COALESCE("CAPITALGAIN", 10)
AS "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK",
"NATIVECOUNTRY", "INCOME"
FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") dt) AS "DT_18"'

 

Data Filtration methods - 
dataset4.head(5).collect()  ## Select 5 Sample records from the dataset 

Output

ID AGE WORKCLASS FNLWGT EDUCATION EDUCATIONNUM MARITALSTATUS OCCUPATION RELATIONSHIP RACE GENDER CAPITALGAIN CAPITALLOSS HOURSPERWEEK NATIVECOUNTRY INCOME
0 41959 39 Private 286789 Doctorate 16 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States >50K
1 33040 22 Private 349212 Some-college 10 Never-married Other-service Own-child White Female 0 0 20 United-States <=50K
2 46142 53 Private 191389 HS-grad 9 Divorced Exec-managerial Not-in-family White Female 0 0 40 United-States <=50K
3 1319 37 Private 162424 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 45 United-States <=50K
4 35823 56 Private 200316 Some-college 10 Married-civ-spouse Adm-clerical Husband White Male 0 0

 

Query - we want the records having AGE greater than 60:
dataset_age = dataset4.filter('AGE > 60') ## records AGE greater than 60

dataset_age.collect()
Output:

dataset_age.count()
Output:
3606

dataset_age.select_statement
Output:
'SELECT * FROM (SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_FULL_TBL")
AS "DT_6") AS "DT_20" WHERE AGE > 60'

Output: 3606 records displayed.



Add Multiple conditions for multiple columns - 
dataset4_res = dataset4.filter('AGE > 70 AND EDUCATIONNUM = 8')
dataset4_res.select_statement

Output:
'SELECT * FROM (SELECT DISTINCT * FROM
(SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6")
AS "DT_20"
WHERE AGE > 70 AND EDUCATIONNUM = 8'

Here multiple conditions have been added; see the WHERE clause of this dynamic select statement.

dataset4_res.count()

Output:
7

Only 7 records match AGE > 70 and EDUCATIONNUM = 8.


## Suppose we want those records where EDUCATIONNUM = 9 

dataset4.filter('EDUCATIONNUM = 9').head(5).collect()


## Suppose we want those records where AGE = 27

dataset4.filter('AGE = 27').head(5).collect()

 

Select statement for filter - filter condition plus selection of three columns; let's see what select statement gets built dynamically.

Python Script -
df = connection_context.table('ADULT_DATA_FULL_TBL', schema='DM_PAL').filter('AGE>70').select('ID', 'AGE' , 'WORKCLASS')

df.select_statement
Output:
'SELECT "ID", "AGE", "WORKCLASS" FROM
(SELECT * FROM (SELECT * FROM "DM_PAL"."ADULT_DATA_FULL_TBL")
AS "DT_16" WHERE AGE>70) AS "DT_17"'

See the dynamically prepared select statement.

df.collect()
Output:
ID AGE WORKCLASS
0 9526 90 Private
1 29221 90 Private
2 15578 71 Private
3 25061 74 Private
4 6922 76 Private
... ... ... ...
863 35161 71 Self-emp-not-inc
864 35964 75 Local-gov
865 37604 72 ?
866 39221 73 Self-emp-not-inc
867 48214 74 Private
868 rows × 3 columns

 

Multiple filter conditions on a pandas DataFrame - 
dataset3[(dataset3['AGE'] > 27) & (dataset3['OCCUPATION'] == 'Craft-repair')]

If you have a pandas DataFrame at the client and you want to filter records on the client side only, the above filter syntax will suffice.

Output :


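Since the output screenshot is not reproduced here, this is a minimal self-contained sketch of the same client-side filter with hypothetical rows standing in for the collected ADULT sample:

```python
import pandas as pd

# Hypothetical rows standing in for dataset3 (the collected ADULT sample).
dataset3 = pd.DataFrame({
    "ID": [1319, 33040, 46142],
    "AGE": [37, 22, 53],
    "OCCUPATION": ["Craft-repair", "Other-service", "Exec-managerial"],
})

# Combine conditions with & and parenthesize each comparison,
# otherwise Python's operator precedence raises an error.
mask = (dataset3["AGE"] > 27) & (dataset3["OCCUPATION"] == "Craft-repair")
print(dataset3[mask])
```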

 

Slicing of a DataFrame - 

Slice a few columns from the dataset:
## Copy the original dataset into a new one (collect() returns a pandas DataFrame)
new_dataset1 = dataset1.head(5).collect()

new_dataset1[['ID' , 'AGE']]

Output:
ID AGE
0 28 27
1 75 57
2 196 48
3 264 52
4 303 35

Copy the sliced data into a third one:
new_dataset2 = new_dataset1[['ID' , 'AGE']]

new_dataset2
Output:
ID AGE
0 28 27
1 75 57
2 196 48
3 264 52
4 303 35

Create a DataFrame for a few columns only - 
dataset_id_age = connection_context.sql('SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL')

dataset_id_age.collect()

Output:
ID AGE
0 28 27
1 75 57
2 196 48
3 264 52
4 303 35
5 324 24
6 390 34
7 442 49
8 447 35
9 472 42

dataset_id_age.select_statement
'SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL'

In short, a DataFrame is just a select statement that waits to be triggered until it is called (for example via collect()).
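This lazy behaviour can be illustrated without a HANA system using a toy class (a sketch of the idea, not hana_ml internals): each method call only rewrites the SQL string, and nothing executes until collect() would send it to the database.

```python
# Toy illustration of the lazy-DataFrame idea (not hana_ml internals):
# every method only composes a new SELECT statement around the old one.
class LazyFrame:
    def __init__(self, select_statement):
        self.select_statement = select_statement

    def filter(self, condition):
        # Wrap the current statement in a subquery with a WHERE clause.
        return LazyFrame(
            'SELECT * FROM ({}) AS "DT" WHERE {}'.format(
                self.select_statement, condition))

    def select(self, *cols):
        # Project a subset of columns from the current statement.
        col_list = ", ".join('"{}"'.format(c) for c in cols)
        return LazyFrame(
            'SELECT {} FROM ({}) AS "DT"'.format(
                col_list, self.select_statement))

df = LazyFrame('SELECT * FROM "ADULT_DATA_FULL_TBL"')
composed = df.filter('AGE > 70').select('ID', 'AGE')
print(composed.select_statement)  # no data has been fetched at this point
```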

 

describe() - Describe the DataFrame with all the statistical information (counts, nulls, min, max, mean, etc.).
dataset4.describe().collect() 



empty() - Returns True if this DataFrame has 0 rows
dataset2.empty()

Output:
False

has() - Returns True if a column is in the DataFrame.
dataset1.columns

Output:
['ID',
'AGE',
'WORKCLASS',
'FNLWGT', 'EDUCATION','EDUCATIONNUM',
'MARITALSTATUS',
'OCCUPATION',
'RELATIONSHIP',
'RACE',
'GENDER',
'CAPITALGAIN',
'CAPITALLOSS',
'HOURSPERWEEK',
'NATIVECOUNTRY',
'INCOME']


dataset1.has('ID')

Output:
True

 

join() - Join two datasets based on a condition such as ID = ID:
condition = '{}."ID"={}."ID"'.format(dataset4.quoted_name, dataset2.quoted_name)

## Display Condition

condition

Output:
'"DT_7"."ID"="DT_3"."ID"'

## Let's join dataset4 and dataset2 and copy the result into a new dataset7
dataset7 = dataset4.join(dataset2, condition)

dataset7.head(5).collect()
Output:




Save a DataFrame with save() - 
dataset4.save('#SAVEDATA')

## retrieve the saved dataframe into a new dataframe
new_dataset4 = connection_context.table('#SAVEDATA')

##Fetch first 10 records and display
new_dataset4.head(10).collect()

Output:

10 Records displayed


new_dataset4.select_statement ##the data now comes from a temporary table (or possibly a pointer to the original table)

'SELECT * FROM "#SAVEDATA"'

 

AGGREGATE FUNCTIONS - one of the most important topics under the DataFrame methods. Calculating an average age, summing up prices, counting items sold: requirements like these are what such functions were invented for. Let's code and see how they work.
## Suppose we need to find out the average age of employees working in a specific department

agg_list = [('AVG' , 'AGE' , 'AVG_AGE')] ##aggregate function, source column, and new column name

dataset_avg_age = dataset4.agg(agg_list=agg_list , group_by='WORKCLASS').collect()

## Display the results

dataset_avg_age

Grouped by WORKCLASS; the average age of employees in each group is displayed here.


## Count the number of people of same age

dataset_age_count = dataset4.agg(agg_list=[('COUNT' , 'ID' , 'ID_COUNT')] , group_by='AGE').collect()

dataset_age_count

Output:

from IPython.display import HTML
HTML(dataset_age_count.head(5).to_html())
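For comparison, the same COUNT-by-AGE aggregation can be expressed client-side in plain pandas on hypothetical sample data; the hana_ml agg() call above instead pushes this GROUP BY down into SAP HANA, so only the small result travels to the client.

```python
import pandas as pd

# Hypothetical sample standing in for the ADULT data in SAP HANA.
df = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                   "AGE": [27, 27, 35, 35, 35]})

# Client-side equivalent of agg([('COUNT', 'ID', 'ID_COUNT')], group_by='AGE'):
age_count = (df.groupby("AGE")["ID"].count()
               .reset_index(name="ID_COUNT"))
print(age_count)
```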


## Maximum AGE of employees in each group

dataset4.agg([('max', 'AGE', 'MAX_AGE')], group_by='WORKCLASS').collect()

Output:


## MIN Age of employees in each workclass

dataset4.agg([('min', 'AGE', 'MIN_AGE')], group_by='WORKCLASS').collect()

Output:



 

corr() - find the correlation between two columns of a dataset:
dataset4.corr('ID' , 'AGE').collect()



 

Create a DataFrame from pandas - this creates an SAP HANA DataFrame that contains the data of a pandas DataFrame.
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}

df = pd.DataFrame(data=d)

df_new = hana_ml.dataframe.create_dataframe_from_pandas(connection_context, df, 'pd_df', force=False, replace=True)

df ##Pandas dataframe

Output:
col1 col2
0 1 3
1 2 4

df_new.collect() ##HANA Ml dataframe

Output:
col1 col2
0 1 3
1 2 4

 

Rename column:
dataset1.columns
Output:
['ID',
'AGE',
'WORKCLASS',
'FNLWGT',
'EDUCATION',
'EDUCATIONNUM',
'MARITALSTATUS',
'OCCUPATION',
'RELATIONSHIP',
'RACE',
'GENDER',
'CAPITALGAIN',
'CAPITALLOSS',
'HOURSPERWEEK',
'NATIVECOUNTRY',
'INCOME']

Rename Column ID :
dataset1 = dataset1.rename_columns({'ID' :'ID_NEW'})

dataset1.columns
Output:
['ID_NEW',
'AGE',
'WORKCLASS',
'FNLWGT',
'EDUCATION',
'EDUCATIONNUM',
'MARITALSTATUS',
'OCCUPATION',
'RELATIONSHIP',
'RACE',
'GENDER',
'CAPITALGAIN',
'CAPITALLOSS',
'HOURSPERWEEK',
'NATIVECOUNTRY',
'INCOME']

Dataframe API - https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.04/en-US/hana_ml.dataframe.html

I have covered almost all of the DataFrame methods, but a few are still left to explore; I will keep sharing other methods in my next blog.

 

Thank you for reading 🙂

 

PS: I am continuously updating the blog by adding new methods and any new learning 🙂