Welcome to our blog on the SAP Data Warehouse Cloud Script Operator! As of the current release, the Script Operator combines the usability of the well-known Python libraries Pandas and NumPy with the capabilities of SAP Data Warehouse Cloud to create Data Flows and individual views of information. It is a versatile operator that can be used for a wide range of tasks, such as data cleansing, data transformation, and many more. In this blog, we dive into these capabilities with examples and best practices for SAP Data Warehouse Cloud projects.

The content of this blog post originates from a cooperation with niga0001 and s.morio.

Note! You can jump to any task in this post from the Agenda below. Each one works as a standalone solution to fulfil the underlying task.

Task 0 - Possibilities of the SAP Data Warehouse Cloud Script Operator


Task 1 - Classify numerical data as bins


Task 2 - Datatype conversion from Datetime to String


Task 3 - Extraction of Substrings


Task 4 - Generating unique Hash Keys


Task 5 - Unpivot Data


Task 6 - Filter with regular expressions



Task 0 - Possibilities of the SAP Data Warehouse Cloud Script Operator


Supported functions in Script Operator


The Script Operator provides an interface to use Python within a SAP Data Warehouse Cloud Data Flow. The included libraries are restricted to Pandas, NumPy and several built-in module operations, and even within these libraries there are some restrictions, as shown in the operator's documentation. Data coming through a data flow is passed into the Script Operator as a pandas DataFrame, so every entry can be reached through its column keys. When using one of the included libraries, no prefix in the form of pd.DataFrame is necessary to reach a DataFrame method. For example, to simply add 10 to every data entry, it is enough to call data.add(10) in the operator's editor. If you are new to SAP Data Warehouse Cloud or the Script Operator, the video below can help you find the code editor for your script operations.



The Script Operator


The SAP Data Warehouse Cloud Script Operator lets you write code in a dedicated Python editor while allowing you to delete or add columns in a panel below. Keep in mind that every calculated column populated through the script needs to be added manually at design time of the underlying data flow. As of the current release, there is no way to create new columns at runtime.

Every script has the same structure, beginning with a function definition for the transformation of the passed data as a data frame. To exit the script and save the results, you need a return statement that hands the data frame back to the data flow. Like every SAP Data Warehouse Cloud data flow, you need a Source Table and a Target Table. After deploying the Target Table, there is an option to decide whether data should be appended, truncated, or deleted. If the script only produces new values, append works correctly, since it can only add new data to the target. If the intent is to overwrite existing data, choose truncate so that the existing information is deleted first and the entries are refilled with new values. The script below shows the basic structure of a Python script within SAP Data Warehouse Cloud.
def transform(data):

    # Fill in your script here, using data as the source
    # and writing the results back into data as the output

    return data
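For instance, the data.add(10) example mentioned earlier fits into this structure as follows - a minimal sketch, assuming a source table whose columns are all numeric:
def transform(data):

    # data arrives as a pandas DataFrame, so its methods can be called directly on the batch
    data = data.add(10)

    return data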

Limitations of SAP Data Warehouse Cloud


As of today, SAP Data Warehouse Cloud has some major limitations when working with the Script Operator. Functions that require the dataset as a whole cannot be processed, because SAP Data Warehouse Cloud processes the data in batches. Batches have a maximum size of less than 100,000 entries; the system otherwise picks an ideal batch size on its own, and a manually entered batch size is only used if it is smaller than the system's number. Therefore, it is not possible to simply drop duplicates with the corresponding pandas function: if the two occurrences of a duplicate end up in different batches, they are never compared, and the result still contains the duplicates. The preferred approach for this task is to use a SQL script.
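The effect can be illustrated with a small standalone sketch in plain pandas outside SAP Data Warehouse Cloud, using made-up data and a hypothetical batch split:
import pandas as pd

full = pd.DataFrame({'ID': [1, 2, 2, 3]})
# hypothetical batch split: the two occurrences of ID 2 land in different batches
batches = [full.iloc[:2], full.iloc[2:]]
deduped = pd.concat(batch.drop_duplicates() for batch in batches)
print(len(deduped))  # 4 - neither batch sees both occurrences of ID 2, so the duplicate survives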

Another issue with SAP Data Warehouse Cloud is the requirement that all column keys exist before the script is executed; as described before, it is not possible to create new columns dynamically while the script runs. One of the approaches affected by this limitation is the pivot functionality. There are several workarounds: for example, if you want to pivot from a Year and a Value column, a simple CSV file created in MS Excel with the =SEQUENCE(rows, columns, start, step) function might work. Cross joining this table with the Source Table results in a data frame containing every column necessary.
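For illustration, a rough sketch of the cross-join idea in plain pandas with hypothetical column names (cross merge requires pandas 1.2 or newer; within a data flow, the join itself would typically be modelled with a join operator rather than in the script):
import pandas as pd

source = pd.DataFrame({'COUNTRY': ['DE', 'FR'], 'VALUE': [10, 20]})  # hypothetical source table
years = pd.DataFrame({'YEAR': [2021, 2022, 2023]})                   # e.g. created in Excel with =SEQUENCE(3, 1, 2021, 1)
combined = source.merge(years, how='cross')                          # every source row repeated once per year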

The Importance of the Script Operator


The Script Operator is a good fit for preprocessing operations on data and for separate data manipulation when SQL is not applicable or the user is more familiar with Python. Instead of running several steps to order and reformat data before a SQL script, the Script Operator can do all of it in one step.

Advanced Python on SAP Data Warehouse Cloud Models


In case you are interested in using your data with other Python libraries, or want to use the Pandas functionality that is not applicable here due to the limitations, please check out this blog about Connecting to SAP Data Warehouse Cloud from Python! It documents all the steps and libraries necessary to build a connection to your Space.

Task 1 - Classify numerical data as bins


The idea of this task is to group values into bins with given limits. Let's say we run a survey that includes the users' ages and we want to analyze the results by age group. To simplify the case, there will be three groups: 0-12 for kids, 13-19 for teenagers and 20+ for adults. There could be many more, but in this case every group is classified by the age of its members.

Specifications


The applicable Python function for this approach is pandas.cut. After determining the bins with specific limits and a label for each bin, the classification of values can begin. The cut function takes the column to classify, the bin edges and one label per bin (one label fewer than edges). The Target Table is set to Mode: Append for this scenario since a new column is created. Make sure to include the new calculated column in the data flow before deploying the Target Table; there is no way to add it afterwards if forgotten. If a calculated column of type String is created, pass an empty String to the column.

Note! In case of a missing column in the Target Table, a new one needs to be created and deployed before executing the Data Flow.

The Hands-On Approach


Given a dataset of Sales Orders including the net amount, the goal is to classify the deals into groups by declaring them as small, mid-size, big or huge deals. To do so, the idea is to use bin limits of 0-4999, 5000-9999, and so on. The script below can be used for this, applied to a column named 'NETAMOUNT'. The calculated column to populate with labels is named 'Deal Type'.
def transform(data):

    # Bin edges and one label per bin
    bins = [0, 5000, 10000, 50000, 100000]
    labels = ['small deal', 'mid-size deal', 'big deal', 'huge deal']
    data['Deal Type'] = pd.cut(data['NETAMOUNT'], bins=bins, labels=labels)

    return data

After passing the data to our newly created column, there is a single new column in our solution with a String entry for every deal size. The following image shows the section of our column with the result.



Recap


The steps to create bins are easy for reproduction in many use cases:

  1. Decide what the bins should look like and determine them

  2. Create all the necessary labels for the bins to be declared

  3. Create a column to save the created values in - Make sure that the data type of the columns fits your entries in the labels variable passed

  4. Create a function to pass the data produced by cut into the column prior created

  5. Return the solution and use the append mode in the target table to write data


Note! In case you run the script several times and change the output for the calculated column, switch the Target Table to truncate to be able to overwrite existing values.

Task 2 - Datatype conversion from Datetime to String


In this task, we want to tackle a well-known problem with datatypes. In some cases, data arrives in the wrong format and needs to be transformed. For newly uploaded files there is an option to preprocess data and change datatypes, but not for already existing tables. Python, however, can transform a whole column of our table from one type to another, in this approach from Datetime to String. This approach can also be used to pass only parts of a date, such as the year, or to apply a specific formatting. It is even possible to convert a Datetime to a Unix timestamp with the right script.

Specifications


In this scenario, several functions are available depending on the direction of the conversion. Always keep in mind that within our data frame the data is delivered as a series of values. For the Datetime-to-String direction, we use the function dt.strftime. To specify how the function turns a timestamp into a string, we need to define the structure and the parameters returned, as explained in the documentation. This is helpful not only to return a String from our Datetime but also to declare a formatting for our timestamps. For this approach, we set the Target Table to Mode: Append and create a calculated column of type String.

The Hands-On Approach


To pass data from our Datetime column to a string column, we write the values into the newly created calculated column. There are many options for transforming the timestamp, for example passing only the year, a combination of year and month, or any other values the timestamp includes. The following script takes our 'CHANGEDAT' column and returns a string in the format 'Year-Month-Day'.
def transform(data):

    # Format the timestamp as 'Year-Month-Day', e.g. 2023-12-24
    data['CHANGEDATString'] = data['CHANGEDAT'].dt.strftime('%Y-%m-%d')

    return data

The time format used can vary from one country to another or may even depend on a formatting rule given by a system. By transforming the Datetime into a String, we can match the required formatting by deciding whether to separate year and month with - or / and in which order year, month and day appear. The extract of the target table below shows the output of the script above.
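If a different target format is required, only the format string changes. A sketch assuming the same 'CHANGEDAT' and 'CHANGEDATString' columns:
def transform(data):

    # Day/month/year separated by slashes, e.g. 24/12/2023
    data['CHANGEDATString'] = data['CHANGEDAT'].dt.strftime('%d/%m/%Y')

    return data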



Recap


To transform a Datetime into a string, there are only a few steps necessary as follows:

  1. Choose the datetime column to transform

  2. Create a new calculated column as String

  3. Run the script above and decide for a formatting rule to apply

  4. Set the Target Table to append mode, or to truncate if you want to rerun the script with another formatting


Task 3 - Extraction of Substrings


In this task, the SAP_SC_SALES_BusinessPartners table is used. In the Web Address column, we want to remove the http://www. part of the address and some space. This can be done using the vectorized string functions of pandas.

Specifications


To shorten a String within a table, the str accessor can be used to take the substring of the web address starting at a specific character position in the given String. To write the output back into the same table, the Target Table is set to Mode: Truncate, and no additional columns are needed to save any values.

The Hands-On Approach


Given the dataset including the 'WEBADDRESS' column with String entries, we take the substring of the web address from the 12th character to the end of the string. To do so, we use the functionality shown in the script below.
def transform(data):

    # 'http://www.' is 11 characters long, so keep everything from index 11 onwards
    data['WEBADDRESS'] = data['WEBADDRESS'].str[11:]

    return data

As shown below, the transformation of the column 'WEBADDRESS' is passed into the same column again.


Before transformation


 


After transformation
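If the prefix length can vary, for example http:// versus https://, a regular-expression replace is an alternative to the fixed slice above. A sketch, not part of the original example:
def transform(data):

    # Remove an http:// or https:// prefix and an optional www., whatever its length
    data['WEBADDRESS'] = data['WEBADDRESS'].str.replace(r'^https?://(www\.)?', '', regex=True)

    return data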



Recap


For reproducing this Task, the following guideline breaks it down:

  1. Find the column to extract the substring from and make sure it is passed as String

  2. Find the position to start the substring extraction at

  3. Create a function to overwrite the column with the str[n:] function for position n+1

  4. Return the solution and use truncate mode within the Target Table to overwrite the values


Task 4 - Generating unique Hash Keys


In some cases, it can also be important to generate a hash key for each row of the data frame. Pandas provides an out-of-the-box function for this: the pandas utilities function hash_pandas_object.

Specifications


To add a hash key to a table in SAP Data Warehouse Cloud, we use the hash_pandas_object function described above. Keep in mind that the output of this function is of type uint64 for every element of the series. To save the values in our table, we create a calculated column of type int64 so that it matches the output of the pandas function. Since the calculated column is newly created and we do not attempt to overwrite existing values, the Target Table can be set to Mode: Append.

The Hands-On Approach


Given the Dataset, we can simply use the following single line of code to create the hash keys for every row to populate the 'Hashkey' column.
def transform(data):

    # Create a hash value per row across all columns of the batch
    data['Hashkey'] = pd.util.hash_pandas_object(data)

    return data
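If the hash should depend only on selected key columns and not on the batch's row index, hash_pandas_object can also be applied to a column subset. A sketch with hypothetical column names:
def transform(data):

    # Hash only the chosen key columns; index=False ignores the row index of the batch
    data['Hashkey'] = pd.util.hash_pandas_object(
        data[['SALESORDERID', 'CUSTOMERID']], index=False)

    return data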

Recap


Since there is not much done in this script operation, there are only a few steps to reproduce this task:

  1. Create a new calculated column of type int64

  2. Create a function to populate the 'Hashkey' column prior created

  3. Return the solution and use append mode within the Target Table to fill the column with values


Task 5 - Unpivot Data


Oftentimes the data is not in the right structure. Let's take the dataset 'GDP_per_capita' from the Gross Domestic Product (GDP) data product as an example. Here, the GDP values for each year are listed in separate columns. However, we'd like to have one column for all values and one column for all years. To change the structure of the table we can use the pandas function melt. We will use three of its parameters: the columns to keep as identifier variables, the name for the unpivoted variable column and the name for the value column.

Specifications


To unpivot the given dataset, we use the melt function from the pandas library. In this task, we need to create not one but two new columns, named 'YEAR' and 'VALUE'. The dataset consists of many columns, one per year, containing the fact data. The columns passed into the script can be selected by including a projection before script execution within the data flow, as seen in the image below.



The Hands-On Approach


The melt function takes separate parameters for the identifier variables, the name to use for the variable column and the name of the value column that receives the fact data. Keep in mind that the data types of the calculated columns have to match the ones used in the script operator. In this case, both year and value are set to String, and the outputs are cast to String as seen in the following script. To return the data into our new columns, we need to create them beforehand, and the Target Table can be set to Mode: Append for the data entries. To avoid returning every single year column after script execution, make sure they are not part of the script output: go to the panel below the script editor and remove them from the data flow.
def transform(data):

    # Unpivot all year columns into YEAR/VALUE pairs
    data = data.melt(id_vars=['Country Code', 'Indicator Code',
                              'Indicator Name', 'Country Name'],
                     var_name='YEAR', value_name='VALUE')
    data['VALUE'] = data['VALUE'].astype(str)
    data['YEAR'] = data['YEAR'].astype(str)

    return data

The following data extract shows the table after using melt through the script operator. The dataset 'GDP_PER_CAPITA' can be downloaded from the data marketplace in order to reproduce the solution yourself.



Recap


To apply the unpivot functionality within the script operator using melt, there are several steps necessary as following:

  1. Get to know what the data represents in order to create the two calculated columns

  2. Create the calculated columns for the Key-Value pairs and in case you are not sure what type to choose, pick String and cast the values later on

  3. After creating the Keys and Values as calculated columns, use the script operator

  4. As id variables choose all the columns not used as keys or values

  5. In case you need the string type, cast the values with astype(str)

  6. Go into the Panel below the script editor and remove the columns you do not want to show anymore

  7. Set the Target Table to append mode for data entries and run the whole data flow after deploying the table


Task 6 - Filter with regular expressions


Suppose that due to an error in the system, the Customer ID for some customers was not correctly assigned. Now we want to filter out all the incorrect IDs using regular expressions. To do so, we use a regular expression and the method str.contains. This method checks for every row in our data frame whether the Customer ID matches the regular expression (regex). The naming convention for the Customer ID defines that it starts with four letters followed by an underscore, another character and two digits. This can be defined in our regex as .{4}_.[0-9][0-9].
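As a quick standalone illustration of what the pattern accepts (the sample IDs are made up for demonstration):
import pandas as pd

ids = pd.Series(['SAPC_A01', '1000153', 'AB_X12'])
print(ids.str.contains(r'.{4}_.[0-9][0-9]'))
# 0     True   <- four characters, an underscore, one character, two digits
# 1    False
# 2    False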

Specifications


The contains function in our scenario checks for a specific pattern, handed over as a String in regex format. The idea behind this task is to filter the input so that only Customer IDs in the correct format are kept. It is therefore necessary to set the Target Table to Mode: Truncate to make sure the other values are dropped. Since no new columns are created here, no further step is necessary.

The Hands-On Approach


As described above, we need a pattern to compare our 'CUSTOMERID' column against. Values that fit the pattern are passed to the target table; values that do not are dropped. The following script can be used for the described case.
def transform(data):

    # Keep only the rows whose CUSTOMERID matches the naming convention
    pattern = ".{4}_.[0-9][0-9]"
    applies_to_nc = data['CUSTOMERID'].str.contains(pattern)
    data = data[applies_to_nc]

    return data

The following extract of the source table shows what the data looks like before executing the script above. After executing the script, values like 1000153 are no longer part of the table, and the rows including these values are not passed to the target table.
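Note that str.contains accepts a row as soon as the pattern occurs anywhere inside the ID. If the whole ID has to follow the convention exactly, str.fullmatch (available in recent pandas versions) is the stricter alternative, shown here as a variation rather than as part of the original script:
def transform(data):

    pattern = ".{4}_.[0-9][0-9]"
    # Keep only rows whose complete CUSTOMERID matches the pattern
    data = data[data['CUSTOMERID'].str.fullmatch(pattern)]

    return data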

Recap


For filtering with regular expressions and a given pattern, there are only a few steps necessary:

  1. Decide which column you need to compare to a given pattern

  2. Create a correct pattern to apply to the data

  3. Use the str.contains function for the data frame column using the defined pattern

  4. Return the values to the target table using the truncate mode to overwrite existing values


Conclusion


As described in this blog post, there are some interesting use cases where we can apply the Script Operator. One important step when applying the methods above is to always keep the existing restrictions and limitations in mind. As described before, the most common limitations are due to batching and dynamic column creation.

Both of these are SAP Data Warehouse Cloud limitations. If you want to overcome them, try connecting the SAP Data Warehouse Cloud data to a Python environment as described in Task 0. If you build a connection from a Python notebook to an SAP Data Warehouse Cloud Space and have the rights to use HANA ML, it is also possible to work with machine learning algorithms in order to analyze the data.

If there are any further questions about the script operator in SAP Data Warehouse Cloud, feel free to ask questions in the comment section.