
SAP Datasphere is a subset of the new SAP Business Data Cloud offering, which aims to enable every data professional to deliver seamless and scalable access to mission-critical business data.
Datasphere offers a host of connectors to all kinds of sources, SAP and non-SAP alike. One of the most versatile is the "Generic OData" connection, which allows consumption of a wide range of sources.
In this blog, we will see how we can utilize the Data Flow feature to consume complex OData responses.
Note: This is a rather niche guide to the Script operator in Data Flows. If you are looking for more high-level information, I would recommend the official SAP Datasphere documentation, or the Datasphere tag to browse blog posts.
The Generic OData connector in Datasphere can't "flatten" a nested structure and will end up storing nested structures as strings. The Script operator in Data Flows allows developers to handle complex OData entities with Python, but it can be cumbersome, to say the least.
Your best option is to consider some sort of middleware or integration platform between the source and Datasphere that allows you to handle OData entities properly.
If you want to follow along, you will need:
I want to be transparent going into this: I don't think this is a very good or elegant solution.
In an ideal world, I would prefer some sort of integration platform, CPI or something else, between the OData service and Datasphere to correctly unpack and transform the data.
If you don't have that, you may consider the following.
In your development space, you can set up a connection to the services.
OData.org exposes both V2 and V4 OData services.
Clicking on the service will provide you with the service URL:
In Datasphere, create a Generic OData Connection:
Add the /Suppliers entity set to a new Data Flow and preview the data.
Note: You may get an error that the source table is not "Deployed".
If you do, deploy the table and preview the data again.
If you try to consume the service in a tool like Postman, or just through a Python request, you will get the following JSON response:
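For reference, here is a minimal sketch (run outside Datasphere, e.g. on your own machine) of what that looks like - the URL below is the public OData.org V2 reference service, so adjust it if you picked a different source:

import requests

# Public OData.org V2 reference service - adjust if you chose another source
url = "https://services.odata.org/V2/OData/OData.svc/Suppliers"
response = requests.get(url, params={"$format": "json"})
suppliers = response.json()["d"]["results"]

# Each supplier entry is a nested dictionary, roughly of this shape
# (metadata keys omitted):
# {
#   "ID": 0,
#   "Name": "Exotic Liquids",
#   "Address": {
#     "Street": "NE 228th", "City": "Sammamish", "State": "WA",
#     "ZipCode": "98074", "Country": "USA"
#   },
#   "Concurrency": 0
# }
print(suppliers[0]["Address"]["City"])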
You can see that there are two dictionaries here: the top-level properties, which contain the keys (1) ID, (2) Name, (3) Address, and (4) Concurrency, and another nested under Address, which contains Street, City, State, ZipCode, and Country.
There are many ways to interpret this data, but the easiest would be to unpack it into the following table:
| ID | Name | Address_Street | Address_City | Address_State | Address_ZipCode | Country | Concurrency |
|---|---|---|---|---|---|---|---|
| 0 | Exotic Liquids | NE 228th | Sammamish | WA | 98074 | USA | 0 |
As you can see, Address is unpacked into its five columns, and we get a lovely, simple table.
However, the table deployed in Step 2 in Datasphere looks like this:
Which is an absolute nightmare. Extracting the data in SQL views (graphical or otherwise) will be substring-hell, as there are no JSON interpreters in HANA SQLScript.
Alright, now let's transform the data with a Python Script.
Firstly, add a script to your canvas and connect the source.
The schema of the script operator is not dynamic. That means you will need to manually add the required columns to the operator:
The absolute 'funniest' quirk is that the output of the script is based on the order of the columns, not the names / keys.
That means, if your resulting DataFrame has this schema:
ID | Name | Address_Street | Address_City | Address_State | Address_ZipCode | Country | Concurrency |
The output columns defined on the script operator must be in exactly the same order.
Otherwise the value of ID may end up being written to, say, the "State" column.
Fun stuff 🙂
Drag and drop until your columns are in the required order:
Note: This problem may very well be solved in the future, or it may be working-as-designed and always just be a massive pain in the donkey.
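In the meantime, a defensive habit I would suggest (my own workaround, not a Datasphere feature) is to end the script by forcing the DataFrame into the exact column order you defined on the operator, so an accidental swap does not silently corrupt the output:

# Place this just before "return data" inside the transform function.
# The list below uses the example schema from the table above - replace it
# with the columns you actually defined on the script operator, in that order.
expected_order = ["ID", "Name", "Address_Street", "Address_City",
                  "Address_State", "Address_ZipCode", "Country", "Concurrency"]
data = data[expected_order]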
Alrighty, now that we have the setup in place, we can begin to create a Python script to solve our problem.
The script, in simple terms, has to do two things:
However, the Python Environment in Datasphere is highly restricted and does not allow for additional imports.
At the time of writing, the following libraries are supported:
You can see the full list of supported and disabled features in the documentation.
The base code editor looks like this:
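In plain text, the template is roughly the following (docstring abridged here - the full wording appears in the complete script at the end of this post):

def transform(data):
    """
    This function body should contain all the desired transformations on incoming DataFrame.
    ...
    :param data: Pandas DataFrame
    :return: Pandas DataFrame
    """
    #####################################################

    #####################################################
    return data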
As you may notice, the script takes the form of a function - transform - that takes one parameter: data. The source data will be read just like you see it in the Data Preview, as a Pandas DataFrame called 'data'.
A section has been marked off between two rows of #, which is where you will add your code.
Note: At the time of writing, the editor has no code completion or helper utilities of any kind, and it does not allow you to place breakpoints or write output to the console - or anywhere else.
This makes debugging extremely tedious.
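One workaround I can suggest (my own habit, not an official feature) is to catch exceptions and write the error text into an extra output column, so it at least shows up in the target table. A rough sketch, assuming you also add a 'debug' column to the operator's output:

def transform(data):
    try:
        # ... your actual transformations go here ...
        data["debug"] = ""
    except Exception as exc:
        # There is no console, so surface the error in the data itself
        data["debug"] = str(exc)
    return data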
The JSON data is stored in Datasphere as a string, and we have to transform it without the json module, which would otherwise make the exercise trivial.
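For contrast, outside the sandbox the whole exercise would boil down to a couple of lines - a hypothetical example, assuming the stored string is valid JSON (if it uses single quotes, ast.literal_eval would be the equivalent):

import json
import pandas as pd

# Hypothetical example of the kind of string stored in the Address column
address_str = '{"Street": "NE 228th", "City": "Sammamish", "State": "WA", "ZipCode": "98074", "Country": "USA"}'
flat = pd.json_normalize(json.loads(address_str))
print(flat.columns.tolist())  # ['Street', 'City', 'State', 'ZipCode', 'Country']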
So, we will do the following:
To accomplish step 1, we will use a regular expression that can identify the key-value pairs. The following formats are supported:
That is, it does not matter whether your data uses ' or ".
r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'
Yes, I know it looks like a pocket-calculator having a stroke, but it works.
You can use the regex-tester to play around with it for yourself.
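To see what the pattern actually captures, here is a small standalone sketch (runnable anywhere outside Datasphere) applied to a shortened version of the example string used further down:

import re

key_val_pattern = r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'
sample = "{'ID': 0, 'Name': 'Exotic Liquids', 'Address': { 'Street': 'NE 228th', 'City': 'Sammamish' }}"

# re.findall returns one (key, value) tuple per match; a nested dictionary is
# captured as one raw string, which we will later parse recursively.
for key, value in re.findall(key_val_pattern, sample):
    print(key, "->", value)
# Output:
# ID -> 0
# Name -> 'Exotic Liquids'
# Address -> { 'Street': 'NE 228th', 'City': 'Sammamish' }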
Python allows a function to be defined within another function, so we can begin to package our logic into a function.
def process_dict(dict_str):
    # Assume the input string is a dictionary and parse it
    key_val_pattern = r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'  # Matches key-value pairs, supporting ' and "
    matches = re.findall(key_val_pattern, dict_str)
The variable "matches" will contain a list of all the (key, value) pairs in the input.
Now, we can loop through the matches and create our dictionary:
def process_dict(dict_str) -> dict:
    # Assume the input string is a dictionary and parse it
    key_val_pattern = r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'  # Matches key-value pairs, supporting ' and "
    matches = re.findall(key_val_pattern, dict_str)
    parsed_dict = {}
    for key, value in matches:
        # Store keys as lower case, so lookups are case-insensitive
        if value.startswith('{'):  # If the value is a nested dictionary
            # Recursively process the nested dictionary
            parsed_dict[str(key).lower()] = process_dict(value)
        elif value.startswith('"') or value.startswith("'"):  # If the value is a string
            parsed_dict[str(key).lower()] = value.strip('"').strip("'")  # Remove quotes (either ' or ")
        else:  # Otherwise the value is a number
            parsed_dict[str(key).lower()] = int(value)
    return parsed_dict
Note that in line 9, we check for a nested dictionary and handle it recursively.
If we pass this string
"{'ID': 0, 'Name': 'Exotic Liquids', 'Address': { 'Street': 'NE 228th', 'City': 'Sammamish', 'State': WA', 'ZipCode': 98074, 'Country': 'USA' }, 'Concurrency': 0 }"
We will get the following result:
{'id': 0, 'name': 'Exotic Liquids', 'address': { 'street': 'NE 228th', 'city': 'Sammamish', 'state': 'WA', 'zipcode': 98074, 'country': 'USA' }, 'concurrency': 0 }
Which we can then use to get the data we need.
Note: The full script also contains handling for a list of dictionaries.
Now that you have your data in a variable, you can manipulate the DataFrame.
There are countless ways of implementing this, but the most readable is a new function that uses the one above and extracts the required info column by column.
There are faster ways, but for the sake of the example, readability > speed.
def get_address_data(row, newcolumn: str):
    address = row["Address"]
    address_info = process_dict(address)
    return_value = address_info["address"].get(newcolumn.lower(), '')
    return return_value

data["State"] = data.apply(lambda row: get_address_data(row, "State"), axis=1)
In line 7, we create a new column, State, and pass in the entire row along with the name of the value we want out.
pandas then iterates through the DataFrame and applies the function to every row.
The full script will then look like this.
def transform(data):
    """
    This function body should contain all the desired transformations on incoming DataFrame. Permitted builtin functions
    as well as permitted NumPy and Pandas objects and functions are available inside this function.
    Permitted NumPy and Pandas objects and functions can be used with aliases 'np' and 'pd' respectively.
    This function executes in a sandbox mode. Please refer the documentation for permitted objects and functions. Using
    any restricted functions or objects would cause an internal exception and result in a pipeline failure.
    Any code outside this function body will not be executed and inclusion of such code is discouraged.
    :param data: Pandas DataFrame
    :return: Pandas DataFrame
    """
    #####################################################
    def parse_string_to_list(data_str):
        result_list = []
        if data_str in ['', '-', 'NULL', None]:
            return result_list
        # Check if the string represents a list
        if data_str.startswith('['):
            # Extract individual elements of the list using regex
            # Matches nested dictionaries or flat dictionaries
            list_pattern = r'\{[^{}]*\{[^{}]*\}[^{}]*\}|\{[^{}]*\}'
            list_items = re.findall(list_pattern, data_str)
        else:
            # If it's not a list, treat the whole string as a single dictionary
            list_items = [data_str]
        # Step 2: Process each item in the list (or the single string)
        for item in list_items:
            result_list.append(process_dict(item))
        return result_list

    def process_dict(dict_str) -> dict:
        # Assume the input string is a dictionary and parse it
        key_val_pattern = r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'  # Matches key-value pairs, supporting ' and "
        matches = re.findall(key_val_pattern, dict_str)
        parsed_dict = {}
        for key, value in matches:
            # Store keys as lower case, so lookups are case-insensitive
            if value.startswith('{'):  # If the value is a nested dictionary
                # Recursively process the nested dictionary
                parsed_dict[str(key).lower()] = process_dict(value)
            elif value.startswith('"') or value.startswith("'"):  # If the value is a string
                parsed_dict[str(key).lower()] = value.strip('"').strip("'")  # Remove quotes (either ' or ")
            else:  # Otherwise the value is a number
                parsed_dict[str(key).lower()] = int(value)
        return parsed_dict

    def get_address_data(row, newcolumn: str):
        # Get the value of the "Address" column
        address = row["Address"]
        # Process the input and read the first element of the list
        address_info = parse_string_to_list(address)[0]
        # Get the requested value, falling back to an empty string
        return_value = address_info["address"].get(newcolumn.lower(), '')
        return return_value

    data["State"] = data.apply(lambda row: get_address_data(row, "State"), axis=1)
    data["Zip"] = data.apply(lambda row: get_address_data(row, "zipcode"), axis=1)
    data["Street"] = data.apply(lambda row: get_address_data(row, "Street"), axis=1)
    data["Country"] = data.apply(lambda row: get_address_data(row, "Country"), axis=1)
    data["City"] = data.apply(lambda row: get_address_data(row, "City"), axis=1)
    #####################################################
    return data
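As mentioned earlier, the column-by-column approach favours readability over speed. If the row count ever becomes a problem, a variant I would consider (a sketch of my own, not part of the script above) is to parse the Address string once per row and expand it into all five columns in a single pass. It would replace the five .apply calls inside transform:

def explode_address(address_str):
    # Parse the Address string once and return all five values as a Series,
    # which pandas then expands into the listed columns
    parsed = parse_string_to_list(address_str)
    address = parsed[0].get("address", {}) if parsed else {}
    return pd.Series({
        "State": address.get("state", ''),
        "Zip": address.get("zipcode", ''),
        "Street": address.get("street", ''),
        "Country": address.get("country", ''),
        "City": address.get("city", ''),
    })

data[["State", "Zip", "Street", "Country", "City"]] = data["Address"].apply(explode_address)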
When you've added a target local table, you can run the data flow, and you should get the following result:
Support for nested OData entities, and for Python code in general, in Datasphere is lacking, and will hopefully be expanded greatly in the future.
Whether you have the option to change the OData source directly or you are consuming a public service, you should seriously consider some sort of middleware between the source and Datasphere.
Until then, you do have the option to manually transform JSON stored as strings in Datasphere, but this can be a massive headache.