Introduction

SAP Datasphere is a subset of the new SAP Business Data Cloud offering, which aims to enable every data professional to deliver seamless and scalable access to mission-critical business data.

Datasphere offers a host of connectors to all kinds of sources, SAP and non-SAP alike. One of the most versatile is the "Generic OData" connection, which allows consumption of a wide range of sources.  

In this blog, we will see how we can utilize the Data Flow feature to consume complex OData responses. 

Note: This is a rather niche guide to the Script Operator in Data Flows. If you were looking for more high-level information, I would recommend the official SAP Datasphere documentation, or browsing blog posts under the Datasphere tag.

Summary

The Generic OData connector in Datasphere can't "flatten" a nested structure, and will end up storing nested structures as strings. The script operator in Data Flows allows developers to handle complex OData entities with Python, but can be cumbersome to say the least.

Your best option is to consider some sort of middleware or integration platform between the source and Datasphere that allows you to handle OData entities properly.

Prerequisites 

If you want to follow along, you will need:

  1. Access to a Datasphere Tenant
  2. Access to a development space, and the correct authorizations.
  3. A basic idea of Datasphere objects: Data Flows and Local Tables 
  4. An OData source (I will use the OData.org reference services, so you should be able to reproduce the results)
  5. An understanding of the Python module Pandas, and dataframes
  6. A deep and seething hatred for regular expressions, while still seeing their value. (I really do hate regular expressions.)

Disclaimer: I don't like this solution any more than you do

I want to be transparent going into this: I don't think this is a very good or elegant solution. 
In an ideal world, I would prefer some sort of integration platform, CPI or something else, between the OData service and Datasphere to correctly unpack and transform the data.

If you don't have that, you may consider the following. 

Step 1: Establish the connections

In your development space, you can set up a connection to the services.
OData.org exposes both V2 and V4 OData services.

clsorensen911_4-1745567846542.png

Clicking on the service will provide you with the service URL:

clsorensen911_5-1745567909646.png

In Datasphere, create a Generic OData Connection:

clsorensen911_6-1745567997899.png

Step 2: Consume /Suppliers in a Data Flow

Add the /Suppliers entity set to a new Data Flow and preview the data.

clsorensen911_8-1745568141294.png

Note: You may get an error that the source table is not "Deployed". 
If you do, deploy the table and preview the data again.

Step 3 (Optional): Understanding the problem

If you try to consume the service in a tool like Postman, or just through a Python request, you will get the following JSON response:

clsorensen911_9-1745569294082.png

You can see that there are two dictionaries here: Properties, which contains the keys (1) ID, (2) Name, (3) Address, and (4) Concurrency, and another nested under Address, which contains Street, City, State, ZipCode and Country. 
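
If you would rather poke at the service yourself than squint at a screenshot, here is a minimal sketch of the same call in Python - run locally, not in Datasphere, and assuming the public V2 reference service URL:

import requests

# Assumption: the OData.org V2 reference service; swap in your own service URL.
url = "https://services.odata.org/V2/OData/OData.svc/Suppliers"
response = requests.get(url, params={"$format": "json"}, timeout=30)

# OData V2 wraps the payload in "d" (often with a "results" list for collections).
payload = response.json()["d"]
suppliers = payload["results"] if isinstance(payload, dict) else payload

# Each supplier carries Address as a nested dictionary, not as flat columns.
print(suppliers[0]["Name"])     # 'Exotic Liquids'
print(suppliers[0]["Address"])  # {'Street': 'NE 228th', 'City': 'Sammamish', ...}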

There are many ways to interpret this data, but the easiest would be to unpack it into the following table:

ID | Name | Address_Street | Address_City | Address_State | Address_ZipCode | Country | Concurrency
0 | Exotic Liquids | NE 228th | Sammamish | WA | 98074 | USA | 0

As you can see, Address is unpacked into its own columns, and we get a lovely, simple table. 

However, the table deployed to Datasphere in Step 2 looks like this:

clsorensen911_10-1745569839164.png

Which is an absolute nightmare. Extracting the data in SQL views (graphical or otherwise) will be substring-hell, as there are no JSON interpreters in HANA SQLScript.  

Step 4: Add a Script Operator

Alright, now let's transform the data with a Python Script.

First, add a Script Operator to your canvas and connect the source.

clsorensen911_11-1745570218358.png

Step 5: Add the new columns to the script operator

The schema of the script operator is not dynamic. That means you will need to manually add the required columns to the operator:

clsorensen911_12-1745570438828.png

Step 6: Reorder the columns

The absolute 'funniest' quirk is that the output of the script is based on the order of the columns, not the names / keys.

That means, if your resulting DataFrame has this schema:

ID | Name | Address_Street | Address_City | Address_State | Address_ZipCode | Country | Concurrency

The columns must be in the same order.
Otherwise the value of ID will be written to the "State" column.
Fun stuff 🙂

Drag and drop until your columns are in the required order:

clsorensen911_13-1745570864565.png

Note: This problem may very well be solved in the future, or it may be working-as-designed and always just be a massive pain in the donkey.
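
If you want a belt-and-braces approach on top of the drag-and-drop, a small sketch (assuming the column order from the table above) is to explicitly reorder the DataFrame as the very last step of the script, just before return data, so its column order always matches what you configured on the operator:

# Assumption: these are the columns defined on the script operator, in exactly this order.
target_order = [
    "ID", "Name", "Address_Street", "Address_City",
    "Address_State", "Address_ZipCode", "Country", "Concurrency",
]

# Reindexing by name guarantees the positional order matches the operator schema;
# any column missing from the DataFrame is created and filled with an empty string.
data = data.reindex(columns=target_order, fill_value="")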

Step 7: Let's code away!

Alrighty, so now that we have the setup in place, we can begin to create a Python script to solve our problem. 

Script goals

The script, in simple terms, has to do two things:

  1. Take the JSON string from the input, and convert it to a dictionary or a list of dictionaries. 
  2. Read the values from the result of step 1 and write them to the dataframe. 

Limitations

However, the Python Environment in Datasphere is highly restricted and does not allow for additional imports. 
At the time of writing, the following libraries are supported:

  • time
  • dateutil.parser
  • datetime
  • calendar
  • random
  • math
  • re
  • Pandas (except I/O, like read_csv, to_csv, etc.)
  • NumPy (except I/O, like fromfile, tofile, etc.)
  • builtins (except I/O, like print, breakpoint, input, open, import, exec, eval, etc.)

You can see the full list of supported and disabled features in the documentation.

Step 7.1: The code editor

The base code editor looks like this:

clsorensen911_0-1745572477575.png

As you may notice, the script takes the form of a function - transform - that takes one parameter: data. The source data is passed in just like you see it in the Data Preview, as a Pandas DataFrame called 'data'. 
A section has been marked off between two rows of #, which is where you will add your code. 

Note: At the time of writing, the editor has no code completion or any sort of help or utilities, and does not allow you to place breakpoints or write output to the console - or anywhere else. 
This makes debugging extremely tedious. 
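
One crude workaround I can offer (a sketch, not an official feature) is to smuggle diagnostic values out through an extra string column on the output, since the target table is the only place the script can write to anyway:

# Hypothetical example: add a spare "Debug" string column to the operator's schema,
# then fill it inside transform() so the value ends up in the target table.
def transform(data):
    try:
        # ... actual transformation logic goes here ...
        data["Debug"] = "rows processed: " + str(len(data))
    except Exception as err:
        # Surface the error text as data instead of failing the whole pipeline
        data["Debug"] = "error: " + str(err)
    return data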

Step 7.2: Script - Convert string to a Dictionary or List object.

The JSON data is stored in Datasphere as a string, and we have to transform that without the JSON module, which would make the exercise trivial. 

So, we will do the following: 

  1. Find all the 'key' : 'value' pairs in the data - make sure we can also handle nested dictionaries. 
  2. Loop through the pairs and add them to a dictionary. 

To accomplish step 1, we will use regular expressions that can identify the pairs. The following formats are supported:

  • 'Key' : 'Value' (with or without whitespace around the :)
  • "Key" : "Value" (with or without whitespace around the :)

That is, it doesn't matter whether your data uses ' or ". 

r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'

Yes, I know it looks like a pocket-calculator having a stroke, but it works. 
You can use the regex-tester to play around with it for yourself.

Python allows a function to be defined within another function, so we can begin to package our logic into a function. 

def process_dict(dict_str):
    # Assume the input string is a dictionary and parse it
    key_val_pattern = r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'  # Matches key-value pairs, supporting ' and "
    matches = re.findall(key_val_pattern, dict_str)

The variable "matches" will contain a list of all the pairs in the input. 
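
To make that concrete, here is what the pattern returns when run over a (shortened) supplier string - note how the nested Address dictionary is captured whole as a single value. You can try this locally, outside the sandbox:

import re

key_val_pattern = r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'
sample = "{'ID': 0, 'Name': 'Exotic Liquids', 'Address': { 'Street': 'NE 228th', 'City': 'Sammamish' }, 'Concurrency': 0 }"

matches = re.findall(key_val_pattern, sample)
# [('ID', '0'),
#  ('Name', "'Exotic Liquids'"),
#  ('Address', "{ 'Street': 'NE 228th', 'City': 'Sammamish' }"),
#  ('Concurrency', '0')]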

Now, we can loop through the matches and create our dictionary:

def process_dict(dict_str) -> dict:
    # Assume the input string is a dictionary and parse it
    key_val_pattern = r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'  # Matches key-value pairs, supporting ' and "
    matches = re.findall(key_val_pattern, dict_str)
    
    parsed_dict = {}
    for key, value in matches:
        # Store keys as lower case, so lookups are case-insensitive
        if value.startswith('{'):  # If the value is a nested dictionary
            # Recursively process the nested dictionary
            parsed_dict[str(key).lower()] = process_dict(value)
        elif value.startswith('"') or value.startswith("'"):  # If the value is a string
            parsed_dict[str(key).lower()] = value.strip('"').strip("'")  # Remove quotes (either ' or ")
        else:  # Otherwise the value is a plain number matched by \d+
            parsed_dict[str(key).lower()] = int(value)
    
    return parsed_dict

Note that in line 9, we check for a nested dictionary and handle it recursively. 

If we pass this string

"{'ID': 0, 'Name': 'Exotic Liquids', 'Address': { 'Street': 'NE 228th', 'City': 'Sammamish', 'State':  WA', 'ZipCode': 98074, 'Country': 'USA' }, 'Concurrency': 0 }"

We will get the following result:

{'id': 0, 'name': 'Exotic Liquids', 'address': { 'street': 'NE 228th', 'city': 'Sammamish', 'state': 'WA', 'zipcode': 98074, 'country': 'USA' }, 'concurrency': 0 }

Which we can then use to get the data we need. 

Note: The full script also contains handling for a list of dictionaries. 

Step 7.3: Add the data to the DataFrame

Now that you have your data in a variable, you can manipulate the DataFrame.

There are countless ways of implementing this, but the most readable is a new function that uses the one above and extracts the required info column by column. 
There are faster ways, but for the sake of the example, readability > speed. 

def get_address_data(row, newcolumn : str):
    address = row["Address"]
    address_info = process_dict(address)
    return_value = address_info["address"].get(newcolumn.lower(),'')
    return return_value

data["State"] = data.apply(lambda row: get_address_data(row, "State"), axis = 1)

In line 7, we create a new column, State, and pass the entire row along with the name of the value we want out. 

Pandas then iterates through the DataFrame and applies the function to each row. 
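
For completeness, here is a sketch of one of those "faster ways": parse each Address string once and assign all the derived columns in a single pass, instead of re-parsing the same string for every new column. The column names are just the ones used in this example.

def split_address(address_str):
    # Parse the stored string once and pull out all nested values in a fixed order
    address_info = process_dict(address_str).get("address", {})
    return pd.Series({
        "Street": address_info.get("street", ""),
        "City": address_info.get("city", ""),
        "State": address_info.get("state", ""),
        "Zip": address_info.get("zipcode", ""),
        "Country": address_info.get("country", ""),
    })

# One pass over the Address column instead of one apply() per derived column
data[["Street", "City", "State", "Zip", "Country"]] = data["Address"].apply(split_address)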

Step 8: The full script

The full script will then look like this:

def transform(data):
    """
    This function body should contain all the desired transformations on incoming DataFrame. Permitted builtin functions
    as well as permitted NumPy and Pandas objects and functions are available inside this function.
    Permitted NumPy and Pandas objects and functions can be used with aliases 'np' and 'pd' respectively.
    This function executes in a sandbox mode. Please refer the documentation for permitted objects and functions. Using
    any restricted functions or objects would cause an internal exception and result in a pipeline failure.
    Any code outside this function body will not be executed and inclusion of such code is discouraged.
    :param data: Pandas DataFrame
    :return: Pandas DataFrame
    """
    #####################################################
    def parse_string_to_list(data_str):
        result_list = []
        if ( data_str in ['','-','NULL',None] ):
            return result_list
        # Check if the string represents a list
        if data_str.startswith('['):
            # Extract individual elements of the list using regex
            # Matches nested dictionaries or flat dictionaries
            list_pattern = r'\{[^{}]*\{[^{}]*\}[^{}]*\}|\{[^{}]*\}'  
            list_items = re.findall(list_pattern, data_str)
        else:  
        # If it's not a list, treat the whole string as a single dictionary
            list_items = [data_str]

        # Step 2: Process each item in the list (or the single string)
        for item in list_items:
            result_list.append(process_dict(item))

        return result_list

    def process_dict(dict_str) -> dict:
        # Assume the input string is a dictionary and parse it
        key_val_pattern = r'["\'](\w+)["\']\s*:\s*(\{.*?\}|["\'].*?["\']|\d+)'  # Matches key-value pairs, supporting ' and "
        matches = re.findall(key_val_pattern, dict_str)
        
        parsed_dict = {}
        for key, value in matches:
            # Store keys as lower case, so lookups are case-insensitive
            if value.startswith('{'):  # If the value is a nested dictionary
                # Recursively process the nested dictionary
                parsed_dict[str(key).lower()] = process_dict(value)
            elif value.startswith('"') or value.startswith("'"):  # If the value is a string
                parsed_dict[str(key).lower()] = value.strip('"').strip("'")  # Remove quotes (either ' or ")
            else:  # Otherwise the value is a plain number matched by \d+
                parsed_dict[str(key).lower()] = int(value)
        
        return parsed_dict

    def get_address_data(row, newcolumn : str):
        # Get the value of the "Address" column
        address = row["Address"]
        # Process the input and read the first element of the list
        address_info = parse_string_to_list(address)[0]
        # Get the requested value, returning an empty string if the key is missing
        return_value = address_info["address"].get(newcolumn.lower(),'')
        
        return return_value

    data["State"] = data.apply(lambda row: get_address_data(row, "State"), axis = 1)
    data["Zip"] = data.apply(lambda row: get_address_data(row, "zipcode"), axis = 1)
    data["Street"] = data.apply(lambda row: get_address_data(row, "Street"), axis = 1)
    data["Country"] = data.apply(lambda row: get_address_data(row, "Country"), axis = 1)
    data["City"] = data.apply(lambda row: get_address_data(row, "City"), axis = 1)
    #####################################################
    return data

Step 9: Add or select target table and run the Data Flow

When you've added a target local table, you can run the data flow, and you should get the following result:

clsorensen911_2-1745577716495.png

Conclusion

The support for nested OData entities, and Python code, in Datasphere is lacking, and will hopefully be expanded greatly in the future. 
Whether you have the option to change the OData source directly, or you are consuming a public service, you should really consider some sort of middleware between the source and Datasphere. 

Until then, you do have the option to manually transform JSON, stored as strings, in Datasphere, but this can be a massive headache. 

 
