
Introduction


It is important to analyze the usage of the pipelines run in SAP Data Intelligence (DI) Modeler. DI Monitoring already gives you access to some of the relevant information, such as graph status or runtime analysis, but not all graph statistics are available there. Have you been wondering which pipelines were run most frequently, who is using which data sources, or even which data sources were frequently used together?

In this blog post, you will see how to extract such information with DI and store it in SAP Data Warehouse Cloud (DWC). To top things off, you can visualize the data with the help of SAP Analytics Cloud (SAC), making it easier to get an overview of the pipeline details.

This is an internship project, and I’d like to share what I have learned along the way: how to use data pipelines, write custom Generation 2 operators, connect DI to DWC, and create SAC stories.

After reading this post, you will be able to answer the following:

  • What the overall graph statistics look like, including runtime status, user distribution, and time series analysis

  • Which graphs are run the most

  • Which data sources are used the most

  • Who is using which data sources

  • Which data sources are used in each graph

  • Which data sources are frequently used together


 

Challenges


One of the biggest challenges is retrieving the pipelines of all users together with their graph details, such as the data sources used in each graph. To do so, I suggest using the DI REST API. In particular, the Pipeline Engine API allows us to access details of data processing pipelines and facilitates the automation of operational tasks such as creating, starting, and monitoring pipelines.

For this project, I created custom Python-based Generation 2 operators that gather graph information from the REST API and store the data in DWC via a pipeline. The resulting datasets were then visualized with SAC.

 

Prerequisites


As we create custom Gen2 operators in DI, basic knowledge of Gen2 operators and of how to create a DI custom operator is necessary. Moreover, your DI tenant should be connected to DWC to store the outputs of the pipeline. Lastly, for visualization, DWC and SAC should be connected.

If you are not yet familiar with the above, here are the resources that can help:

  • Introduction of DI Generation 2 Operators: Blog post

  • How to create a DI custom operator: Blog post

  • How to connect DI to DWC: Tutorial

  • How to connect DWC to SAC: Tutorial


 

Extract and Store Pipeline Details in DI


Below is the pipeline created in DI.


Let’s break down the pipeline into 3 parts.

 

1. Get pipeline details of all users using REST APIs


First, custom operators fetch the user list of a DI tenant and the users’ pipeline specifics, including the data sources used in each pipeline.

Here are the custom operators created using the REST API:

Get Users


In order to retrieve graphs for all users, not just my own, a complete user list is necessary. There are multiple ways to get the user list, such as using a vctl command or checking DI System Management. However, for simplicity, this operator was built with the DI API endpoint ‘/auth/v2/user’. Although it is not an officially published endpoint, it serves our purpose in this project when you do not already have a user list at hand.

Here is a snippet of my script:
import requests

# get all users
def get_users(connection):
    restapi = "/auth/v2/user"
    url = connection['url'] + restapi
    headers = {'X-Requested-With': 'XMLHttpRequest'}
    r = requests.get(url, headers=headers, auth=connection['auth'], verify=True)

    users = r.json()

    return users
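For reference, here is a minimal sketch of how such a function could be called. The tenant URL, the credentials, and the tenant\user format of the user name are placeholders and assumptions, not values from the project:

# minimal usage sketch (hypothetical tenant URL and credentials)
from requests.auth import HTTPBasicAuth

connection = {
    'url': 'https://<your-di-tenant>',                        # assumption: base URL of your DI tenant
    'auth': HTTPBasicAuth('mytenant\\myuser', 'mypassword')   # assumption: basic auth as tenant\user
}

users = get_users(connection)
# the response shape of this unofficial endpoint is not documented; only JSON is assumed here
print(users)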

 

Get Runtime Graphs


This operator retrieves details of executed pipelines, such as graph name, status, and substitutions. To do so, the API endpoint ‘/v1/runtime/graphsquery’ is used, and the response is filtered to obtain only the details of ‘completed’ graphs.

Check out the filter applied in this code snippet:
# get graphs of a user
def get_graphs(connection, user):
    restapi = "/app/pipeline-modeler/service/v1/runtime/graphsquery"
    url = connection['url'] + restapi
    headers = {'x-requested-with': 'fetch', 'x-datahub-user': user}
    payload = {"filter": ["equal", "status", "completed"]}
    r = requests.post(url, headers=headers, auth=connection['auth'], verify=True, data=json.dumps(payload))
    ...
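To show how these two operators’ calls fit together, here is a minimal sketch that loops over the user list and collects each user’s completed graphs. Reading the user name from a 'name' field of the /auth/v2/user response is an assumption and may need to be adapted to your tenant:

# sketch: collect completed graphs for every user of the tenant
all_graphs = {}

for user in get_users(connection):
    user_name = user.get('name')   # assumption: each entry of the user list exposes a 'name' field
    if user_name:
        all_graphs[user_name] = get_graphs(connection, user_name)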

 

Get Graph Data Sources


Some built-in DI operators such as the ones below require target data sources:

  • Read File: com.sap.file.read

  • Write File: com.sap.file.write

  • Binary File Producer: com.sap.file.write.v2

  • SAP HANA Client: com.sap.hana.client2

  • Read HANA Table: com.sap.hana.readTable, (gen2) com.sap.hana.readTable.v2

  • Write HANA Table: com.sap.hana.writeTable, (gen2) com.sap.hana.writeTable.v2

  • Table Consumer: com.sap.database.table.consumer.v2, (gen2) com.sap.database.table.consumer.v3

  • Table Producer: com.sap.database.table.producer.v2, (gen2) com.sap.database.table.producer.v3

  • Structured File Consumer: com.sap.storage.consumer.v2, (gen2) com.sap.storage.consumer.v3

  • Structured File Producer: com.sap.storage.producer.v2, (gen2) com.sap.storage.producer.v3


Therefore, this custom operator was built to fetch information about the target data sources set in the above operators in a pipeline. This information contains not only the specifics of the data sources and DI operators, but also whether each data source served as an input or an output. The DI API endpoint ‘/v1/runtime/graphdescriptions/{handle}’ provides this information.
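As a reference, the graph description itself can be fetched following the same request pattern as the graphs query above; the path prefix, the use of GET, and the JSON response are assumptions based on that snippet rather than the project’s exact code:

# sketch: fetch the description of one executed graph by its runtime handle
def get_graph_description(connection, user, handle):
    restapi = "/app/pipeline-modeler/service/v1/runtime/graphdescriptions/" + handle
    url = connection['url'] + restapi
    headers = {'x-requested-with': 'fetch', 'x-datahub-user': user}
    r = requests.get(url, headers=headers, auth=connection['auth'], verify=True)
    return r.json()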

The code snippet below checks for the DI operators that require target data sources and fetches the corresponding data source paths:
# get target data sources set in operators
def get_datasources(graph_description, handle, substitutions):
    ...
        try:
            if component in ['com.sap.file.read', 'com.sap.file.write', 'com.sap.file.write.v2'] and \
                    'dynamicConnection' not in config and 'path' in config:
                if 'connectionID' not in config['connection']:
                    connection_id = 'LOCAL'
                else:
                    connection_id = config['connection']['connectionID']
                path = config['path']
                direction = 'R' if component == 'com.sap.file.read' else 'W'
            elif component in ['com.sap.hana.client2', 'com.sap.hana.readTable', 'com.sap.hana.writeTable',
                               'com.sap.hana.readTable.v2', 'com.sap.hana.writeTable.v2']:
                connection_id = config['connection']['connectionID']
                path = config['tableName']
                if component in ['com.sap.hana.readTable', 'com.sap.hana.readTable.v2']:
                    direction = 'R'
                elif component in ['com.sap.hana.writeTable.v2', 'com.sap.hana.writeTable']:
                    direction = 'W'
                else:
                    direction = 'U'
            elif component in ['com.sap.database.table.consumer.v2', 'com.sap.database.table.producer.v2',
                               'com.sap.database.table.consumer.v3', 'com.sap.database.table.producer.v3',
                               'com.sap.storage.consumer.v2', 'com.sap.storage.producer.v2',
                               'com.sap.storage.consumer.v3', 'com.sap.storage.producer.v3'] and config['service']:
                connection_id = config['serviceConnection']['connectionID']
                path = config['source']['remoteObjectReference']['qualifiedName']
                service = config['service']
                if component in ['com.sap.database.table.consumer.v2', 'com.sap.database.table.consumer.v3',
                                 'com.sap.storage.consumer.v2', 'com.sap.storage.consumer.v3']:
                    direction = 'R'
                else:
                    direction = 'W'
            else:
                continue
    ...

 

2. Get data source pairs frequently used together


Next, another custom operator was built to generate the pairs of data sources used together in each pipeline. Each pair’s occurrences were then counted to determine how frequently the data sources appear together.

For this, the following script was applied:
import pandas as pd

# get pairs of data sources frequently used together
def get_datasource_pairs(df):
    pairs = list()

    pairs_df = df.loc[:, ['handle', 'direction', 'data_source']]
    pairs_df['direction_data_source'] = '[' + pairs_df['direction'] + ']' + ' ' + pairs_df['data_source']
    pairs_df = pairs_df.drop(['direction', 'data_source'], axis=1)

    pairs_df = pairs_df.groupby(['handle'])['direction_data_source'].apply(list).reset_index(
        name='direction_data_source')
    path_list = pairs_df['direction_data_source'].to_list()

    for path in path_list:
        pairs.append(findsubsets(path, 2))

    pairs = flatten(pairs)

    if pairs:
        pairs_df = pd.DataFrame(pairs)
        pairs_df.columns = ['data_source1', 'data_source2']
        pairs_df = pairs_df.groupby(['data_source1', 'data_source2']).size().reset_index(name='count')
    else:
        pairs_df = pd.DataFrame({'data_source1': 'no pairs', 'data_source2': 'no pairs', 'count': 0}, index=[0])

    return pairs_df
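The snippet above relies on two small helpers, findsubsets and flatten, which are not shown in the post. Below is one plausible implementation based on how they are used (unordered pair generation per graph run via itertools); the actual project code may differ:

from itertools import chain, combinations

def findsubsets(items, n):
    # all unordered n-element combinations of the (deduplicated) data sources of one graph run
    return list(combinations(sorted(set(items)), n))

def flatten(list_of_lists):
    # flatten the list of per-graph pair lists into one flat list of pairs
    return list(chain.from_iterable(list_of_lists))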

 

3. Store the datasets in DWC: Table Producer


Finally, the two resulting datasets are stored in DWC with Table Producer operators. If your DWC is connected to SAC, this makes the transition to data visualization straightforward.


 

Consume DWC Datasets in SAC


Now views can be created from the datasets that the DI pipeline stored in DWC. To make a view consumable in SAC, make sure that its semantic usage is ‘Analytical Dataset’ and that ‘Exposure for Consumption’ is switched on. Once the views for each dataset are saved and deployed, a story can be built in SAC.


 

SAC Story Example


Visualization helps you understand complicated data more easily. Here, I made an SAC story with the two datasets stored in DWC.

Below is an example of my story; of course, how you want to graphically present your datasets is completely up to you.





 

Conclusion


This blog post showed how to retrieve pipeline usage data from SAP Data Intelligence and visualize it in an SAP Analytics Cloud story. I hope it gave you an idea of how to extract and analyze pipeline usage details. I would also like to underscore the flexibility of working with DI: you can easily customize the graph information you retrieve and build your own usage statistics. The entire script of my project is in the GitHub repository as a reference.

On a personal note, I feel grateful to have had hands-on experience with SAP products and to have taken part in this exciting project. Special thanks go to christian.sengstock and thorsten.hapke who generously provided support and feedback during the course of the project.

Thank you for reading, and please feel free to share your thoughts in a comment!