Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
cancel
Showing results for 
Search instead for 
Did you mean: 
Minhee
Advisor
Advisor
2,436

Introduction


It is important to analyze the usage of pipelines run in SAP Data Intelligence (DI) Modeler. In this regard, DI Monitoring already helps you to access some of the relevant information such as graph status or runtime analysis. However, not all graph statistics are there. Have you been wondering which pipelines were run most frequently, who is using which data sources, or even, which data sources were frequently used together? 

In this blog post, you will see how to extract and store such information with DI. This information will then be stored in SAP Data Warehouse Cloud (DWC). To top things off, you can visualize our data with the help of SAP Analytics Cloud (SAC), making it easier to get an overview of the pipeline details.

This is an internship project and I’d like to share what I have learned through the journey, specifically, how to use data pipelines, write custom Generation 2 operators, connect DI to DWC, and create SAC stories.

After reading the post, the following questions will be answered:

  • Graph statistics including runtime status, user distribution, and time series analysis

  • Which graphs are run the most

  • Which data sources are used the most

  • Who is using which data sources

  • Which data sources are used in each graph

  • Which data sources are frequently used together


 

Challenges


One of the biggest challenges is to retrieve pipelines of all users and their graph details such as graph data sources. To do so, I suggest that we use DI REST API. Especially, The Pipeline Engine API allows us to access details of data processing pipelines and facilitates the automation of operational tasks such as creating, starting, and monitoring pipelines.

For this project, I created custom operators (Gen2 & Python-based) that gather graph information from the REST API and store the data in DWC via a pipeline. After that, the resulting datasets were smoothly visualized using SAC.

 

Prerequisites


As we create DI custom gen2 operators, basic knowledge of gen2 operators and how to create a DI custom operator is necessary. Moreover, your DI should be connected to DWC to store the outputs of a pipeline. Lastly, for visualization, DWC and SAC should be connected.

If you are not yet familiar with the above, here are the resources that can help:

  • Introduction of DI Generation 2 Operators: Blog post

  • How to create a DI custom operator: Blog post

  • How to connect DI to DWC: Tutorial

  • How to connect DWC to SAC: Tutorial


 

Extract and Store Pipeline Details in DI


Below is the pipeline created in DI.


Let’s break down the pipeline into 3 parts.

 

1. Get pipelines details of all users using REST APIs


First, a custom operator will fetch the user list of a DI tenant and their pipeline specifics including data sources applied in each pipeline.

Here are the custom operators created using REST API:

Get Users


In order to retrieve graphs for all users, not just my own, a complete user list is necessary. There are multiple ways to get the user list, such as using a vctl command or checking DI System Management. However, for simplicity, this operator was built with DI API endpoint ‘/auth/v2/user’. Although it is not an officially published endpoint, it serves our purpose in the project when you do not already have a user list at hand.

Here is a snippet of my script:
# get all users
def get_users(connection):
restapi = "/auth/v2/user"
url = connection['url'] + restapi
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get(url, headers=headers, auth=connection['auth'], verify=True)

users = r.json()

return users

 

Get Runtime Graphs


This operator retrieves executed pipeline details such as graph name, status, and substitutions. To do so, API endpoint ‘/v1/runtime/graphsquery’ was utilized. The response was filtered in order to only obtain the details of ‘completed’ graphs.

Check out the filter applied to this code snippet:
# get graphs of a user
def get_graphs(connection, user):
restapi = "/app/pipeline-modeler/service/v1/runtime/graphsquery"
url = connection['url'] + restapi
headers = {'x-requested-with': 'fetch', 'x-datahub-user': user}
payload = {"filter": ["equal", "status", "completed"]}
r = requests.post(url, headers=headers, auth=connection['auth'], verify=True, data=json.dumps(payload))
...

 

Get Graph Data sources


Some built-in DI operators such as the ones below require target data sources:

  • Read File: com.sap.file.read

  • Write File: com.sap.file.write

  • Binary File Producer: com.sap.file.write.v2

  • SAP HANA Client: com.sap.hana.client2

  • Read HANA Table: com.sap.hana.readTable, (gen2) com.sap.hana.readTable.v2

  • Write HANA Table: com.sap.hana.writeTable, (gen2) com.sap.hana.writeTable.v2

  • Table Consumer: com.sap.database.table.consumer.v2, (gen2) com.sap.database.table.consumer.v3

  • Table Producer: com.sap.database.table.producer.v2, (gen2) com.sap.database.table.producer.v3

  • Structured File Consumer: com.sap.storage.consumer.v2, (gen2) com.sap.storage.consumer.v3

  • Structured File Producer: com.sap.storage.producer.v2, (gen2) com.sap.storage.producer.v3


Therefore, this custom operator was built to fetch information about target data sources set in the above operators in a pipeline. This information contains not only the specifics on data sources and DI operators, but also whether the data sources were served as an input source or an output source. DI API endpoint ‘/v1/runtime/graphdescriptions/{handle}’ provided such information.

The code snippet below shows the list of DI operators that require target data sources and fetches the data source path:
# get target data sources set in operators
def get_datasources(graph_description, handle, substitutions):
...
try:
if component in ['com.sap.file.read', 'com.sap.file.write', 'com.sap.file.write.v2'] and \
'dynamicConnection' not in config and 'path' in config:
if 'connectionID' not in config['connection']:
connection_id = 'LOCAL'
else:
connection_id = config['connection']
path = config['path']
direction = 'R' if component == 'com.sap.file.read' else "W"
elif component in ['com.sap.hana.client2', 'com.sap.hana.readTable', 'com.sap.hana.writeTable',
'com.sap.hana.readTable.v2', 'com.sap.hana.writeTable.v2']:
connection_id = config['connection']['connectionID']
path = config['tableName']
if component in ['com.sap.hana.readTable', 'com.sap.hana.readTable.v2']:
direction = 'R'
elif component in ['com.sap.hana.writeTable.v2', 'com.sap.hana.writeTable']:
direction = 'W'
else:
direction = 'U'
elif component in ['com.sap.database.table.consumer.v2', 'com.sap.database.table.producer.v2',
'com.sap.database.table.consumer.v3', 'com.sap.database.table.producer.v3',
'com.sap.storage.consumer.v2', 'com.sap.storage.producer.v2',
'com.sap.storage.consumer.v3', 'com.sap.storage.producer.v3'] and config['service']:
connection_id = config['serviceConnection']['connectionID']
path = config['source']['remoteObjectReference']['qualifiedName']
service = config['service']
if component in ['com.sap.database.table.consumer.v2', 'com.sap.database.table.consumer.v3',
'com.sap.storage.consumer.v2', 'com.sap.storage.consumer.v3']:
direction = 'R'
else:
direction = 'W'
else:
continue
...

 

2. Get data source pairs frequently used together


Next, another custom operator was built to generate pairs of data sources that were frequently used together in each pipeline. Furthermore, each pair’s appearance was counted to check their frequency.

For this, the following script was applied:
# get pairs of data sources frequently used together
def get_datasource_pairs(df):
pairs = list()

pairs_df = df.loc[:, ['handle', 'direction', 'data_source']]
pairs_df['direction_data_source'] = '[' + pairs_df['direction'] + ']' + ' ' + pairs_df['data_source']
pairs_df = pairs_df.drop(['direction', 'data_source'], axis=1)

pairs_df = pairs_df.groupby(['handle'])['direction_data_source'].apply(list).reset_index(
name='direction_data_source')
path_list = pairs_df['direction_data_source'].to_list()

for path in path_list:
pairs.append(findsubsets(path, 2))

pairs = flatten(pairs)

if pairs:
pairs_df = pd.DataFrame(pairs)
pairs_df.columns = ['data_source1', 'data_source2']
pairs_df = pairs_df.groupby(['data_source1', 'data_source2']).size().reset_index(name='count')
else:
pairs_df = pd.DataFrame({'data_source1':'no pairs', 'data_source2':'no pairs', 'count':0}, index=[0])

return pairs_df

 

3. Store the datasets in DWC: Table Producer


Finally, the two resulting datasets are stored in DWC with Table Producers. This step makes our transition to data visualization handy if your DWC is connected to SAC.


 

Consume DWC Dataset in SAC 


Now views can be created from the datasets stored in DWC from the DI pipeline. To make views consumable in SAC, make sure that its semantic usage is ‘Analytical Dataset’ and ‘Exposure for Consumption’ is on. Once views from each dataset are saved and deployed, a story can be built in SAC.


 

SAC Story Example


Visualization helps to understand complicated data more easily. Here I made an SAC story with the two datasets stored in DWC.

Below is the example of my story, but of course, how you want to graphically present your datasets is completely up to you.





 

Conclusion


This blog post suggested how to retrieve pipeline usage data in SAP Data Intelligence and visualize them in an SAP Analytics Cloud story. I hope it gave you an idea of how to extract and analyze pipeline usage details. Also, I would like to underscore the flexibility of working with DI. You can easily customize the graph information you would like to retrieve and make your own usage statistics. The entire script of my project is in the GitHub repository as a reference.

On a personal note, I feel grateful to have had hands-on experience with SAP products and to have taken part in this exciting project. Special thanks go to christian.sengstock and thorsten.hapke who generously provided support and feedback during the course of the project.

Thank you for reading and please free to share your thoughts in a comment!
3 Comments