Introduction
A data management platform like SAP Data Intelligence (DI) needs powerful metadata management. Without some kind of semantics, data separated from its application or data producer is close to useless. Table and column names alone are mostly not sufficient, in particular if they are cryptic abbreviations. SAP Data Intelligence puts a lot of effort into extending its Catalog (= Metadata Explorer) to optimise the value of the managed data. For clarification: when using the term "catalog" I am referring to the Catalog application of the Metadata Explorer. The Rulebook, Data Preparation and Glossary are extremely important complementary products that round out the catalog by adding quality measurements, data manipulations (quality improvements or data normalisation) and extended information. The glossary could also be seen as an integral part of the catalog because it allows you to add structured or template-based text fields to datasets and data attributes (e.g. columns). Because you can also link glossaries to Rules, Rulebooks and other glossary terms, it is kept separate.
The main focus of this blog, after a short introduction to the core catalog of SAP Data Intelligence, is the openness of the catalog. Basically, how you can
- import catalog hierarchies,
- tag DI-managed datasets with hierarchy items in sync with a 3rd-party catalog.
We have learnt that it is crucial for a data management application to be open, because there might not be one central catalog in an enterprise. Only catalogs that are open will succeed in the end. This is what we want to provide to our customers and what I hope you will get out of this blog.
Short Story of DI Catalog
The catalog is a subset of the datasets that users have access to via the "Browse Connection" app. You have to publish the datasets in order to manage them as catalog items. This assumes that you might have much more data than you can actually manage in an orderly way, and the publication serves as a first filter when looking for datasets.
After the publication you can already search the data by
- Data sources
- Dataset types
- Ratings
- Text search on dataset and attribute names, comment textfields etc.
You can now tag the datasets manually using tag hierarchies. Because creating hierarchies is an administration task, it has recently been moved into a separate app under "Metadata/Administration/Manage Hierarchies".
You can create several hierarchies that each establish a different order on your datasets, e.g. Licences, Language, DataQuality, ...
There is also an auto-classification that assigns tags on column/data-field level. It can be invoked by running a Profiling after the Publication. The provided hierarchy "ContentType" contains sub-hierarchies such as DataTypes, Location, Logging and Personal_Information with sub-tags. For this task heuristics are used that are based on the column names, the data types and the actual content. In particular the Personal_Information tags can be very helpful to signal that a dataset needs special treatment. Be aware that these tags are only good proposals.
Exchange Catalog Hierarchies
Although importing and exporting catalog hierarchies might not be a daily task, it is nonetheless extremely helpful and even necessary when you want to synchronise catalogs or use specialised catalog services that tag data based on algorithms, like our SAP partner BigData.
Therefore we have published a Metadata Management API on the SAP API Business Hub that enables you to build an interface to the SAP Data Intelligence Catalog.
The RestAPIs that we need for exchanging hierarchies are:
Export
- GET /catalog/tagHierarchies/ - exports all hierarchies (or a specified one) with their attributes, including the hierarchy id
- GET /catalog/tagHierarchies/{hierarchy_id} - exports the full hierarchy using the hierarchy id provided by the previous RestAPI call
Import
- POST /catalog/tagHierarchies - imports the hierarchies without tags
- POST /catalog/tagHierarchies/{hierarchy_id}/tags - adds a tag to a hierarchy, given the hierarchy id and the id of the existing parent tag
The full HTTP URL: https://<di system>/app/datahub-app-metadata/api/v1/<RestAPI resources>
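As a quick illustration, such an export call could look as follows with plain Python requests (a minimal sketch: the host and the credentials are placeholders, and basic authentication with "<tenant>\<user>" is assumed, in line with the operator code further below):
import requests

HOST = 'https://vsystem.ingress.xxx.shoot.live.k8s-hana.ondemand.com'   # placeholder
API_PATH = '/app/datahub-app-metadata/api/v1'
AUTH = ('default\\user', 'pwd123')   # "<tenant>\<user>" and password, placeholders

# Export all hierarchies with their attributes, including the hierarchy ids
resp = requests.get(HOST + API_PATH + '/catalog/tagHierarchies', auth=AUTH)
resp.raise_for_status()
print(resp.json())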
Whereas the export of a hierarchy is quite straightforward, the import needs some more preparation. Because there is no general "import new hierarchy" call, you have to add each tag separately, ensuring that the hierarchy and the parent tags already exist. Of course there are numerous ways to accomplish this task. My solution is to first download the hierarchies, if there are any, then check which tags do not yet exist and upload only the new ones. You just have to make sure that the more "parental" tags are added first.
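With the flat map format used below (composite keys like "hierarchy/tag-level1/tag-level2"), the "parents first" ordering boils down to sorting the missing tags by the depth of their path. A minimal sketch of this diff-and-order step (the helper name is my own, not part of diadmin):
def new_tags_in_upload_order(existing, new):
    # existing/new: maps of "hierarchy/level1/level2/..." -> tag details
    missing = {path: tag for path, tag in new.items() if path not in existing}
    # sort by path depth so a parent tag is always created before its children
    return sorted(missing.items(), key=lambda item: item[0].count('/'))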
Test Export Hierarchy
You can test the export by first installing my diadmin (github) package:
pip install diadmin>=0.0.51
Then create a config.yaml for the credentials:
PWD: pwd123
TENANT: default
URL: https://vsystem.ingress.xxx.shoot.live.k8s-hana.ondemand.com
USER: user
and then run the script
dicatalog hierarchies * --config config --download
This downloads all hierarchies to the sub-folder catalogs of your current working directory.
I have deliberately not adopted the tree structure and the details of the RestAPI response. For synchronising the catalog, a simple map is easier to use from a programming perspective and easier to edit as well.
The structure I use is a map from a composite key (hierarchy-name/tag-level1/tag-level2/...) to the details of the tag. In this structure there is no difference between the "root" tag and the hierarchy, contrary to the SAP Data Intelligence hierarchy.
"Availability": {
"name": "Availability",
"description": "Availability of dataset (https://www.dcat-ap.de/def/plannedAvailability/1_0.html)",
"path": "Availability",
"tag_id": "",
"parent_path": "",
"hierarchy_id": "D68C0CE0066D79CF1700BA7204BDA9BD",
"hierarchy_name": "Availability"
},
"Availability/available": {
"name": "available",
"description": "Mid-term available for at least a couple of years (http://dcat-ap.de/def/plannedAvailability/available)",
"path": "Availability/available",
"tag_id": "D98C0CE0066D79CF1700BA7204BDA9BD",
"parent_path": "",
"hierarchy_id": "D68C0CE0066D79CF1700BA7204BDA9BD",
"hierarchy_name": "Availability"
},
The "tag_id" and the "hierarchy_id" gives you the link to the hierarchy/tags within SAP Data Intelligence. All the omitted information, you would get from the GET RestAPI call, is used for DI internal processing.
Test Import Hierarchy
The import takes the same data format as the export delivers. For testing purposes you can cut and paste a hierarchy from the exported file "hierarchies.json", put it into a new file named after the new hierarchy, e.g. "License.json", and change the values accordingly. Of course you can omit the tag_id and the hierarchy_id.
dicatalog hierarchies License --config config --upload
You do not have to add the suffix, by the way. The configuration file is expected to be a 'yaml' file and the hierarchies a 'json' file.
Import/Export Pipeline
Of course you can also use a SAP Data Intelligence pipeline for importing/exporting. For this you first need to create a new Docker image with the same package used in the command-line version:
ARG DEPENDENCY_BASE_IMAGE_VERSION=2107.0.1
FROM §/com.sap.datahub.linuxx86_64/sles:${DEPENDENCY_BASE_IMAGE_VERSION}
RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow
RUN python3 -m pip install diadmin
and create 2 custom operators.
Export:
from urllib.parse import urljoin
from diadmin.metadata_api import catalog

def gen():
    host = api.config.http_connection['connectionProperties']['host']
    pwd = api.config.http_connection['connectionProperties']['password']
    user = api.config.http_connection['connectionProperties']['user']
    path = api.config.http_connection['connectionProperties']['path']
    tenant = api.config.tenant
    conn = {'url': urljoin(host, path), 'auth': (tenant + '\\' + user, pwd)}
    api.logger.info(f"Download hierarchies: {api.config.hierarchies}")
    hierarchies = catalog.download_hierarchies(conn, hierarchy_name=api.config.hierarchies)
    # attribute used in the next write-file operator of parameterized file
    hierarchies_name = 'hierarchies' if not api.config.hierarchies else api.config.hierarchies
    msg = api.Message(attributes={'hierarchies': hierarchies_name}, body=hierarchies)
    api.send('output', msg)  # dev_data type: message

api.add_generator(gen)
Import:
import copy
import io
import json
from urllib.parse import urljoin
from diadmin.metadata_api import catalog

def on_input(msg):
    host = api.config.http_connection['connectionProperties']['host']
    pwd = api.config.http_connection['connectionProperties']['password']
    user = api.config.http_connection['connectionProperties']['user']
    path = api.config.http_connection['connectionProperties']['path']
    tenant = api.config.tenant
    conn = {'url': urljoin(host, path), 'auth': (tenant + '\\' + user, pwd)}
    new_hierarchies = json.load(io.BytesIO(msg.body))
    hierarchies = catalog.upload_hierarchies(conn, new_hierarchies, hierarchies=None)
    att = copy.deepcopy(msg.attributes)
    msg_success = api.Message(attributes=att, body=hierarchies)
    api.send('success', msg_success)  # dev_data type: message

api.set_port_callback('input', on_input)
For security reasons you no longer have access to your login-user credentials via environment variables. Therefore I chose to create an HTTP connection with the DI URL and the user credentials, and to read the data, including the tenant, from the configuration of the operator.
If you want to synchronise your catalog hierarchies with the SAP Data Intelligence hierarchy, you just need to write a format-converter script and put it as a separate custom operator before the export/import operator, or adjust these operators directly (see the sketch below). Then you can call the pipeline externally via the RestAPI published on the SAP API Business Hub. You can download my example operators from the diadmin github.
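Such a converter essentially maps the external catalog's hierarchy format to the flat map shown above. A minimal sketch, assuming a hypothetical external format of nested dictionaries with a description and child tags (both the external format and the helper name are made up for illustration; tag_id and hierarchy_id can be omitted for the upload):
def to_di_hierarchy_map(name, node, parent_path=''):
    # node: hypothetical external format, e.g. {'description': '...', 'children': {'available': {...}}}
    path = f"{parent_path}/{name}" if parent_path else name
    di_map = {path: {'name': name,
                     'description': node.get('description', ''),
                     'path': path,
                     'parent_path': parent_path,
                     'hierarchy_name': path.split('/')[0]}}
    for child_name, child in node.get('children', {}).items():
        di_map.update(to_di_hierarchy_map(child_name, child, path))
    return di_map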
Exchange Catalog Tags of Datasets
In case you have data sources that are managed by different catalogs, you might want to synchronise the tags along with the catalog hierarchies. The main challenge is that you need to map the file organisation of the external catalog to the one SAP Data Intelligence uses. Fortunately, SAP Data Intelligence mostly adopts the file organisation of the source system.
This means that for object-store data sources the folder structure is taken over, with a small limitation: the access to a dataset is a combination of the connection defined in the Connection Management and the container (= folder) hierarchy. The connection (referred to by a connectionId) can also have a root path, so the actual path is a concatenation of the connection root path and the folder.
For databases it is even simpler. The connection management contains only the database connection details and the user credentials; for mapping the tables you just need the schema and the table name.
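To make the mapping concrete, here is a small illustration of how such dataset keys could be composed (the connection names and paths are made-up examples; for the database case the schema/table form is my assumption):
import posixpath

# Object store: <connectionId>/<folder path relative to the connection root path>/<file>
connection_id = 'S3_Catalog'              # connection defined in Connection Management
folder = 'costcenter'
file_name = 'MANILA_CSKA.csv'
dataset_key = posixpath.join(connection_id, folder, file_name)
# -> 'S3_Catalog/costcenter/MANILA_CSKA.csv' (the key format used in the examples below)

# Database: only schema and table name are needed to identify the table
db_dataset_key = posixpath.join('HANA_Catalog', 'MY_SCHEMA', 'COSTCENTER')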
If you want to extend the automatic integration of data sources, you can download the connection definitions with the RestAPI resource /catalog/connections or by using
dicatalog connections * --config config --download
After this preparation you just need to download the dataset tags and the dataset attribute tags. The attributes of a dataset are mostly the columns of the dataset and are referred to just by their name.
For the download there is only one RestAPI: here the connectionId and the qualified name (= path + dataset name) are needed. Note that the path separators need to be quoted, otherwise the web server interprets the path as a resource path and returns the status "404 - Dataset not found". Something I struggled with for a while.
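The quoting can be done with the Python standard library, for example:
from urllib.parse import quote

qualified_name = 'costcenter/MANILA_CSKA.csv'
# quote the path separators as well, otherwise the web server resolves them as resource levels
quoted_name = quote(qualified_name, safe='')
# -> 'costcenter%2FMANILA_CSKA.csv'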
For the upload you need two RestAPIs:
/catalog/connections/{connection_id}/datasets/{qualified_name}/tagHierarchies/{hierarchy_id}/tags
/catalog/connections/{connection_id}/datasets/{qualified_name}/attributes/{attributes_qn}/tagHierarchies/{hierarchy_id}/tags
Again you need as parameters the connectionId and the qualified_name of the dataset, and in addition the hierarchyId to which the tags carried in the HTTP body belong. For a dataset attribute tag you also need the qualified name of the attribute, which is mostly the column name (at least that was always the case in my test use cases).
Again there is an asymmetry between the details you get from the GET call and what is needed for the POST, and again I decided on the simplest format for my integration:
A key of the dataset maps to a list of tags and a dictionary of attribute tags, e.g.
{
    "S3_Catalog/costcenter/MANILA_CSKA.csv": {
        "dataset_tags": [
            "Availability/available"
        ],
        "attribute_tags": {
            "ERSDA": "ContentType/DATA TYPES/DATE",
            "FUNC_AREA": "Location/Europe/Spain",
            "MANDT": "ContentType/DATA TYPES/NUMERIC"
        }
    },
    ...
}
The key is a composite of the connectionId and the path of the dataset. For the tags, the "root" node of the tag path identifies the hierarchy and thereby the hierarchyId.
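In other words, splitting the keys is all that is needed to get back to the parameters of the RestAPI calls; a small sketch:
dataset_key = 'S3_Catalog/costcenter/MANILA_CSKA.csv'
# first path segment = connectionId, the rest = qualified name of the dataset
connection_id, _, qualified_name = dataset_key.partition('/')
# -> 'S3_Catalog' and 'costcenter/MANILA_CSKA.csv'

# first path segment of a tag path = hierarchy name, used to look up the hierarchyId
tag_path = 'Availability/available'
hierarchy_name = tag_path.split('/')[0]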
For testing you can first download the tags of a dataset with the script
dicatalog tags S3_Catalog/costcenter/SEATTLE_CSKA.csv --config config --download
As an argument you need to provide the connectionId and the path to the dataset. The tags are appended to the file catalogs/dataset_tags.json. From there you can cut and paste an item, store it in a separate 'json' file and add new tags to the datasets and attributes. Then run the script
dicatalog tags newfile.json --config config --upload
If you want to speed up the upload, you can first download the hierarchies (dicatalog hierarchies * -d) and use the option --synced. Without this option, the hierarchies are downloaded first for every upload of the tags in order to get the needed hierarchy IDs. With the wildcard instead of a specified tag file, the dataset_tags.json is uploaded.
Again you can also implement the tag-synchronisation as a pipeline by adding this code to a custom Python operator.
Export dataset and attribute tags:
from diadmin.metadata_api import catalog, container
...
tags = dict()
# Get all the information of the dataset
dataset_attributes = container.get_dataset_tags(conn,dataset_path)
# Reduce the information to the bare basics
tags[dataset_path] = container.reduce_dataset_attributes(dataset_attributes)
...
Import dataset and attribute tags:
from diadmin.metadata_api import catalog, container
...
# loop over the datasets
for ds, dsv in dataset_tags.items():
    # loop over the tags (dataset and attributes)
    for dst in dsv['dataset_tags']:
        # dataset tags
        hierarchy_id = hierarchies[dst]['hierarchy_id']
        tag_id = hierarchies[dst]['tag_id']
        container.add_dataset_tag(conn, ds, hierarchy_id, tag_id)
    # dataset attributes
    for attribute, ast in dsv['attribute_tags'].items():
        hierarchy_id = hierarchies[ast]['hierarchy_id']
        tag_id = hierarchies[ast]['tag_id']
        container.add_dataset_attribute_tag(conn, ds, hierarchy_id, tag_id, attribute)
...
For the latter you can see that the hierarchyId and the tagId are needed; therefore you have to provide them to the procedure.
Conclusion
This blog should give you a head start for a project integrating an external catalog with the SAP Data Intelligence catalog. As you have seen, it is based on simple structures, and therefore such a project should be feasible within a reasonable timeframe. You just need to exchange the label/tag hierarchies and the labels/tags of a dataset with a few RestAPI calls.
Of course it would have been nice to integrate the SAP Data Intelligence catalog with an external catalog like Apache Atlas, but the setup of such a catalog is not done within a couple of hours. Therefore I decided to rather wait until a project comes along.
Another topic I have not covered in this blog is extending the auto-tagging of SAP Data Intelligence. If you have an idea what kind of auto-tagging you would like to see, and maybe an ML model that we can use for it instead of heuristic rules, please let me know. My assumption is that it should not be a big deal, because everything we need is at hand: the metadata of the dataset, sample data and the API.