Former Member

Creating a Data Cleansing Solution for Multiple Sources


Data Cleansing Advisor can rapidly generate cleansing and match rules for a single source that contains party data. The data cleansing solution that is ultimately published to Data Services Workbench can easily be extended to support data cleansing initiatives that use multiple sources of data. Previous sections of this article described how to extend the data cleansing solution to add best record functionality, configure the solution for Match Review, and publish to Data Services Designer for full customization of the solution.

There are two options for using multiple sources with Data Cleansing Advisor. The first option is to merge the two sources of data (using Workbench or Designer) so that the merged source has a harmonized schema. This source, which now contains data from multiple sources, can then be added as a connection in Information Steward. This option has an advantage over the second: the Data Steward can review and fine-tune the results. The second option, and the focus of this section, is to add the multiple sources of data individually to a dataflow that uses a data cleansing solution that has already been reviewed and published.

This section draws on everything described so far about the extensibility of the data cleansing solution to create a dataflow in Data Services Workbench or Designer that cleanses and de-duplicates multiple sources of data and stages the results for Match Review.

Adding Multiple Sources of Data to the Solution

Because Data Cleansing Advisor supports only a single source of data, the dataflow that is created within Workbench or Designer needs to be extended slightly to support multiple sources of data with a single data cleansing solution. The input sources (ONLINE_PROSPECTS and ONLINE_PROSPECTS_MULT_PAGES) and the rest of the transforms (below in grey) need to be added to the dataflow.

When multiple sources of data feed a single transform, a Merge transform is needed to combine them, and the schema of each source must be harmonized before the merge. In this example, the sources have the same number of columns and the same column names, but the data types differ. Query transforms (“PrepNA” and “PrepMult”) are used to harmonize the input schemas so that they are standardized before being input to the data cleansing solution (“Prospects_Global”). This is also a good place to add a column called “SOURCE_SYSTEM” to identify the system that each record originated from. This column is also used in Match Review. After preparing the data for the data cleansing solution, the Merge transform should have the following:
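As a rough illustration outside of Data Services, the harmonize-then-merge pattern described above can be sketched in plain Python. This is only a sketch under stated assumptions: the column names (CUSTOMER_ID, NAME, CREATED), the sample rows, and the source system codes "NA" and "MULT" are hypothetical, and the `prep` and `merge` functions merely stand in for the Query and Merge transforms rather than reproducing their behavior exactly.

```python
from datetime import date

# Hypothetical rows: same column names in both sources, different data types.
online_prospects = [
    {"CUSTOMER_ID": 101, "NAME": "Ann Lee", "CREATED": date(2014, 3, 1)},
]
online_prospects_mult_pages = [
    {"CUSTOMER_ID": "205", "NAME": "Raj Patel ", "CREATED": "2014-04-15"},
]


def prep(rows, source_system):
    """Stand-in for a Query transform: cast every column to one agreed
    type and tag each row with the system it came from."""
    prepared = []
    for row in rows:
        created = row["CREATED"]
        if isinstance(created, str):
            created = date.fromisoformat(created)
        prepared.append({
            "CUSTOMER_ID": str(row["CUSTOMER_ID"]),
            "NAME": str(row["NAME"]).strip(),
            "CREATED": created,
            "SOURCE_SYSTEM": source_system,  # hypothetical code values
        })
    return prepared


def merge(*inputs):
    """Stand-in for a Merge transform: union the inputs, rejecting any row
    whose column names or types differ from the first row seen."""
    merged = []
    expected = None
    for rows in inputs:
        for row in rows:
            schema = {name: type(value) for name, value in row.items()}
            if expected is None:
                expected = schema
            elif schema != expected:
                raise ValueError(f"schema mismatch: {schema} != {expected}")
            merged.append(row)
    return merged


merged = merge(
    prep(online_prospects, "NA"),
    prep(online_prospects_mult_pages, "MULT"),
)
```

Calling `merge` on the two raw inputs raises a `ValueError` because their column types differ, which is the situation the two preparation steps are there to avoid.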

Data Cleansing Solution Input Mapping

Data Cleansing Advisor uses best practices and content types to automatically map the data columns from the input source to the data cleansing solution. The columns that are used within the data cleansing solution cannot be modified. This means that if your sources of data have different schemas, you need to ensure that the harmonized schema matches the one that was used to create the data cleansing solution. In some situations, you may need to get creative. Below is an image of the data cleansing solution’s input schemas and the required fields that need to be mapped.
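Because the solution's input columns are fixed, it can help to check a harmonized row against the expected schema before wiring it into the dataflow. The following Python sketch shows one way to express that check outside of Data Services. The required column names and types in `REQUIRED_INPUT` are hypothetical placeholders, not the actual input schema of the published solution.

```python
# Hypothetical required input schema; replace with the columns that the
# published data cleansing solution actually expects.
REQUIRED_INPUT = {
    "NAME": str,
    "ADDRESS": str,
    "POSTCODE": str,
}


def schema_problems(row, required=REQUIRED_INPUT):
    """Return a list of human-readable problems for one harmonized row.

    Columns not listed in `required` (for example SOURCE_SYSTEM) are
    ignored by this check.
    """
    problems = []
    for column, expected_type in required.items():
        if column not in row:
            problems.append(f"missing column {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(
                f"{column}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


good_row = {
    "NAME": "Ann Lee",
    "ADDRESS": "1 Main St",
    "POSTCODE": "90210",
    "SOURCE_SYSTEM": "NA",
}
bad_row = {"NAME": "Raj Patel", "POSTCODE": 10001}

good_result = schema_problems(good_row)
bad_result = schema_problems(bad_row)
```

An empty list means the row lines up with the assumed schema; anything else points at the column that still needs a cast or a mapping in the preparation step.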

Further Customization

After the input schemas have been harmonized and the data mapped to the data cleansing solution, the dataflow is ready to be executed to generate results. At this point, the dataflow can be published to Designer to add best record functionality, configured for Match Review, or extended with any other functionality you need. Below is an image that shows the last half of the dataflow, which configures a Match Review task using output from multiple sources of data to stage suspect records and populate a table with auto matches.
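The idea of splitting match output into suspects and auto matches can be illustrated with a small Python sketch. It is a generic routing example only, not the Case transform configuration from the Match Review section: the MATCH_SCORE column, the ID values, and both thresholds are assumptions made for illustration.

```python
def route(records, auto_threshold=95, suspect_threshold=80):
    """Split matched records into auto matches and suspects.

    Both thresholds and the MATCH_SCORE column are hypothetical; records
    scoring below the suspect threshold are left out of both lists.
    """
    auto_matches, suspects = [], []
    for record in records:
        score = record["MATCH_SCORE"]
        if score >= auto_threshold:
            auto_matches.append(record)
        elif score >= suspect_threshold:
            suspects.append(record)
    return auto_matches, suspects


# Hypothetical match output that combines rows from both sources.
records = [
    {"ID": "101", "SOURCE_SYSTEM": "NA", "MATCH_SCORE": 98},
    {"ID": "205", "SOURCE_SYSTEM": "MULT", "MATCH_SCORE": 85},
    {"ID": "310", "SOURCE_SYSTEM": "MULT", "MATCH_SCORE": 40},
]

auto_matches, suspects = route(records)
```

Carrying SOURCE_SYSTEM through to both outputs keeps the originating system visible to whoever reviews the suspect records.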

The function and content of each Query and Case transform above are the same as described in the section called “Match Review with DCA”. The configured dataflows for Workbench and Designer, respectively, are the following:

Data Cleansing Advisor Best Practices Blog Series

Determining Duplicates and a Matching Strategy

Publishing to Data Services Designer

Configuring Best Record Using Data Services Designer

Match Review with Data Cleansing Advisor (DCA)

Data Quality Assessment for Party Data

Using Data Cleansing Advisor (DCA) to Estimate Match Review Tasks

Creating a Data Cleansing Solution for Multiple Sources