cancel
Showing results for 
Search instead for 
Did you mean: 

Algorithm behind duplicate check rule

thomas_rischar
Explorer
5,007

The Business Partner Duplicate Check is used to find out potential business partner duplicates existing in the system at the time of creation of a new business partner. It depends on HANA capabilities for performing a fuzzy search on the database with the given data and for calculating the overall rank for the records that have some match. HANA uses the Levenshtein algorithm for calculating the similarity of terms. Combining the capabilities of HANA’s error tolerant search and ranking with FSI’s flexibility to create dynamic SQL ‘Where’ clauses, business partner duplicate check finds duplicates reliably and also efficiently.

Levenshtein algorithm:
Named after its developer Vladimir Levenshtein, 'Levenshtein Algorithm' employs the calculation of the Levenshtein Distance i.e. the distance between two given strings or sequences by finding out the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.

The Duplicate Check considers these “nodes” (= collection of attributes) when calculating the similarity score: name, postal address, e-mail, fax, phone, birthdate (only for persons), additional identifiers. I. e. it calculates the single similarity of the above mentioned nodes compared to an existing business partner on the database and finally aggregates the single similarities to a total similarity. In this step the nodes contribute to the total similarity with different weights which are configured in the Business Configuration Finetuning “Duplicate Check Weighting For Business Partners”.

A short example shall illustrate this:
Let’s assume you are creating a new contact with these data
Name = Bond
E-mail = j.bond@mi6.com

Let’s say there is already this similar contact existing on the database:
Name = Band
E-mail = j.bond@mi6.com

The similarity of the names is 1 – ¼ = 0.75 = 75%.
The similarity of the e-mail addresses is 1 = 100%.

Now the total similarity depends on the weights for name and e-mail configured in “Duplicate Check Weighting For Business Partners”. Only the weights for name and e-mail are relevant, as the similarity of the other nodes is 0.

Let’s assume we have these weights:
Name weight = 30
E-mail weight = 70

To combine the single similarities the Duplicate Check uses the so called OR-formula:

with
R = rank or fuzzy similarity of a node

w = node weight

i = index value of the node

n = number of individual nodes


For this example we get this total similarity:
SIM = SQRT( (30² * 0.75² + 70² * 1²) / (30² + 70²) ) = SQRT( 5406.25 / 5800 ) = 0.97 = 97%

If you decrease the weight for Name and increase the weight for E-mail the total similarity will increase as well, as the contribution of the 100% matching node ‘e-mail’ increases compared to the node ‘name’ which has got a lower similarity.

Now let’s have a look at a second example where we enter only the name for the new contact:
Name = Bond

Now the configuration in “Duplicate Check Weighting For Business Partners” is completely irrelevant as there are no parts to be combined.
That means the total similarity is just the single similarity of the name part independent of any weights:
SIM = 0.75 = 75%

Summary:
The node weights you are configuring in the “Duplicate Check Weighting For Business Partners” only change the influence how much each individual of the above mentioned nodes (e. g. name or address) contribute to the total similarity compared to the other nodes.
Note, that the node weights are relative weights, i. e. only the ratio of a node weight compared to the other node weights is relevant and not the absolute value. If you e. g. double all node weights the calculated fuzzy similarity keeps the same.

former_member798405
Discoverer
0 Kudos

Hello,

I have a question. Thomas described the name and email similarity with a simple example below:

A short example shall illustrate this:
Let’s assume you are creating a new contact with these data
Name = Bond
E-mail = j.bond@mi6.com

Let’s say there is already this similar contact existing on the database:
Name = Band
E-mail = j.bond@mi6.com

The similarity of the names is 1 – ¼ = 0.75 = 75%.
The similarity of the e-mail addresses is 1 = 100%.


What if we change our example like that:

Lets assume my 1. data that I want to compare with other data is like that:

Name1 = Bla Bla Company Ltd. Co.

Address1 = Bla Bla Street No:1769 Newyork

2. data is:

Name2 = Bla Bla Bla Company Ltd. Co.

Address2 = Bla Bla Street No:1768 NewYork

And when I compared the two names with VBA (macro in excel) with Levenshtein distance formula, I get the result like "4"

So, what would be my "The similarity of the names"?

How can I calculate this?

is it possible to make a calculation like that:

1- (levensthein distance/ lenght of the 1. name ) = 1 - 4/24 = 0,83?

And I can find address similarity with the same way and it would be = 1 - 2/30 = 0,93?

And then we can put this ratios into our similarity formula:

lets assume

name weight is:70

address weight is:20

and our similarity =SQRT((70²*0,83²+20²*0,93²)/(70²+20²)) = 0,83

It this the correct way. I ask this question especially to proof "The similarity of the names" and "The similarity of the address"

Regards,

jerryjanda
Community Manager
Community Manager
0 Kudos

We're glad to have you in the community looking for answers to your questions, but you posted a question as an answer (which I converted to a comment) in an active thread (with former members) from 2015.

I want you to get the help you need, but you're unlikely to get any responses this way. Therefore, I'd like to offer some friendly advice:

* Try asking a new question instead at https://answers.sap.com/questions/ask.html.

* Familiarize yourself with https://community.sap.com/resources/questions-and-answers, as it provides tips for preparing questions that draw responses from our members.

* Take our Q&A tutorial at https://developers.sap.com/tutorials/community-qa.html, as that will also help you when preparing questions for the community.

* Complete your profile by following the steps at https://developers.sap.com/tutorials/community-profile.html, as a complete profile encourages readers to respond.

I hope you find this advice useful, and we're happy to have you as part of SAP Community!

Kind regards,

--Jerry

thomas_rischar
Explorer
0 Kudos

Hello Meryem,

Please have a look at the FAQs: Business Partner Duplicate Check I provided some time ago. Question Q1 describes how the similarity score for the name is calculated.

Best regards,
Thomas

former_member798405
Discoverer
0 Kudos

Hi Thomas,

Thank you so much for your information. When I see your reply to Q1, I am confused a little bit.

The given example is as follows:

Example:
Search String: Formatted name = Benjamin Franklin
Database Record: Formatted name = Banjaminn Franklin
‘Benjamin’ best matches to ‘Banjaminn’. Similarity with Levenshtein is 62% (1 x Update, 1 x Deletes, 1 x Insert, i. e. similarity = 1 – 3/8)
‘Franklin’ best matches to ‘Franklin’. Similarity obviously is 100%.
Finally, the two similarity values are aggregated by using the OR formula for token search within words:
SQRT( ( 0.62² + 1² ) / 2 ) ) = 83%.

1. My first question is about the calculation of the similarity of the name on this example:

When I use the levensthein distance formula via Excel VBA, it says the distance between "Benjamin" and "Banjaminn" is 2 not 3:

But in your calculation you calculated 3 (Benjamin’ best matches to ‘Banjaminn’. Similarity with Levenshtein is 62% (1 x Update, 1 x Deletes, 1 x Insert, i. e. similarity = 1 – 3/8)).

So our numerators are different. Levenshtein formula on VBA says, it would be 2. How could you find 3?

2. My second question is about way of comparison.

In your example you gave, you consider the indicators of the names separately and you call them as "token"s?

For example there is another example you gave:

Example:
Wintervoss Corp‘ and ‘Wintersnow Corp’ have similarity of 82%, as the similarity of ‘Wintervoss’ and ‘Wintersnow’ calculated with Levenshtein algorithm is 60%.

SQRT( ( 0.60² + 1² ) / 2 ) = 82%

How should we estimate the name similarity if our examples like as the following?:

New name: Beijing Piaomir Digital Technology Ltd.

Existing name: Beijing Vuaweir Digital Technology Co., Ltd.

Should we compare the "Beijing" with Beijing",

"Piaomir" with "Vuaweir",

"Digital" with "Digital",

" Technology" with "Technology",

"Ltd." with "Co., " and

" " with "Ltd."?

And then system takes the average of all of these tokens' similarity?

Many thanks in advance for your reply.

Best Regards,

Meryem


former_member798405
Discoverer
0 Kudos

Thomas,

Hi again. I guess we understand the problem of name duplication check issue.

We have to think about the name of the address with it's "tokens".

But how about the address? For example in our system the multipliers of the name and address are:

Name:70

Address:20

When we keep the address multiplier like above, system searches for the city name within the name fields on the database, and brings the names which includes city name on the address:

I can give an example like below:

this is the company name that I want to create on the system:

Siaomi Digital Technology Ltd. And the city of this company is Beijing.

And the system shows me the duplication results like below:

As you see, the most similar results starts with "Beijing" and this is the city field of my company but doesn't exist in the name field.

What kind of a search algorithm exists behind this duplicate check if name and address have multipliers 70 and 20?

If you inform us we would be so happy.

Best Regards,

Meryem

thomas_rischar
Explorer
0 Kudos

Hello Meryem,

The weights in your example (Name: 70, Address: 20, and all others 0) mean that only the Name and the (Postal) Address fields are regarded during the Duplicate Check. And they define how much the Name and (Postal) Address contribute to the total similarity.

Like in my example in this BLOG with e-mail and name we have a 2-step approach. First step: calculate the Name similarity and the (Postal) Address similarity. Second step: these two similarities are combined using the OR formula to a total similarity. The only difference is that also the calculation of the (Postal) Address similarity is done in 2 steps. First step: calculate the similarity of each (Postal) Address field that is maintained in the instance: e. g. Country, City and Street. Second step: combine the Country similarity, the City similarity and the Street similarity to a (Postal) Address similarity using the OR formula (with fixed weights for the particular address fields).

Note that (Postal) Address attributes are NEVER used to calculate a Name similarity. That means even though a City name might occur in the Account’s Name it does not change the Name similarity whether the City (of the Address) matches to a “City” token in the Name or not.

Best regards,
Thomas

Accepted Solutions (0)

Answers (3)

Answers (3)

Former Member
0 Kudos

Hello Thomas,

besides the standard delivery which you've described above, is there an option to adapt the standard algorithm? Or even to replace it by a custom logic?

If YES - could you please provide directions or an example?

Thanks in advance for any any hints.

Best regards,

Christian.

thomas_rischar
Explorer

Hello Christian,

It is only possible to set the threshold in the Business Configuration Scoping or to set the weights in the Business Configuration Finetuning "Duplicate Check Weighting for Business Partners" as described above.

It is neither possible to adapt the algorithm nor to replace it.

Best regards,

Thomas

former_member210407
Active Participant
0 Kudos

Hello,

As Thomas said it is not possible to adapt/enhence the duplicate check algorithm with custom logic but it is possible to put custom logic in PDI scripts and block the saving if customer data does not match the requirements.

Ex: If you use a very discriminating data like VAT number in europe, you don't want to have two accounts with the same VAT number.

In the account on validation event you execute a query in accounts with the VAT number as parameter. If query returns any data you have a duplicate and therefore you will return false (block the save) and raise a custom error message.

Best regards,

Benjamin

Former Member
0 Kudos

Hello Thomas,

thank you for the clarification!

Christian.

Former Member
0 Kudos

Hello Benjamin,

thank you for the proposed solution. In case of a very large customer basis, I suppose that this new custom query might be quite performance intensive though. But it's good to know this option!

Best regards,

Christian.

JohnHunter
Discoverer
Former Member
0 Kudos

Hi Thomas,

thank you for the very good and detailed reply.

Our Client's IT Department asked us a quick question: how much is then the threshold on the Levenshtein distance in order to include a Business Partner in the Checked Duplicates list? We guess should be around 80%~90%, is that correct?

Thanks a lot,

Davide

thomas_rischar
Explorer
0 Kudos

Hi Davide,

In the Business Configuration Scoping you can configure this threshold. You can
choose if you want a Strong, Medium or Weak Duplicate Check for Business Partners.

Here are the respective thresholds:
Strong Duplicate Check for Business Partners’: 85%
Medium Duplicate Check for Business Partners’: 80%

Weak Duplicate Check for Business Partners’: 70%

The default setting is ‘Medium Duplicate Check for Business Partners’, i. e. a threshold of 80%.

Best regards,
Thomas

Former Member
0 Kudos

Thank you, Thomas. It was exactly the information we needed about.

Cheers,

Davide

JohnHunter
Discoverer
0 Kudos
Former Member
0 Kudos

Thomas, thanks a LOT for this complete answer. Now I will stop telling my customers the duplicate check is just a fancy Black Box

I can now give them the whole formula and even suggest an evening course for Mathematics!