cancel
Showing results for 
Search instead for 
Did you mean: 
Read only

Fuzzy search

Former Member
0 Likes
1,968

I'm having a hard time even trying to frame my question, so let me start at the simplest level.

Can someone help me understand why there is a difference between a text search and a string search?

I'm trying to do some person matching, and the majority of the fields are strings.  However, quite a bit of the fuzzy matching functionality doesn't work with strings, only with text.

My population right now is mainly mothers.  So I'm trying to figure out how to write a fuzzy matching string that will compare their last names.  I have one dataset that has fields for maiden name and last name, and one that just has last name.  Also, it seems that people with hyphenated names will sometimes put in one name, sometimes the other, and sometimes the hyphenated name.  So I'd really like to write a match statement that will look at my maiden and last names and compare against the last names in the other data set, and if there is any one word in there that matches to pass that record.

For example, I'd like to get matches on all of these types of situations:

Maiden: Johnson Last1: Smith -- Last2: Johnson-Smith

Maiden: Johnson Last1: Johnson-Smith -- Last2 Smith

Maiden: Jones   Last1: Frank  -- Last2: Jones

Maiden: Anderson  Last1: Miller -- Last2: Miller

If any one word matches across, I want to use it.  Do I need to convert all my data to text instead of varchar, so that I can get all the various token matching options?

Another question I have is about empty values.  I found the emptyscore option to tell it what score to give if one or both sides are empty.  Is there a way to just tell it to ignore that comparison if the values are empty?

For example if I'm comparing on first name, last name, date of birth, and ssn, and I have exact matches on all of those, but the SSNs are blank, is there a way to just say not to put that calculation in at all?

Here's what my cursor looks like right now, please ignore any obvious typos as I can't cut and paste directly from my system

declare cursor c_data1 for

     select * from data1;

for c_mom as c_data1 do

     insert into fuzzy_match_table

          select c_mom.id, d2.id, c_mom.last, d2.last, d2.maiden, c_mom.first, d2.first, c_mom.dob, d2.dob,

               c_mom.ssn, d2.ssn, score() as score, current_timestamp

          from data2 d2

          where contains(d2.last, c_mom.last, fuzzy(0.8, 'emptyScore=0.7))

               and contains(d2.first, c_mom.first, fuzzy(0.8, 'emptyScore=0.7))

               and contains(d2.dob, c_mom.dob, fuzzy(0.8, 'emptyScore=0.7))

               and contains(d2.ssn, c_mom.ssn, fuzzy(0.8, 'emptyScore=0.7));

end for;

TL/DR: How can I get the above cursor to check maiden names, and how can I either properly weigh or ignore my empty fields so they aren't having so much influence over my score.

Accepted Solutions (1)

Accepted Solutions (1)

lucas_oliveira
Product and Topic Expert
Product and Topic Expert
0 Likes

Hi Vanda,

Not sure what you're looking for but varchar/nvarchar can be used with fuzzy search as long as you have fulltext index on. That's in the documentation.

If I get this right you're tring to compare the columns of your rows ( one at a time since contains does not allow column value as second parameter). Despite the usage of 'and' on your code (seems like you actually want to combine those with 'or' predicate instead) it looks you got it going. Did I get this wrong?

Seems the scenario could be done with like_regexp as well... not sure if you have looked into that already.

I tried to build up some data and use your procedure to get the desired results. Sligthly changed it to get there. But honestly I'm not sure this is what you're looking for. This is what I got so far:


drop table population cascade;

create column table population (

  id integer,

  maiden varchar(100),

  last1  varchar(100),

  last2  varchar(100)

);

create fulltext index mindex on population(maiden);

create fulltext index l1index on population(last1);

create fulltext index l2index on population(last2);

insert into population values(1,'Johnson' , 'Smith'   ,'Johnson-Smith');

insert into population values(2,'Johnson' , 'Johnson-Smith'  ,'Smith'        );

insert into population values(3,'Jones'   , 'Frank'   ,'Jones'        );

insert into population values(4,'Anderson', 'Miller'   ,'Miller'       );

insert into population values(5,'Jenny' , 'Miller'   ,'Sanders'       );

insert into population values(6,'Susan' , 'Harris'   ,''       );

drop table populationtemp;

create global temporary table populationtemp (

  id integer,

  maiden varchar(100),

  last1  varchar(100),

  last2  varchar(100),

  score double

);

do

  begin

  declare cursor cur for select * from population;

  truncate table populationtemp;

  for i as cur do

  insert into populationtemp

  select *, score() as score from population m where m.id = i.id and  -- compare only same id, and optional scenarios

  (m.last1 <> '' or m.last2 <> '') and -- ignore when last1 or last2 are empty

  ( contains(m.last1,i.last2, fuzzy(0.8)) or               -- last2 contained in last1

   contains(m.last2,i.last1, fuzzy(0.8)) or -- last1 contained in last2

   contains(m.maiden,i.last1, fuzzy(0.8)) or -- maiden contained in last1

   contains(m.maiden,i.last2, fuzzy(0.8)) ); -- maien contained in last2

  end for;

  select * from populationtemp;

  end;

For the test I got the result below:

IDMAIDENLAST1LAST2SCORE
1JohnsonSmithJohnson-Smith0.32500001788139343
2JohnsonJohnson-SmithSmith0.34285715222358704
3JonesFrankJones0.5
4AndersonMillerMiller0.7071067690849304

I guess it's better for you to elaborate your scenario a bit more. A few sample data, the create statement of your population table and the desired output would possibly help out.

BRs,

Lucas de Oliveira

Former Member
0 Likes

Lucas,

Thanks for the response.  I know I did get it going, but I'm trying to find a way to refine my results to get better match scores.

I don't think what I'm trying to do is really a complicated concept, but I just seem to be having a hard time trying to get it nailed down.

I have two populations of women, one population has data from right around the time of a birth, the other population has data from a variety of times.  Using the PII I have, which is first name, last name, DOB, SSN, and in one dataset the maiden name, I want to match up the women across the two datasets.

So if in one data set I have these sets of records

First        Last             Maiden          DOB            SSN

Joan     Johnson          Jones           1/11/2011     111111111

Joan     Jones                                  1/11/2011     111111111

Betty     Brown-Bones  Bones            3/13/2013     333333333

Betty     Brown                                  3/13/2013     333333333

I'd like these example to match, and to have a nice high match score.  So I'd like to compare last name vs last name and maiden name and if any one part of those names matches, get a nice high match score so that it doesn't pull my averaged score() down.

I think I need the and for the other factors, because I do want to match on all of those things.  But I'm trying to figure out how to add in the maiden name matching without pulling the score down.  If you look at your example, your highest match is a .70 which is below what I'd consider a good matching threshold.  I did try an OR early on, with the last and maiden names, and got similar results. 

That's why I was wondering if I needed to switch the fields from strings to text so that I can try and use some of the parameters around individual token matching?

Something like

where contains ((d2.last,d2.maiden), c_mom.last, fuzzy(0.5, 'considerNotMatchingTokens=min, bestMatchingTokenWeight=1.0, excessTokenWeight=0.1')

I'm not sure if that's even a valid clause or not, but something around that idea?

Also on the blank records, I don't want to ignore them as in not match them, I just want to ignore them for the purposes of calculating the score.  I'm actually not even sure I want to do that, anymore, but it was question I figured I'd throw out there while I was asking about fuzzy matching.

Answers (0)