Solved: Fuzzy search

Former Member · ‎2015 Aug 31

I'm having a hard time even trying to frame my question, so let me start at the simplest level.

Can someone help me understand why there is a difference between a text search and a string search?

I'm trying to do some person matching, and the majority of the fields are strings. However, quite a bit of the fuzzy matching functionality doesn't work with strings, only with text.

My population right now is mainly mothers. So I'm trying to figure out how to write a fuzzy matching string that will compare their last names. I have one dataset that has fields for maiden name and last name, and one that just has last name. Also, it seems that people with hyphenated names will sometimes put in one name, sometimes the other, and sometimes the hyphenated name. So I'd really like to write a match statement that will look at my maiden and last names and compare against the last names in the other data set, and if there is any one word in there that matches to pass that record.

For example, I'd like to get matches on all of these types of situations:

Maiden: Johnson Last1: Smith -- Last2: Johnson-Smith

Maiden: Johnson Last1: Johnson-Smith -- Last2 Smith

Maiden: Jones Last1: Frank -- Last2: Jones

Maiden: Anderson Last1: Miller -- Last2: Miller

If any one word matches across, I want to use it. Do I need to convert all my data to text instead of varchar, so that I can get all the various token matching options?

Another question I have is about empty values. I found the emptyscore option to tell it what score to give if one or both sides are empty. Is there a way to just tell it to ignore that comparison if the values are empty?

For example if I'm comparing on first name, last name, date of birth, and ssn, and I have exact matches on all of those, but the SSNs are blank, is there a way to just say not to put that calculation in at all?

Here's what my cursor looks like right now, please ignore any obvious typos as I can't cut and paste directly from my system

declare cursor c_data1 for

select * from data1;

for c_mom as c_data1 do

insert into fuzzy_match_table

select c_mom.id, d2.id, c_mom.last, d2.last, d2.maiden, c_mom.first, d2.first, c_mom.dob, d2.dob,

c_mom.ssn, d2.ssn, score() as score, current_timestamp

from data2 d2

where contains(d2.last, c_mom.last, fuzzy(0.8, 'emptyScore=0.7))

and contains(d2.first, c_mom.first, fuzzy(0.8, 'emptyScore=0.7))

and contains(d2.dob, c_mom.dob, fuzzy(0.8, 'emptyScore=0.7))

and contains(d2.ssn, c_mom.ssn, fuzzy(0.8, 'emptyScore=0.7));

end for;

TL/DR: How can I get the above cursor to check maiden names, and how can I either properly weigh or ignore my empty fields so they aren't having so much influence over my score.

lucas_oliveira · ‎2015 Sep 01

Hi Vanda,

Not sure what you're looking for but varchar/nvarchar can be used with fuzzy search as long as you have fulltext index on. That's in the documentation.

If I get this right you're tring to compare the columns of your rows ( one at a time since contains does not allow column value as second parameter). Despite the usage of 'and' on your code (seems like you actually want to combine those with 'or' predicate instead) it looks you got it going. Did I get this wrong?

Seems the scenario could be done with like_regexp as well... not sure if you have looked into that already.

I tried to build up some data and use your procedure to get the desired results. Sligthly changed it to get there. But honestly I'm not sure this is what you're looking for. This is what I got so far:


drop table population cascade;
create column table population (
  id integer,
  maiden varchar(100),
  last1  varchar(100),
  last2  varchar(100)
);


create fulltext index mindex on population(maiden);
create fulltext index l1index on population(last1);
create fulltext index l2index on population(last2);


insert into population values(1,'Johnson' , 'Smith'   ,'Johnson-Smith');
insert into population values(2,'Johnson' , 'Johnson-Smith'  ,'Smith'        );
insert into population values(3,'Jones'   , 'Frank'   ,'Jones'        );
insert into population values(4,'Anderson', 'Miller'   ,'Miller'       );
insert into population values(5,'Jenny' , 'Miller'   ,'Sanders'       );
insert into population values(6,'Susan' , 'Harris'   ,''       );


drop table populationtemp;
create global temporary table populationtemp (
  id integer,
  maiden varchar(100),
  last1  varchar(100),
  last2  varchar(100),
  score double
);


do
  begin

  declare cursor cur for select * from population;

  truncate table populationtemp;

  for i as cur do
  insert into populationtemp 
  select *, score() as score from population m where m.id = i.id and  -- compare only same id, and optional scenarios
  (m.last1 <> '' or m.last2 <> '') and -- ignore when last1 or last2 are empty
  ( contains(m.last1,i.last2, fuzzy(0.8)) or               -- last2 contained in last1
   contains(m.last2,i.last1, fuzzy(0.8)) or -- last1 contained in last2
   contains(m.maiden,i.last1, fuzzy(0.8)) or -- maiden contained in last1
   contains(m.maiden,i.last2, fuzzy(0.8)) ); -- maien contained in last2
  end for;

  select * from populationtemp;
  end;

For the test I got the result below:

ID	MAIDEN	LAST1	LAST2	SCORE
1	Johnson	Smith	Johnson-Smith	0.32500001788139343
2	Johnson	Johnson-Smith	Smith	0.34285715222358704
3	Jones	Frank	Jones	0.5
4	Anderson	Miller	Miller	0.7071067690849304

I guess it's better for you to elaborate your scenario a bit more. A few sample data, the create statement of your population table and the desired output would possibly help out.

BRs,

Lucas de Oliveira

By Category

Related Content

Activity Groups

Industry Groups

Influence and Feedback Groups

Interest Groups

Location Groups

Customer Only Groups

Forums

Related Resources

Products

Learning and Support

About

My Account

My Account

Fuzzy search

Know the answer?

Need more details?

Accepted Solutions (1)

Accepted Solutions (1)

Answers (0)