cancel
Showing results for 
Search instead for 
Did you mean: 

HANA Fuzzy Search

rohit_trivedi
Participant
0 Kudos

Experts:

My use case is: I want to perform fuzzy search of Names in one table into another table.

                          Both tables have Thousands of records.

Problem: If i use SQL CONTAINS statment to perform fuzzy search, it takes around 20 ms for one check.

               Therefore to check 100K Names, it takes 100000 * .02 = 2000 seconds!! (which is quite high).

What is most performant way to perform fuzzy check with mass data on both sides ?

Thanks & Regards.               

Accepted Solutions (0)

Answers (5)

Answers (5)

Former Member
0 Kudos

I solved this problem by creating a cursor for the field you would put in the search term and looping over each search term.  For each of the search terms I insert the results of the fuzzy search into a table.  I don't know if it's the most efficient way to do it, but it works.  I just submitted a blog post about it called Fuzzy Matching Dirty Data.  About a year late, but hopefully this will help someone.

0 Kudos

Hi Robert,

I think what you propose is more or less what Rohit is already doing. How he is complaining about the performance for mass checks, as the "simple loop" approach does not meet his expectations. However, for 100.000 records to be searched on the 20ms might still not be optimal. The most important things I have learned to take care on in such scenarios is:

1. enable fuzzy indexes AND b-tree indexes for the fields you are searching on

(do not create one huge b-tree index across all searched fields, but one for each field, so HANA can combine those indexes and must not do the column scans).

2. (equal important) enable fuzzy indexes (with alter table) as described in the HANA search documentation).

3. avoid cross table joins for the search table, if this is not one single column table but an attribute view across different physical tables.

If you have a repeating case like this in the context of master data, consider usage of SAP MDG consolidation. This already provides a frame work which brings optimizations like mentioned above out of the box (plus parallelization), and the ability to change and manage your matching configuration via Search-Rule-Sets in HANA Studio (which is much more flexible and maintainable than having SQL-Script-Code and contains-clauses to be maintained.

See also:

Hope this might help further from time to time,

Regards,

Roland

rohit_trivedi
Participant
0 Kudos

Hi Kyle,

            Problem is that even existing capabilities of hardware, CPU is not exploited with the software approach offered.

You can check that SQL CONTAINS statement used for performing fuzzy search accepts only single string as input!

And this is bottleneck. I need to loop through each entry in one table to find hits in target table.

JOIN on two tables is not possible with SQL CONTAINS. Therefore no possibility exists for parallel processing of workset.

Regards,
Rohit.

kyle_mcadam
Active Participant
0 Kudos

Hi Rohit,

Thanks for the details, now I see more clearly, what the problem is.

one option you have here, is to consider a union all query, where you split each section of the union with a seperate sql statement that selects from a portion of table 1,

this way the number of sections in your union all query could potentially be split by the sql optimizer, into parallel queries....(up to the number of available cores)

finally being joined in one result

rohit_trivedi
Participant
0 Kudos

Hi All,

         This misses the point i'm trying to make.

-> Single search (after creating index etc. ) is not a problem. I get results very fast : 20ms!

-> Problem is that i've to search 100,000 records in another table with 100,000 records.

-> since i can perform only 1 search at a time, it takes 100,000 * 20 ms = 2000 seconds to scan complete data set.

Where is the parallelism ?

kyle_mcadam
Active Participant
0 Kudos

Hi Rohit,

So, in light of the new information you've shared, can you elaborate a little by sharing your hardware/landscape information, such as the # of cpus, and # of nodes in your scale out...how much RAM on the machine

Next, because you want to analyze the parallelism, I would analyze the visual plan and the performance trace to be able to see if in fact there is a problem with the system's parallelism functionality

0 Kudos

Hi Rohit

As kyle says, you need an index of the type Fulltext, you must create it in HANA studio. You can use the Indexes tag on the table screen, or create it in SQL console with the following command:

 

CREATE FULLTEXT INDEX myindex ON mytable(col3) FUZZY SEARCH INDEX ON;

You can read a lot more about it in the developer guide.

I have tested it with a table containing 6 million records, it is quite fast.

Former Member
0 Kudos

Hi Johansen

I tried to create the full texted index on NVARCHAR column, using exactly what you suggested.

CREATE FULLTEXT INDEX myindex ON mytable(col3) FUZZY SEARCH INDEX ON;

But, it makes my fuzzy search super slow, instead of speed it up.

I have 8 million data row, and without indexing, it takes 1 second to fuzzy search, and 3 minutes after indexing !!!

Does fulltext indexing only work for text not for string type column?

How indexing really works?

My hana revision is 112.03.

0 Kudos

Ho Fox Lu,

first, as far as I know, the fulltext index is created async by default. So even if your "CREATE FULLTEXT...." command returns immediately, you won't have the index available when the statement returns unless you define it to be created in sync mode. See:

CREATE FULLTEXT INDEX - SAP HANA SQL and System Views Reference - SAP Library

Further, according to a colleague, the creation of a fulltext index creates an additional, hidden column with all values in the delta first. So you should run a "MERGE DELTA" after creation of a full text. See: MERGE DELTA - SAP HANA SQL and System Views Reference - SAP Library

Please let me know if this helps to get your search faster. Consider, with the FUZZY INDEX enabled, you will save a lot of CPU time compared to fuzzy searches with no FUZZY INDEX (without you really burn CPU time).

Roland

Former Member
0 Kudos

Hi Roland,

Thanks for help. Your post resolves some of my confusion.

I find that the indexed table and the table without indexing will have different semantic on the same statement of fuzzy search. They will retrieve different number of results, so the comparison explain nothing.

select COL from MYTABLE where contains("COL",'john',fuzzy(0.0)); ---this only have 10 results.

select COL from MYTABLE_INDEXED where contains("COL",'john',fuzzy(0.0)); ---this have 3 million results which cause the speed down tremendously. I should use fuzzy(0.9) instead.


And indexing gives me more reasonable results. The indexing do benefit me.


By the way, may I ask, how can I set search-relevant columns at the creation of the view?

0 Kudos

Hi Fox Lu,

you have to distuinguish between search with and without FULLTEXT index. FULLTEXT index does provide text search on string fields. So the behaviour clearly will change. You also can define the FUZZY-Index without FULLTEXT index on string type columns, but you will only get fuzziness applied on strings, which is much less powerful (no term mappings, no stopwords and much more less).

You will find a whole lot of information related to fuzzy search and text search in the HANA documentation. Also how to define the different types of views, and how, and to which extend, you can define search relevant parameters which each of those views. Most likely you are talking about attribute views, which is the ony view type I am aware of which supports search-relevant stuff. If you are dealing with CDS, it is a different story, as you have to work with annotations there, but again, you will find all the infos in the HANA Documentation most likely.

kyle_mcadam
Active Participant
0 Kudos

Hi Rohit!!

Make sure it is a column store table, and then if the performance still suffers, add an index on the column being searched (second table).

I hope this helps!