on 2021 Jun 16 8:48 PM
Hello!
I've been working lately on performance improvements for one of the queries that we have exposed as an OData endpoint (it's a CalcView that basically calls a SQL function).
My data has the following structure:
-- 75m records
CREATE COLUMN TABLE PERSON_SEARCH (
"PERSON_ID" NVARCHAR(40),
"FIRST_NAME" NVARCHAR(500) FUZZY SEARCH INDEX ON,
"LAST_NAME" NVARCHAR(500) FUZZY SEARCH INDEX ON );
CREATE INDEX idx_person_first_name ON PERSON_SEARCH (FIRST_NAME);
CREATE INDEX idx_person_last_name ON PERSON_SEARCH (LAST_NAME);
-- 23m records
CREATE COLUMN TABLE PERSON (
"ID" NVARCHAR(40),
"FIRST_NAME" NVARCHAR(500),
"LAST_NAME" NVARCHAR(500),
"YEAR_OF_BIRTH" SMALLINT,
PRIMARY KEY ("ID") );
-- 23m records
CREATE COLUMN TABLE PERSON_TO_ADDRESS (
"ADDRESS_ID" NVARCHAR(40) NOT NULL,
"PERSON_ID" NVARCHAR(40) NOT NULL );
CREATE INDEX idx_person_to_address_address_id ON PERSON_TO_ADDRESS (ADDRESS_ID);
CREATE INDEX idx_person_to_address_person_id ON PERSON_TO_ADDRESS (PERSON_ID);
-- 26m records
CREATE COLUMN TABLE ADDRESS_DETAILS (
"ID" NVARCHAR(40) NOT NULL,
"COUNTRY_CODE" NVARCHAR(2) FUZZY SEARCH INDEX ON,
"CITY" NVARCHAR(255) FUZZY SEARCH INDEX ON,
"ZIP" NVARCHAR(255) FUZZY SEARCH INDEX ON,
"STREET" NVARCHAR(1000) FUZZY SEARCH INDEX ON,
PRIMARY KEY ("ID") );
CREATE INDEX idx_address_details_street ON ADDRESS_DETAILS (STREET);
CREATE INDEX idx_address_details_city ON ADDRESS_DETAILS (CITY);
CREATE INDEX idx_address_details_country_code ON ADDRESS_DETAILS (COUNTRY_CODE);
CREATE INDEX idx_address_details_zip ON ADDRESS_DETAILS (ZIP);
-- 29m records
CREATE COLUMN TABLE PERSON_FUNCTION (
"PERSON_ID" NVARCHAR(40) NOT NULL,
"FUNCTION_ID" NVARCHAR(255),
"FUNTION_DESCRIPTION" NVARCHAR(255));
CREATE INDEX idx_person_function_person_id ON PERSON_FUNCTION (PERSON_ID);
There is also an additional view, V_PERSON_SCORE, that does some concatenation etc. and returns PERSON_ID, SCORE, and SCORE_CATEGORY (derived from some aggregations; possible values are LOW, MEDIUM, HIGH, or NULL). It is built on a table with the following structure; a hypothetical sketch of the view's shape follows the table definition:
-- 65k records
CREATE COLUMN TABLE PERSON_SCORE (
"PERSON_ID" NVARCHAR(36) NOT NULL,
"PARTIAL_SCORE" DOUBLE CS_DOUBLE NOT NULL );
My query looks like the following:
SELECT DISTINCT
person.id AS id,
person.first_name AS first_name,
person.last_name AS last_name,
address.city AS city,
address.country_code AS country_code,
address.street AS street,
address.zip AS zip,
person.year_of_birth AS year_of_birth,
function.function_description AS function_description,
CASE
WHEN ROUND(MAX(COALESCE(p_score.score,0)), 3) < 0.200 THEN 0
ELSE 1
END AS per_score,
MAX(TO_DECIMAL(SCORE(),3,2)*100) AS score
FROM PERSON_SEARCH AS person_history
INNER JOIN PERSON AS person
ON person.id = person_history.person_id
INNER JOIN PERSON_TO_ADDRESS AS person_to_address
ON person_to_address.person_id = person.id
INNER JOIN ADDRESS_DETAILS AS address
ON address.id = person_to_address.address_id
INNER JOIN PERSON_FUNCTION AS function
ON function.person_id = person.id
LEFT JOIN V_PERSON_SCORE AS p_score
ON person.id = p_score.person_id
WHERE CONTAINS(person_history.first_name, :ip_first_name, FUZZY(0.7, 'similarCalculationMode=substringsearch'))
AND CONTAINS(person_history.last_name, :ip_last_name, FUZZY(0.7, 'similarCalculationMode=substringsearch'))
AND CONTAINS(address.street, :ip_street, FUZZY(0.7, 'similarCalculationMode=symmetricsearch'))
AND CONTAINS(address.city, :ip_city, FUZZY(0.7, 'similarCalculationMode=symmetricsearch'))
AND CONTAINS(address.country_code, :ip_country_code)
AND CONTAINS(address.zip, :ip_zip)
AND COALESCE(person.year_of_birth, 1) BETWEEN 0 AND 9999
AND CONTAINS(function.function_id, :ip_functions)
GROUP BY
person.id,
person.first_name,
person.last_name,
address.city,
address.country_code,
address.street,
address.zip,
person.year_of_birth,
function.function_description,
p_score.score,
SCORE()
ORDER BY SCORE() DESC;
ip_functions can be BOSS, OWNER, EMPLOYEE, or RETIREE, which are the functions that we look for, passed in the format 'BOSS OR OWNER'. A hypothetical call is sketched below.
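A hypothetical call for illustration (the function name and parameter order are assumptions; '*' is passed for the unrestricted parameters, matching the wildcard convention used later in this thread):
SELECT * FROM "F_PERSON_SEARCH"(
  '*',             -- ip_first_name (unrestricted)
  '*',             -- ip_last_name
  '*',             -- ip_street
  'Berlin',        -- ip_city
  '*',             -- ip_country_code
  '*',             -- ip_zip
  'BOSS OR OWNER'  -- ip_functions
);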
I've added indexes on all of the recommended columns listed in the plan generated by "Analyze SQL and Save Plan" (Web IDE).
Generally speaking, performance looks very good when we provide any of the parameters ip_first_name, ip_last_name, ip_street, or ip_zip, but when those are missing and we only provide something from lt_p_score or ip_country_code, performance looks very bad.
Quick example:
- ip_city as 'Berlin', ip_functions as 'EMPLOYEE OR BOSS OR OWNER' there are 152497 records returned in 1687ms.
- ip_zip as '60306', ip_functions as 'EMPLOYEE OR BOSS OR OWNER' there are 29 results returned in 154ms.
- ip_last_name as 'Mark', ip_functions as 'EMPLOYEE OR BOSS OR OWNER' there are 201561 records returned in 1925ms.
and those times (even if not perfect) are fine for me.
Problem starts when:
- ip_country_code as 'DE', ip_functions as 'EMPLOYEE OR BOSS OR OWNER' there are 4282557 records returned in 33283ms.
- ip_country_code as 'DE', ip_functions as 'EMPLOYEE' there are 3142952 records returned in 32100ms.
which is quite problematic from the user's point of view.
Is there any way to improve the performance of my statement?
I was thinking of creating one table that would hold all of the data and would be refreshed from time to time (once per hour, for example), but it seems completely redundant and there is probably a better way to do it, I just don't see it...
I would be really grateful for any help or feedback about what could be improved here.
You wrote that returning 4.2 million records takes ~33 seconds, which equates to roughly 130k records per second.
What is the usage scenario for that volume of data to be returned? Large result sets are a common performance issue that cannot be improved by adding indexes etc.
Concerning the indexes the OP added:
There seems to be confusion between FULLTEXT indexes and the ones created via CREATE INDEX. The latter do not improve fuzzy search performance and should likely be removed again.
A slight improvement in terms of filtering could be gained by ensuring that YEAR_OF_BIRTH is NOT NULL and of an INTEGER datatype (HANA does not implement SMALLINT differently from INT, so there are no "savings" in there). This will make the convoluted WHERE condition simpler; see the sketch below.
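A minimal sketch of that change, assuming the column currently holds no NULLs (otherwise backfill first; the DEFAULT value is an assumption):
ALTER TABLE PERSON ALTER ("YEAR_OF_BIRTH" INTEGER DEFAULT 0 NOT NULL);
-- the range check then no longer needs COALESCE:
-- WHERE person.year_of_birth BETWEEN 0 AND 9999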
The join against PERSON_SCORE will trigger a conversion for PERSON_ID (length 40 vs. 36), which is never helpful; a sketch of aligning the types follows.
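In the same spirit, a sketch that aligns the lengths so the join no longer converts (safe here, since all 36-character values fit into 40):
ALTER TABLE PERSON_SCORE ALTER ("PERSON_ID" NVARCHAR(40) NOT NULL);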
Finally, the DISTINCT keyword is superfluous as the result set is GROUPED anyhow.
Overall though: reduce/limit the result set size as much as possible if total execution time is of concern.
Hello,
thank you for the reply!
The result of this query is exposed via a CalcView as an OData service, serving as a data feed to one of the search reports. Generally speaking, most users provide "normal" values as parameters, so the number of returned records won't be above 20k, but there are some users that would, for example, "check all companies in the US", hence they provide only the country code US and expect some results. As I said above, they are getting them, it's just quite "slow".
1. I don't really need FULLTEXT indexes since I am using only simple substring search (that's why I went for a simple FUZZY SEARCH INDEX ON on the string columns). The additional indexes were added after going through the Fuzzy Search section of the SAP Help, which says "Additional performance improvements are possible when creating database indexes on the columns."
CREATE INDEX myindex1 ON mytable(col1);
CREATE INDEX myindex2 ON mytable(col2);
2. Unfortunately, I am not able to do that.
3. From what I can see after a few tests, what helps a bit is removing the WHERE clause and going with APPLY_FILTER. Since in the examples I listed not all parameters are passed (actually, if 3 or more are passed, performance is very good), I am able to drop some of the filters. At first glance that shouldn't improve performance (more filters almost always = better performance), but that seems not to be the case for fuzzy searches when '*' (a wildcard) is given.
CONTAINS(person_history.first_name, :ip_first_name, FUZZY(0.7, 'similarCalculationMode=substringsearch'))
AND CONTAINS(person_history.last_name, :ip_last_name, FUZZY(0.7, 'similarCalculationMode=substringsearch'))
AND CONTAINS(address.street, :ip_street, FUZZY(0.7, 'similarCalculationMode=symmetricsearch'))
AND CONTAINS(address.city, :ip_city, FUZZY(0.7, 'similarCalculationMode=symmetricsearch'))
AND CONTAINS(function.function_id, :ip_functions)
Example:
Let's say I have the WHERE clause as above, and
I am getting 152497 records returned in ~1500ms.
When I change the WHERE clause to
AND CONTAINS(address.city, :ip_city, FUZZY(0.7, 'similarCalculationMode=symmetricsearch'))
AND CONTAINS(function.function_id, :ip_functions)
and run it with the same ip_city and ip_functions as above, I am getting the same result (since the wildcards don't really matter) in ~650ms, which seems to be a large improvement.
This leads me towards having a table variable without any WHERE clause and checking each parameter: if it is <> '*', then APPLY_FILTER on that column, otherwise leave it alone. A sketch of that pattern is below.
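A minimal SQLScript sketch of that pattern, assuming lt_base already holds the joined rows; only the plain (non-fuzzy) filters are appended here, since CONTAINS generally needs physical column tables and cannot simply be moved into an APPLY_FILTER string:
-- start from an always-true condition, append one predicate per provided parameter
DECLARE lv_filter NVARCHAR(1000) := '1 = 1';
IF :ip_country_code <> '*' THEN
  lv_filter := :lv_filter || ' AND "COUNTRY_CODE" = ''' || ESCAPE_SINGLE_QUOTES(:ip_country_code) || '''';
END IF;
IF :ip_zip <> '*' THEN
  lv_filter := :lv_filter || ' AND "ZIP" = ''' || ESCAPE_SINGLE_QUOTES(:ip_zip) || '''';
END IF;
lt_filtered = APPLY_FILTER(:lt_base, :lv_filter);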
Short feedback on that:
1. FUZZY SEARCH indexes can help with FUZZY searches - "normal" indexes simply won't.
3. Wildcard searches (even with "*") against NULLABLE columns need to be executed/computed to ensure correct result sets. That may lead to the same result as not executing the searches, but the DB has no way of knowing that in advance; a tiny illustration follows.
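To illustrate the point in 3. (the two sample values are hypothetical): if FIRST_NAME holds 'Anna' in one row and NULL in another, then
SELECT COUNT(*) FROM PERSON_SEARCH
WHERE CONTAINS(FIRST_NAME, '*', FUZZY(0.7, 'similarCalculationMode=substringsearch'));
returns 1, not 2, because the predicate is UNKNOWN for the NULL row, so the engine has to evaluate the search rather than skip it.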
All in all, the main cause of the "weak" performance is that users can ask for arbitrarily large result sets and the application naively passes this through to the DB.
There are better ways to deal with "show me everything", e.g. pagination (as in the Google UI) or faceted search with aggregation.
Thanks again!
1. I will delete them then and test again (not sure why this is listed in the official docs as something that might speed things up).
3. Got it (then APPLY_FILTER will be helpful, as I said).
Generally it is paginated (I'm obviously not returning all X m records), but since it's sorted by SCORE(), it still has to be executed as a whole.
Per faceted search with aggregation: does that simply mean that I should test it with /Facets, as shown here, instead of /Search every time? (i.e., checking in UI5 which parameters are provided, then running /Search when parameters that filter out a lot of results are given, or /Facets with facetslimit when something generic is provided that will still return a lot of records)?
SCORE needs to be calculated, but that should be possible before/without full materialisation of the result set.
That means the pagination, which currently seems to be done in the UI by simply ignoring the majority of the computed result records, could be implemented on the DB level, e.g. via LIMIT; a sketch is below.
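A hedged sketch of that, reduced to a single fuzzy filter for brevity (the page size of 50 and the literal search term are assumptions):
SELECT p.id, p.first_name, p.last_name,
  TO_DECIMAL(SCORE(), 3, 2) * 100 AS score
FROM PERSON_SEARCH ps
INNER JOIN PERSON p ON p.id = ps.person_id
WHERE CONTAINS(ps.last_name, 'Mark', FUZZY(0.7, 'similarCalculationMode=substringsearch'))
ORDER BY SCORE() DESC
LIMIT 50 OFFSET 0;  -- the DB ships one page instead of the full result set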
Concerning faceted search: no, this does not just mean adding some query parameters but providing a faceted search UI instead.
The goal is to lead the user to the records of interest faster, without producing result sets with millions of records.
This is not a query tuning exercise but an exercise in helping users to ask sensible questions.
Understood.
As I mentioned, this function is called in a Calculation View that is exposed as an OData service (which basically supports $top & $skip out of the box, but it seems that it doesn't push them down).
I'm accepting your answer as it provides a lot of insightful information.
Thanks for your help and time!