cancel
Showing results for 
Search instead for 
Did you mean: 

Performance issue with mix of left outer and inner joins

1,189

Hi All,

I am working on a performance issue involving a select query with a mix of left outer join and inner joins.

SQLA configuration:

OS: Windows server 2012 R2 standard
Page size: 8K
SQL Anywhere version: 12.0.1.4124
Edition: workgroup
Cache size: 80 GB

Below is the query where I have replaced the original table names with alternate names as I am not allowed to share the original query:

Query:

SELECT ltbl2.col1,
       ltbl1.*,
       ltbl3.col1,
       ltbl3.col2,
       view1.col1,
       view1.col2,
       view1.col3
FROM   ltbl1 left outer join view1 on ltbl1.key1 =  view1.key, 
       ltbl2, 
       ltbl3
WHERE (ltbl1.key2 = ltbl3.key ) AND
      ( ltbl1.key3 = ltbl2.key ) AND
      ( ltbl3.col1 not in ( 1,6 ) ) AND
      ( ltbl1.date_column is not null ) AND
      ( ltbl1.key2 = 600260151039) AND
      ( convert(varchar(1), ltbl1.col1) <> '5')

Row counts:

ltbl1 - 13989140
ltbl2 - 329159
ltbl3 - 607

Issue Description:

Scenario 1: When we include only the first left outer join between table - ltbl1 and view - view1, the optimizer estimate - "RowsReturned" is approximately 80% correct (estimate 7054, actual 8936) for table - ltbl1.

Scenario 2: If we add the second table - ltbl2 with an inner join into the first left outer join, the estimate of the - "RowsReturned" deteriorate very quickly to almost 0% (estimate 63, actual 8936).

Scenario 3: If we add the third table - ltbl3 with an inner join into the previous mix of left outer join and inner join the estimate - RowsReturned further deteriorates to 34 vs 8936.

When I view the query plan in all the three scenarios, I can see that the optimizer chooses to use an index scan. The index happens to be the one created for the foreign key constraint between ltbl1 and ltbl2. The index statistics section of all three plans shows the estimate and actual for RowsReturned are same.

The post scan predicates and scan predicates section also shows column and join statistics being used and they evaluate to true 100%.

I can see that the inner join is messing up the estimate but unable to answer why the optimizer estimate for the largest table - ltbl1 continues downhill when inner joins are added.

NOTE: If we change the inner joins for the remaining two table(s) ltbl2 and ltbl3 to a left outer join the optimizer estimates for ltbl1 RowsReturned is back to normal (estimate 7054, actual 8936).

Could someone please help me understand what is happening when we add inner joins into a left outer join and the effect of it on the optimizer estimates?

Many thanks.

VolkerBarth
Contributor

I'd HIGHLY recommend to use the INNER JOIN syntax for inner joins instead of the very old-fashioned combination of cross-joining tables via comma-separated table lists in the FROM clause and join conditions in the WHERE clause.

It would immediately make your statement much more comprehensible.

Without the comma operator, it's also much easier to use brackets to group joins.

0 Kudos

Thanks Volker for the recommendation. I modified the query to use inner joins instead of the comma notation but the query plan still shows very poor estimate for RowsReturned for ltbl1 (34 vs 8936).

When I change all the joins in the above query to left outer joins, the query plan makes use of left outer nested loop joins and the join order is also different - ltbl1, ltbl3, ltbl2.

For the original query provided above with the commas replaced by inner joins, I see that the optimizer chooses to use nested loop joins and the join order is ltbl2, ltbl1, ltbl3. The table ltbl2 is about a half million and my understanding is that for each row in ltbl2 the largest table ltbl1 is parsed once which is worse.

Please help me understand how the optimizer makes use of statistics when joining table(s) and how it changes for different types of joins. The documentation mentions join statistics will be used which is a combination of the column statistics of the columns used in the join clause but doesn't explain exactly how it is used.

Is mixing left outer join with inner join a bad practice or is there a rule of thumb that should be followed that aids optimizer in making right decisions?

Thanks.

VolkerBarth
Contributor

Is mixing left outer join with inner join a bad practice or is there a rule of thumb that should be followed that aids optimizer in making right decisions?

Mixing inner and left joins is surely very common practise. Personally, I mostly prefer to "start" the FROM clause with inner joins and then add left joins.

For further advice, it would be very helpful to show us HOW excalty your query looks like with inner join syntax and whether you use brackets or not. The statement "When I change all the joins in the above query to left outer joins" is too vague.

0 Kudos

Thanks Volker for your feedback.

The below query using only left outer joins performs well and the optimizer estimate for RowsReturned too is approximately 80% correct(estimate 7054, actual 8936) for ltbl1:

SELECT ltbl2.col1,
       ltbl1.*,
       ltbl3.col1,
       ltbl3.col2,
       view1.col1,
       view1.col2,
       view1.col3
FROM   ltbl1 left outer join view1 on ltbl1.key1 =  view1.key
       left outer join ltbl3 on ltbl3.key = ltbl1.key2
       left outer join ltbl2 on ltbl2.key = ltbl1.key3
WHERE ( ltbl3.col1 not in ( 1,6 ) ) AND
      ( ltbl1.date_column is not null ) AND
      ( ltbl1.key2 = 600260151039) AND
      ( convert(varchar(1), ltbl1.col1) <> '5')

I tried modifying the original query to use inner joins instead of commas and the optimizer estimates remain poor(estimate 34, actual 8936):

SELECT ltbl2.col1,
       ltbl1.*,
       ltbl3.col1,
       ltbl3.col2,
       view1.col1,
       view1.col2,
       view1.col3
FROM   ltbl1 inner join ltbl3 on ltbl3.key = ltbl1.key2
       inner join ltbl2 on ltbl2.key = ltbl1.key3
       left outer join view1 on ltbl1.key1 =  view1.key
WHERE ( ltbl3.col1 not in ( 1,6 ) ) AND
      ( ltbl1.date_column is not null ) AND
      ( ltbl1.key2 = 600260151039) AND
      ( convert(varchar(1), ltbl1.col1) <> '5')

If I remove the ltbl2 from the above query, the estimates turn out to be better. I am not exactly sure how adding ltbl2 with an inner join is creating wrong/incorrect estimates for ltbl1.

Please let me know if more details are required.

Thanks.

VolkerBarth
Contributor

As there is only one table to be outer-joined (view1) and there are no further joins on that, IMHO you could as well use

FROM   ltbl1
       left  join view1 on ltbl1.key1 = view1.key
       inner join ltbl3 on ltbl1.key2 = ltbl3.key
       inner join ltbl2 on ltbl1.key3 = ltbl2.key 

Unless you use brackets, joins are evaluated from left to right.

In the end, are the estimates a problem or the actual query performance?

In case the optimizer chooses poor plans continuously, as a last resort, you might also add an INDEX HINT to force the use of a particular key/index.

Accepted Solutions (0)

Answers (1)

Answers (1)

awitter
Participant
0 Kudos

The where clausule has this: ltbl3.col1 not in ( 1,6 )

This means it canot use the index optimally (at least not as expected) (can be rewritten as ltbl3.col1 <> 1 and ltbl3.col1 not in ( 1,6 ) <> 6 )

Maybe start cleaning up your query: SELECT ltbl2.col1, ltbl1.*, ltbl3.col1, ltbl3.col2, view1.col1, view1.col2, view1.col3 FROM ltbl1 inner join ltbl2 on ltbl2.key = ltbl1.key3 inner join ltbl3 on ltbl3.key = 600260151039 and ltbl3.col1 not in ( 1,6 ) left outer join view1 on view1.key = ltbl1.key1 WHERE ( ltbl1.date_column is not null ) AND ( ltbl1.key2 = 600260151039) AND ( convert(varchar(1), ltbl1.col1) <> '5')

Now you see that the nner join to ltbl3 really isn't an inner join, but just an added table with parameters. so I still wonder if the joins as stated are correct (key2 and key 3 interchanged?):

FROM   ltbl1 
       inner join ltbl2 on ltbl2.key = ltbl1.key3
       inner join ltbl3 on ltbl3.key = ltbl1.key2 and ltbl3.col1 not in ( 1,6 )
       left outer join view1 on view1.key = ltbl1.key1

Also, have you tried to reset the statistics for these tables? After a few inserts and deletes it might be that the statistics are out-of-line. look for statement Reset statistics...