cancel
Showing results for 
Search instead for 
Did you mean: 

Finding those records in a child table for which each parent has entries with differing values

VolkerBarth
Contributor
5,393

Say, I have a child table (1:n) and want to find out which (if any) parent has entries with different values in the child table.

Sounds unclear, methinks, so I try to be more concrete:

For example, in the SA 12 sample database, I want to find out whether any Customer has entries in the SalesOrders table with different regions. So, obviously, a GROUP BY is needed, and CustomerID is the parent id.

But how do I "count" the different region entries? - Since there are different entries per CustomerID, a simple count(*) won't do.

Classically, I have used min/max-comparisons for that, e.g.

select CustomerID, min(Region), max(Region) from SalesOrders
group by CustomerID
having min(Region) <> max(Region)
order by 1

But a count(distinct) query might seem more adequate, such as

select CustomerID, count(distinct Region) from SalesOrders
group by CustomerID
having count(distinct Region) > 1
order by 1

However, the latter seems (a bit) worse in performance, possibly as it requires a duplicate elimination. (I respect that it does offer more information, as it really counts the different entries whereas the first query just returns those with at least 2 different entries.)

Is there any preferred version for queries like that (both for reasons of comprehensibility and performance)?

Accepted Solutions (0)

Answers (2)

Answers (2)

Breck_Carter
Participant

It's hard (impossible?) to tell for sure with such a small table, but my first guess that query 2 is not only simpler but faster seems to be born out by the Version 12.0.0.2566 runtimes and the plans after calling sa_flush_cache().

Here's the query I might have written; its plan is more complex than query 2 but the runtime looks the same (BUT AGAIN, the table is too small to tell anything)...

CALL sa_flush_cache();

SELECT CustomerID,
       COUNT(*) AS RegionCount
  FROM ( SELECT DISTINCT 
                CustomerID,
                Region
           FROM SalesOrders ) AS CustomerRegion
 GROUP BY CustomerID
HAVING RegionCount > 1
 ORDER BY 1;
Breck_Carter
Participant
0 Kudos

PS the derived table is there, not for performance, but to help me divide-and-conquer the problem. After writing that, I might have turned it into your query 2 if that turned out to be faster... but I'm not smart enough to come up with query 2 from scratch 🙂

VolkerBarth
Contributor
0 Kudos

Interestingly enough, I'm frequently using the "SELECT col1, count() FROM T GROUP BY col HAVING count() > 1" kind of queries. As such, my query 2 seemed not that complicated. I'm having (a bit) more trouble with the derived query approach...

Breck_Carter
Participant

Your query 2 uses COUNT ( DISTINCT column ) together with GROUP BY different-column which is considerably more sophisticated (IMO) than a simple SELECT column, COUNT() GROUP BY same-column... which I, too, use all time. Derived tables are great because they let you solve part of the problem first, and test the solution separately: "Now I have all the distinct CustomerID - Region pairs in a derived table, now I can write an ordinary simple SELECT column, COUNT() GROUP BY same-column just like I (and you) frequently do.

Breck_Carter
Participant
0 Kudos

Having said that, in a million-row table query 2 may run faster... or, maybe, the plans will look completely different with more rows. The bottom line (the answer to your original question) is that there is NOTHING WRONG with query 2, at least in the tiny database case.

Breck_Carter
Participant

FWIW I have babbled about "divide and conquer" on a few occasions, in this blog post and the others it links to: http://sqlanywhere.blogspot.com/2008/10/divide-and-conquer-by-working-backwards.html

VolkerBarth
Contributor
0 Kudos

@Breck: I generally prefer the divide-and-conquer-aprroach, too, and derived tables are very useful here - so I fully agree. - Having said that, it's obviously a question of one's own usage of SQL to tell whether a particular solution is more or less comprehensible. That may be a good or bad situation:)

0 Kudos

This is a nasty hack I've seen used before, I can't remember using it though.

select CustomerID,Total from  
(select CustomerID,sum(cnt) as Total from 
(select distinct CustomerID,Region,1 as cnt from salesorders) as foo
group by CustomerID) as boo where Total > 1

Hmmm after reading the question again, I'm not sure this meets either comprehensibility or performance 🙂