on 2010 Oct 14 1:12 PM
Say, I have a child table (1:n) and want to find out which (if any) parent has entries with different values in the child table.
Sounds unclear, methinks, so I try to be more concrete:
For example, in the SA 12 sample database, I want to find out whether any Customer has entries in the SalesOrders table with different regions. So, obviously, a GROUP BY is needed, and CustomerID is the parent id.
But how do I "count" the different region entries? - Since there are different entries per CustomerID, a simple count(*) won't do.
Classically, I have used min/max-comparisons for that, e.g.
select CustomerID, min(Region), max(Region) from SalesOrders
group by CustomerID
having min(Region) <> max(Region)
order by 1
But a count(distinct) query might seem more adequate, such as
select CustomerID, count(distinct Region) from SalesOrders
group by CustomerID
having count(distinct Region) > 1
order by 1
However, the latter seems (a bit) worse in performance, possibly as it requires a duplicate elimination. (I respect that it does offer more information, as it really counts the different entries whereas the first query just returns those with at least 2 different entries.)
Is there any preferred version for queries like that (both for reasons of comprehensibility and performance)?
Request clarification before answering.
It's hard (impossible?) to tell for sure with such a small table, but my first guess that query 2 is not only simpler but faster seems to be born out by the Version 12.0.0.2566 runtimes and the plans after calling sa_flush_cache().
Here's the query I might have written; its plan is more complex than query 2 but the runtime looks the same (BUT AGAIN, the table is too small to tell anything)...
CALL sa_flush_cache(); SELECT CustomerID, COUNT(*) AS RegionCount FROM ( SELECT DISTINCT CustomerID, Region FROM SalesOrders ) AS CustomerRegion GROUP BY CustomerID HAVING RegionCount > 1 ORDER BY 1;
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Your query 2 uses COUNT ( DISTINCT column ) together with GROUP BY different-column which is considerably more sophisticated (IMO) than a simple SELECT column, COUNT() GROUP BY same-column... which I, too, use all time. Derived tables are great because they let you solve part of the problem first, and test the solution separately: "Now I have all the distinct CustomerID - Region pairs in a derived table, now I can write an ordinary simple SELECT column, COUNT() GROUP BY same-column just like I (and you) frequently do.
FWIW I have babbled about "divide and conquer" on a few occasions, in this blog post and the others it links to: http://sqlanywhere.blogspot.com/2008/10/divide-and-conquer-by-working-backwards.html
@Breck: I generally prefer the divide-and-conquer-aprroach, too, and derived tables are very useful here - so I fully agree. - Having said that, it's obviously a question of one's own usage of SQL to tell whether a particular solution is more or less comprehensible. That may be a good or bad situation:)
This is a nasty hack I've seen used before, I can't remember using it though.
select CustomerID,Total from
(select CustomerID,sum(cnt) as Total from
(select distinct CustomerID,Region,1 as cnt from salesorders) as foo
group by CustomerID) as boo where Total > 1
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hmmm after reading the question again, I'm not sure this meets either comprehensibility or performance 🙂
User | Count |
---|---|
60 | |
8 | |
7 | |
6 | |
6 | |
5 | |
5 | |
4 | |
4 | |
4 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.