on 2011 Aug 26 2:18 PM
This is a late follow-up on this question:
Say, I have two tables T1 and T2 with identical schema and nearly identical contents in the same database. They have a primary key pk.
Now I want to list those rows from table T1 (i.e. select T1.*) that don't exist at all in T2 - i.e. T1's primary key doesn't exist in T2.
I can think of three four different ways to do so (and I'm gonna list them as answers...). Is there any preferred way among them w.r.t. to performance? Or is there another, better approach?
Or should I prefer the way that does seem most logical/comprehensible to me?
(Yes, Breck, of course I should...)
In my tests, the queries do use different plans and perform not identical well (and I'm using sa_flush_cache()) in between). But I would like if there's a general answer or at least a rule-of-thumb.
The result sets of approaches A-D should be identical.
FWIW, some test results with 12.0.1.3389 on a "slow" box:
A: 1750 - 1875 ms
B: 3312 - 3516 ms
C: 2203 - 2343 ms
😧 1328 - 1563 ms
So my current personal preference is A, both because of semantics and performance...
Approach C: Use a left join and filter out the joined rows
select T1.* from T1 left join T2 on T1.pk = T2.pk where T2.pk is null order by T1.pk
Personally, I think of this as a rather "tricky" (i.e. incomprehensible) approach.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
I like this approach best.. it's the cleanest to my fuzzy brain, although I would probably typically write it as you did in Approach B because that's the way I'd incrementally think it through. The disadvantage with approach A is that the next guy that comes along to work in your code is not going to understand it.
Now, I'd like to think this is an ad-hoc query, or at least a daily or weekly report, rather than a real production frequently run query. In case this assumption is correct, the diference in processing time does't matter.
I read somewhere that computer code is first meant to be read by humans, not computers. I agree with that. your code needs to be intelligible to the next guy... the computer just doesn't care how eloquent you are.
@Ron: You're correct that this has come up while doing some ad-hoc queries. Nevertheless, it's a general question that I have beared for a long time, as the according query task used often as "building block" of more complex queries...
to me the way to go is an outer join with a where clause like C. The left join lists all matches in T.2 as well as all rows from T1. I would think this would be the fastest, but i would have to do some testing. it's also the least complicated which in my book most of the time is the way to go.
Approach B: Use a NOT IN subquery
select * from T1 where pk not in (select pk from T2) order by pk
That might be the "classical" approach.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
I think, that the best approach depends on data statistics in both tables. I am personally prefer human readable, clear and easy to understand sql queries. Personally for me, approach C (leader) and approach B is much easy to read and understand. But,data statistics is very important.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Approach 😧 Use a JOIN with a derived table with EXCEPT
select T1.* from T1 inner join (select pk from T1 except select pk from T2) DT on T1.pk = DT.pk order by T1.pk
I thought this was just a rewrite of approach A but it performs different (and better in my tests).
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Approach A: Use an IN subquery with a table with EXCEPT
select * from T1 where pk in (select pk from T1 except select pk from T2) order by pk
I guess by "subqery-flattening" this will be transformed internally to the following JOIN with a derived table with EXCEPT variant (which is less comprehensible IMHO).
CORRECTION: My assumption about this transformation seem (partly) wrong. Whereas calling PLAN() on both queries yields similar results, calling REWRITE() leads to different optimizations, and the actual query execution time is different, as well. Therefore I turn the JOIN version into a forth answer as approach D.
I guess the set operators EXCEPT and INTERSECT are still somewhat underestimated. Personally, I like to use them as they cope fine with SQL's set-based logic.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
75 | |
10 | |
10 | |
10 | |
10 | |
9 | |
8 | |
7 | |
5 | |
5 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.