cancel
Showing results for 
Search instead for 
Did you mean: 

What's the best way to list those rows of a table that do not exist in another one?

VolkerBarth
Contributor
4,206

This is a late follow-up on this question:

Say, I have two tables T1 and T2 with identical schema and nearly identical contents in the same database. They have a primary key pk.

Now I want to list those rows from table T1 (i.e. select T1.*) that don't exist at all in T2 - i.e. T1's primary key doesn't exist in T2.

I can think of three four different ways to do so (and I'm gonna list them as answers...). Is there any preferred way among them w.r.t. to performance? Or is there another, better approach?

Or should I prefer the way that does seem most logical/comprehensible to me?

(Yes, Breck, of course I should...)

In my tests, the queries do use different plans and perform not identical well (and I'm using sa_flush_cache()) in between). But I would like if there's a general answer or at least a rule-of-thumb.

The result sets of approaches A-D should be identical.

FWIW, some test results with 12.0.1.3389 on a "slow" box:

A: 1750 - 1875 ms
B: 3312 - 3516 ms
C: 2203 - 2343 ms
😧 1328 - 1563 ms

So my current personal preference is A, both because of semantics and performance...

0 Kudos

I am looking forward to the answer here because I usually pick B which evidently performs worst.

VolkerBarth
Contributor
0 Kudos

Any hints to general performance hints from the Sybase query experts like Glenn or Ivan - or does it read "it depends"?

VolkerBarth
Contributor
0 Kudos

Anybody from the Alpha class listening?

Cf. Breck's blog

VolkerBarth
Contributor
0 Kudos

Yes, I'm still interested in more opinions...

From the current remarks, I would think that SQL language features like EXCEPT still seem rather unfamiliar, and that therefore one would prefer not to use them - from a practical point of view: Even if oneself does understand them, the next guy might not.

I do appreciate this opinion, but such kind of thinking might freeze the SQL usage to something like SQL/92 in the long run...

VolkerBarth
Contributor
0 Kudos

Glenn, Ivan, anybody - any hints would still be highly appreciated ... particularly as the public opinion seems to prefer those approaches that do perform worse in this case... (which is alright with me, of course!).

See the tag I've added:)

Accepted Solutions (0)

Answers (5)

Answers (5)

VolkerBarth
Contributor

Approach C: Use a left join and filter out the joined rows

select T1.* from T1
   left join T2 on T1.pk = T2.pk
where T2.pk is null
order by T1.pk

Personally, I think of this as a rather "tricky" (i.e. incomprehensible) approach.

Former Member
0 Kudos

I like this approach best.. it's the cleanest to my fuzzy brain, although I would probably typically write it as you did in Approach B because that's the way I'd incrementally think it through. The disadvantage with approach A is that the next guy that comes along to work in your code is not going to understand it.

Now, I'd like to think this is an ad-hoc query, or at least a daily or weekly report, rather than a real production frequently run query. In case this assumption is correct, the diference in processing time does't matter.

I read somewhere that computer code is first meant to be read by humans, not computers. I agree with that. your code needs to be intelligible to the next guy... the computer just doesn't care how eloquent you are.

VolkerBarth
Contributor
0 Kudos

@Ron: You're correct that this has come up while doing some ad-hoc queries. Nevertheless, it's a general question that I have beared for a long time, as the according query task used often as "building block" of more complex queries...

Former Member
0 Kudos

to me the way to go is an outer join with a where clause like C. The left join lists all matches in T.2 as well as all rows from T1. I would think this would be the fastest, but i would have to do some testing. it's also the least complicated which in my book most of the time is the way to go.

VolkerBarth
Contributor

Approach B: Use a NOT IN subquery

select * from T1
where pk not in
(select pk from T2)
order by pk

That might be the "classical" approach.

Former Member

I think, that the best approach depends on data statistics in both tables. I am personally prefer human readable, clear and easy to understand sql queries. Personally for me, approach C (leader) and approach B is much easy to read and understand. But,data statistics is very important.

VolkerBarth
Contributor
0 Kudos

Approach 😧 Use a JOIN with a derived table with EXCEPT

select T1.* from T1 inner join
  (select pk from T1
   except
   select pk from T2) DT
   on T1.pk = DT.pk
order by T1.pk

I thought this was just a rewrite of approach A but it performs different (and better in my tests).

VolkerBarth
Contributor
0 Kudos

See the notes on EXCEPT ALL vs. EXCEPT on approach A.

VolkerBarth
Contributor
0 Kudos

Approach A: Use an IN subquery with a table with EXCEPT

select * from T1
where pk in
(select pk from T1
 except
 select pk from T2)
order by pk

I guess by "subqery-flattening" this will be transformed internally to the following JOIN with a derived table with EXCEPT variant (which is less comprehensible IMHO).

CORRECTION: My assumption about this transformation seem (partly) wrong. Whereas calling PLAN() on both queries yields similar results, calling REWRITE() leads to different optimizations, and the actual query execution time is different, as well. Therefore I turn the JOIN version into a forth answer as approach D.

I guess the set operators EXCEPT and INTERSECT are still somewhat underestimated. Personally, I like to use them as they cope fine with SQL's set-based logic.

VolkerBarth
Contributor
0 Kudos

For performance reasons, one might use "EXCEPT ALL" instead of "EXCEPT" here - by definition the "select pk from T1" cannot contain duplicates (it simply selects the PRIMARY KEY), so the default "Duplicate elimination algorithm" use by EXCEPT would be unnecessary...