on 2010 Mar 29 11:19 PM
I have a table consisting of a record_id, email_address, and a handful of other attributes....
CREATE TABLE "mailing_list" (
"record_id" integer NOT NULL DEFAULT autoincrement,
"email_address" long varchar NULL,
"dupe_flag" char(1) NULL,
/* and a bunch more */
PRIMARY KEY ( "record_id" )
);
I have an index as follows.. CREATE INDEX "email_record_id" ON "mailing_list" ( "email_address" ASC, "record_id" ASC ) IN "indexes";
I have duplicate records in there -- more than one record per email address. I want to delete all duplicates, saving the one with the largest record ID. This works pretty well...
UPDATE mailing_list tableA join mailing_list tableB
ON tableA.email_address = TableB.email_address
AND TableA.record_id > TableB.record_id
set TableB.dupe_flag = 'y';
DELETE FROM mailing_list WHERE dupe_flag = 'Y';
I'm joining the table to itself on email address, and the record ID bit is what preserves the first record of a set of duplicates, so all the others get marked for deletion.
I'll go out on a limb and say it works perfectly if my table has 100,000 records. It runs in about 10-12 seconds.
But the problem is that if I run it over a larger set (500,000 records in my next step up), it apparently gets bogged down -- I kill it when we reach 700 seconds. I need to be able to run it over a couple million records.
I can quickly identify my dupes (and also the record_id that I want to keep from a set with duplicates) with this...
SELECT email_address, max(record_id), count(*) as dupe_count
FROM mailing_list
GROUP BY email_address
HAVING dupe_count > 1
But while this query identifies the keepers from a group of dupes, it does not identify the keepers where there are no dupes.
So... I need a efficient way to pull this together... probably using a GROUP BY instead of JOINing two big record sets.
Thanks!
p.s. Um, no, I'm not a spammer! This is for opt-in lists! People have opted in multiple times, and we don't want to send them multiple emails!
p.p.s.... wait... don't answer this! I just found the answer! I'll be right back.
Can you check the performance of the following statement ?
DELETE FROM mailing_list
WHERE record_id IS NOT IN (
/* derived table of just keepers */
SELECT max(record_id) as keeper
FROM mailing_list
GROUP BY email_address
)
This would avoid the update of records that need to stay. And to clear the flag later on.
Thomas
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Well, after writing all that up, and re-reading it, it souded a bit like my question of week or two ago...
And Mark Culp gave the answer there to this question too! Thanks Mark! You rock!
The answer is to use a derived table... and instead of trying to identify the records to delete, just identify the keepers, and delete everything else.
Update mailing_list ml
set dupe_flag = 'N'
from (
/* derived table of just keepers */
SELECT email_address, max(record_id) as keeper
FROM mailing_list
GROUP BY email_address ) dt
where ml.record_id = dt.keeper ;
DELETE FROM mailing_list WHERE dupe_flag is null;
That does it, and it processes my 500,000 test set in 50 seconds. I haven't scaled bigger than that yet.
Update: I just ran this against 1.27 million records in 141 seconds! It seems to scale up nicely!
Ron
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
If you run this type of update more than once, and new records can be added between runs, then you will need to reset your dupe_flag to null before running your update ... set dupe_flag = 'N'. ... because obviously a record that use to be max(record_id) in the last run may no longer be the max when run again.
User | Count |
---|---|
60 | |
10 | |
7 | |
7 | |
6 | |
6 | |
5 | |
5 | |
5 | |
5 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.