
How do I de-dupe a large record set?

Former Member

I have a table consisting of a record_id, email_address, and a handful of other attributes....

CREATE TABLE "mailing_list" (
"record_id" integer NOT NULL DEFAULT autoincrement,
"email_address" long varchar NULL,
"dupe_flag" char(1) NULL,
/* and a bunch more */
PRIMARY KEY ( "record_id" )
);

I have an index as follows...

CREATE INDEX "email_record_id" ON "mailing_list" ( "email_address" ASC, "record_id" ASC ) IN "indexes";

I have duplicate records in there -- more than one record per email address. I want to delete all duplicates, saving the one with the largest record ID. This works pretty well...

UPDATE mailing_list tableA
  JOIN mailing_list tableB
    ON tableA.email_address = tableB.email_address
   AND tableA.record_id > tableB.record_id
   SET tableB.dupe_flag = 'Y';

DELETE FROM mailing_list WHERE dupe_flag = 'Y';

I'm joining the table to itself on email address, and the record ID comparison is what preserves the record with the largest record_id in a set of duplicates, so all the others get marked for deletion.
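
For illustration (the addresses and IDs here are made up), this is roughly which rows the join ends up flagging on a tiny sample:

/* Illustrative only: suppose the table holds
      (1, 'a@example.com'), (2, 'a@example.com'), (3, 'b@example.com').
   The join pairs tableA = record 2 with tableB = record 1 (same address,
   2 > 1), so record 1 is flagged. Records 2 and 3 never appear on the
   tableB side, so they keep dupe_flag = NULL and survive the DELETE. */
SELECT tableB.record_id AS would_be_flagged
  FROM mailing_list tableA
  JOIN mailing_list tableB
    ON tableA.email_address = tableB.email_address
   AND tableA.record_id > tableB.record_id;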

I'll go out on a limb and say it works perfectly if my table has 100,000 records. It runs in about 10-12 seconds.

But the problem is that if I run it over a larger set (500,000 records in my next step up), it apparently gets bogged down -- I kill it when we reach 700 seconds. I need to be able to run it over a couple million records.

I can quickly identify my dupes (and also the record_id that I want to keep from a set with duplicates) with this...

 SELECT  email_address, max(record_id), count(*) as dupe_count  
 FROM mailing_list 
 GROUP BY email_address 
 HAVING dupe_count > 1 

But while this query identifies the keepers from a group of dupes, it does not identify the keepers where there are no dupes.
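
If I drop the HAVING clause, though, I do get exactly one keeper per email address, dupes or not -- something like this:

 /* One row per email address; the keeper is the largest record_id,
    whether or not that address has duplicates. */
 SELECT email_address, max(record_id) AS keeper
 FROM mailing_list
 GROUP BY email_address;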

So... I need an efficient way to pull this together... probably using a GROUP BY instead of JOINing two big record sets.

Thanks!

p.s. Um, no, I'm not a spammer! This is for opt-in lists! People have opted in multiple times, and we don't want to send them multiple emails!

p.p.s.... wait... don't answer this! I just found the answer! I'll be right back.

Accepted Solutions (0)

Answers (2)


thomas_duemesnil
Participant

Can you check the performance of the following statement?

DELETE FROM mailing_list
WHERE record_id NOT IN (
       /* derived table of just keepers */
       SELECT max(record_id) AS keeper
         FROM mailing_list
        GROUP BY email_address
);

This would avoid updating the records that need to stay, and the need to clear the flag later on.

Thomas

Former Member

Thanks T! In my tests, that cuts the deduping time by about 60% on the smaller record sets (100,000 records). It looks like the improvement is even bigger on large sets.

VolkerBarth
Contributor

@Ron: Besides the performance improvement, I would prefer Thomas's suggestion as it seems more straightforward IMHO: directly deleting what is unnecessary instead of marking what has to remain and then deleting just the rest...

reimer_pods
Participant

Keep it as simple as possible, that's the idea. And you get better readability as a side effect.

Former Member

Well, after writing all that up, and re-reading it, it sounded a bit like my question from a week or two ago...

http://sqlanywhere-forum.sap.com/questions/472/how-can-i-use-a-subquery-to-limit-my-where-clause-in-...

And Mark Culp gave the answer there to this question too! Thanks Mark! You rock!

The answer is to use a derived table... and instead of trying to identify the records to delete, just identify the keepers, and delete everything else.

UPDATE mailing_list ml
   SET dupe_flag = 'N'
  FROM (
        /* derived table of just keepers */
        SELECT email_address, max(record_id) AS keeper
          FROM mailing_list
         GROUP BY email_address ) dt
 WHERE ml.record_id = dt.keeper;

DELETE FROM mailing_list WHERE dupe_flag IS NULL;

That does it, and it processes my 500,000 test set in 50 seconds. I haven't scaled bigger than that yet.

Update: I just ran this against 1.27 million records in 141 seconds! It seems to scale up nicely!

Ron

MarkCulp
Participant

If you run this type of update more than once, and new records can be added between runs, then you will need to reset your dupe_flag to NULL before running your update ... set dupe_flag = 'N' ... because a record that used to be the max(record_id) in the last run may no longer be the max when it is run again.
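
Something like this before each run would do it (a minimal sketch, assuming the column should simply go back to NULL):

/* Reset the flag so a record that was the keeper last run, but has since
   gained a duplicate with a higher record_id, is no longer marked 'N'. */
UPDATE mailing_list SET dupe_flag = NULL;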