cancel
Showing results for 
Search instead for 
Did you mean: 

Remove duplicates in a tab delimited string?

2,756

I have a tab delimited nvarchar column with many values. The problem is that we sometimes have duplicates in that column. How can I update the column and remove all duplicates?

Today it look like this:

ID--Material--Information--
---------------------------
1--Test1------Engine 12b \\x09 Diam 32 \\x09 Rear left \\x09 Engine 12b
2--Test1------Engine 24b \\x09 Diam 32 \\x09 Rear right \\x09 Diam 32 \\x09 Rear right  
3--Test2------Engine 36b \\x09 Diam 32 \\x09 Rear left \\x09 Rear left

I wanna remove duplicates in column informations so it look like this:


ID--Material--Information--
---------------------------
1--Test1------Engine 12b \\x09 Diam 32 \\x09 Rear left
2--Test1------Engine 24b \\x09 Diam 32 \\x09 Rear right   
3--Test2------Engine 36b \\x09 Diam 32 \\x09 Rear left

Accepted Solutions (1)

Accepted Solutions (1)

VolkerBarth
Contributor

Here's a complete sample (FWIW, it's nice if you provide a simple test schema by yourself...):

create table Test
(
    ID int not null default autoincrement primary key,
    Material varchar(30) not null,
    Information varchar(1000) not null
);
insert Test values
   (1, 'Test1', 'Engine 12b \\t Diam 32 \\t Rear left \\t Engine 12b'),
   (2, 'Test1', 'Engine 24b \\t Diam 32 \\t Rear right \\t Diam 32 \\t Rear right'),  
   (3, 'Test2', 'Engine 36b \\t Diam 32 \\t Rear left \\t Rear left');

select * from Test;

Now use sa_split_list to build a query that turns the "blanks-and-tab-delimited" column into separate rows (while preserving the order) - note that cross apply is handy to use a column as parameter for a procedure:

select distinct id, line_num, row_value
from Test cross apply sa_split_list(Information, ' \\t ')
order by 1, 2;

-- will return
id,line_num,row_value
1,1,Engine 12b
1,2,Diam 32
1,3,Rear left
1,4,Engine 12b
2,1,Engine 24b
2,2,Diam 32
2,3,Rear right
2,4,Diam 32
2,5,Rear right
3,1,Engine 36b
3,2,Diam 32
3,3,Rear left
3,4,Rear left

Now, use that as a derived query (dt1) to remove duplicate values (and still preserving the order, i.e. for duplicate values, the first appearance is preserved):

select id, row_value, min(line_num)
from
   (select distinct id, line_num, row_value
    from Test cross apply sa_split_list(Information, ' \\t ')) dt
group by id, row_value
order by 1, 3;

-- will return 
id,row_value,min_line_num
1,Engine 12b,1
1,Diam 32,2
1,Rear left,3
2,Engine 24b,1
2,Diam 32,2
2,Rear right,3
3,Engine 36b,1
3,Diam 32,2
3,Rear left,3

Now, use that again as a derived query dt2 to re-build the list:

select id, list(row_value, ' \\t ' order by min_line_num) as Deduplicated_Information
from
   (select id, row_value, min(line_num) as min_line_num
    from
       (select distinct id, line_num, row_value
        from Test cross apply sa_split_list(Information, ' \\t ')) dt1
    group by id, row_value) dt2
group by id
order by 1;

-- will return
id,Deduplicated_Information
1,Engine 12b \\t Diam 32 \\t Rear left
2,Engine 24b \\t Diam 32 \\t Rear right
3,Engine 36b \\t Diam 32 \\t Rear left

Now, build a join with the original table to check the results:

select Test.*, Deduplicated_Information
from Test inner join
   (select id, list(row_value, ' \\t ' order by min_line_num) as Deduplicated_Information
    from
      (select id, row_value, min(line_num) as min_line_num
       from
          (select distinct id, line_num, row_value
           from Test cross apply sa_split_list(Information, ' \\t ')) dt1
       group by id, row_value) dt2
    group by id) dt3
   on Test.ID = dt3.id
order by 1;

-- will return
ID,Material,Information,Deduplicated_Information
1,Test1,Engine 12b \\t Diam 32 \\t Rear left \\t Engine 12b,Engine 12b \\t Diam 32 \\t Rear left
2,Test1,Engine 24b \\t Diam 32 \\t Rear right \\t Diam 32 \\t Rear right,Engine 24b \\t Diam 32 \\t Rear right
3,Test2,Engine 36b \\t Diam 32 \\t Rear left \\t Rear left,Engine 36b \\t Diam 32 \\t Rear left

and use that join to update the real table:

update Test
set Information = Deduplicated_Information
from Test inner join
   (select id, list(row_value, ' \\t ' order by min_line_num) as Deduplicated_Information
    from
      (select id, row_value, min(line_num) as min_line_num
       from
          (select distinct id, line_num, row_value
           from Test cross apply sa_split_list(Information, ' \\t ')) dt1
       group by id, row_value) dt2
    group by id) dt3
   on Test.ID = dt3.id;

select * from Test;

-- will return
ID,Material,Information
1,Test1,Engine 12b \\t Diam 32 \\t Rear left
2,Test1,Engine 24b \\t Diam 32 \\t Rear right
3,Test2,Engine 36b \\t Diam 32 \\t Rear left
Vlad
Product and Topic Expert
Product and Topic Expert
0 Kudos

Volker, I am curious. Could you please explain my, why you have decided to use CROSS APPLY? I tested a simpler request, and it works fine as well:

SELECT DISTINCT id, line_num, row_value FROM Test, sa_split_list(Information, ' \\t ') ORDER BY 1, 2;

VolkerBarth
Contributor
0 Kudos

For me (using 12.0.1.4301), your statement results in a -824 error ("Illegal reference to correlation name 'Test'")

AFAIK it's necessary to use an APPLY operator or a LATERAL expression if one wants to use a procedure in the FROM clause and has to supply column values from other tables as procedure parameters. Cf. a search for the "lateral" tag:)

Vlad
Product and Topic Expert
Product and Topic Expert
0 Kudos

Strange. I was testing my query on SA17. Even plans look the same (I used the feature found by Breck that compares plans). The only small difference is the Final plan build time.

VolkerBarth
Contributor
0 Kudos

Hm, that seems to be a difference between v12 and above - with the latest v16 build (16.0.0.2178) the comma operator is accepted as with v17.

FWIW, the following semantically equivalent expression with a CROSS JOIN is also rejected in v12 (with the same -824 error) and allowed in v16/v17:

select distinct id, line_num, row_value
from Test cross join sa_split_list(Information, ' \\t ')
order by 1, 2;
VolkerBarth
Contributor
0 Kudos

Could you please explain my, why you have decided to use CROSS APPLY?

After further investigation, I guess my answer is "because it think I understand that construction, and because it works with v12"🙂

Answers (1)

Answers (1)

Vlad
Product and Topic Expert
Product and Topic Expert

Before giving you this answer, I honestly stole it from this page. I do not want to sound rude, but this was my search request.

Ok, returning to the proposal. Here it is:

SELECT DISTINCT(row_value)
FROM sa_split_list('Engine 12b \\x09 Diam 32 \\x09 Rear left \\x09 Engine 12b', ' \\x09 ');

Result:

row_value
-----------
Engine 12b
Diam 32
Rear left

I think, this will be a quite incorrect to do everything on the SQL server. I would suggest you to change your software code so it will use SET collections instead of LIST, when you serialize tokens (or deserialize) to/from the database. Plus, this kind of normalization could be implemented once for the column, so you will never have such question anymore.

At least this is how I would solve this task.

VolkerBarth
Contributor

Just to add:

With LIST() you can turn the distinct rows into the single column "Information" again (though that might violate normalization rules, as Vlad has pointed out).

Based on that, you can certainly combine sa_split_list() and list() to update the columns containing duplicate tokens in one single update statement.

Vlad
Product and Topic Expert
Product and Topic Expert
0 Kudos

A good suggestion, I forgot to search for the function LIST. I am always thinking that sometimes it is better to fix everything in the application code, rather than to use SQL.