on 2015 Oct 12 5:51 PM
I have a tab delimited nvarchar column with many values. The problem is that we sometimes have duplicates in that column. How can I update the column and remove all duplicates?
Today it look like this:
ID--Material--Information--
---------------------------
1--Test1------Engine 12b \\x09 Diam 32 \\x09 Rear left \\x09 Engine 12b
2--Test1------Engine 24b \\x09 Diam 32 \\x09 Rear right \\x09 Diam 32 \\x09 Rear right
3--Test2------Engine 36b \\x09 Diam 32 \\x09 Rear left \\x09 Rear left
I wanna remove duplicates in column informations so it look like this:
ID--Material--Information--
---------------------------
1--Test1------Engine 12b \\x09 Diam 32 \\x09 Rear left
2--Test1------Engine 24b \\x09 Diam 32 \\x09 Rear right
3--Test2------Engine 36b \\x09 Diam 32 \\x09 Rear left
Here's a complete sample (FWIW, it's nice if you provide a simple test schema by yourself...):
create table Test ( ID int not null default autoincrement primary key, Material varchar(30) not null, Information varchar(1000) not null ); insert Test values (1, 'Test1', 'Engine 12b \\t Diam 32 \\t Rear left \\t Engine 12b'), (2, 'Test1', 'Engine 24b \\t Diam 32 \\t Rear right \\t Diam 32 \\t Rear right'), (3, 'Test2', 'Engine 36b \\t Diam 32 \\t Rear left \\t Rear left'); select * from Test;
Now use sa_split_list to build a query that turns the "blanks-and-tab-delimited" column into separate rows (while preserving the order) - note that cross apply is handy to use a column as parameter for a procedure:
select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \\t ') order by 1, 2; -- will return id,line_num,row_value 1,1,Engine 12b 1,2,Diam 32 1,3,Rear left 1,4,Engine 12b 2,1,Engine 24b 2,2,Diam 32 2,3,Rear right 2,4,Diam 32 2,5,Rear right 3,1,Engine 36b 3,2,Diam 32 3,3,Rear left 3,4,Rear left
Now, use that as a derived query (dt1) to remove duplicate values (and still preserving the order, i.e. for duplicate values, the first appearance is preserved):
select id, row_value, min(line_num) from (select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \\t ')) dt group by id, row_value order by 1, 3; -- will return id,row_value,min_line_num 1,Engine 12b,1 1,Diam 32,2 1,Rear left,3 2,Engine 24b,1 2,Diam 32,2 2,Rear right,3 3,Engine 36b,1 3,Diam 32,2 3,Rear left,3
Now, use that again as a derived query dt2 to re-build the list:
select id, list(row_value, ' \\t ' order by min_line_num) as Deduplicated_Information from (select id, row_value, min(line_num) as min_line_num from (select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \\t ')) dt1 group by id, row_value) dt2 group by id order by 1; -- will return id,Deduplicated_Information 1,Engine 12b \\t Diam 32 \\t Rear left 2,Engine 24b \\t Diam 32 \\t Rear right 3,Engine 36b \\t Diam 32 \\t Rear left
Now, build a join with the original table to check the results:
select Test.*, Deduplicated_Information from Test inner join (select id, list(row_value, ' \\t ' order by min_line_num) as Deduplicated_Information from (select id, row_value, min(line_num) as min_line_num from (select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \\t ')) dt1 group by id, row_value) dt2 group by id) dt3 on Test.ID = dt3.id order by 1; -- will return ID,Material,Information,Deduplicated_Information 1,Test1,Engine 12b \\t Diam 32 \\t Rear left \\t Engine 12b,Engine 12b \\t Diam 32 \\t Rear left 2,Test1,Engine 24b \\t Diam 32 \\t Rear right \\t Diam 32 \\t Rear right,Engine 24b \\t Diam 32 \\t Rear right 3,Test2,Engine 36b \\t Diam 32 \\t Rear left \\t Rear left,Engine 36b \\t Diam 32 \\t Rear left
and use that join to update the real table:
update Test set Information = Deduplicated_Information from Test inner join (select id, list(row_value, ' \\t ' order by min_line_num) as Deduplicated_Information from (select id, row_value, min(line_num) as min_line_num from (select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \\t ')) dt1 group by id, row_value) dt2 group by id) dt3 on Test.ID = dt3.id; select * from Test; -- will return ID,Material,Information 1,Test1,Engine 12b \\t Diam 32 \\t Rear left 2,Test1,Engine 24b \\t Diam 32 \\t Rear right 3,Test2,Engine 36b \\t Diam 32 \\t Rear left
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
For me (using 12.0.1.4301), your statement results in a -824 error ("Illegal reference to correlation name 'Test'")
AFAIK it's necessary to use an APPLY operator or a LATERAL expression if one wants to use a procedure in the FROM clause and has to supply column values from other tables as procedure parameters. Cf. a search for the "lateral" tag:)
Hm, that seems to be a difference between v12 and above - with the latest v16 build (16.0.0.2178) the comma operator is accepted as with v17.
FWIW, the following semantically equivalent expression with a CROSS JOIN is also rejected in v12 (with the same -824 error) and allowed in v16/v17:
select distinct id, line_num, row_value from Test cross join sa_split_list(Information, ' \\t ') order by 1, 2;
Before giving you this answer, I honestly stole it from this page. I do not want to sound rude, but this was my search request.
Ok, returning to the proposal. Here it is:
SELECT DISTINCT(row_value)
FROM sa_split_list('Engine 12b \\x09 Diam 32 \\x09 Rear left \\x09 Engine 12b', ' \\x09 ');
Result:
row_value
-----------
Engine 12b
Diam 32
Rear left
I think, this will be a quite incorrect to do everything on the SQL server. I would suggest you to change your software code so it will use SET collections instead of LIST, when you serialize tokens (or deserialize) to/from the database. Plus, this kind of normalization could be implemented once for the column, so you will never have such question anymore.
At least this is how I would solve this task.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Just to add:
With LIST() you can turn the distinct rows into the single column "Information" again (though that might violate normalization rules, as Vlad has pointed out).
Based on that, you can certainly combine sa_split_list() and list() to update the columns containing duplicate tokens in one single update statement.
User | Count |
---|---|
67 | |
10 | |
10 | |
10 | |
10 | |
8 | |
8 | |
7 | |
5 | |
5 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.