Sharing the information you need to get development done!
Welcome to DevAuthority.Com Sign in | Join | Help
in Search

be-#

Using Microsoft Transact-SQL Correlated Query to Clean Up Duplicate Records

This is something I ran across as part of my job.  The query at first seems to work by magic, but some thought into what it must be doing helped.  I share my conclusions here.  I looked up some Microsoft documentation on it, which was tantalizing and confusing at the same time, as is typical.

Here's the Microsoft documentation:  http://www.databasejournal.com/features/mssql/article.php/3485291

Here's the science behind the magic.

Let's talk about this query:

select * from credential c

where exists (select null from credential c2

where c.key = c2.key and c.id > c2.id

What this query does is that given the case where a set of records have a certain key, this query will return all records from that set EXCEPT for the one with the lowest id.

Where this is particularly powerful is in a case like this:

delete c from credential c

where exists (select null from credential c2

where c.key = c2.key and c.id > c2.id

That is, for a given set of records with a certain key, this query will DELETE all records from that set EXCEPT the one with the lowest id (presumably the oldest one).  VERY useful when cleaning up data corruption having to do with duplicate records!

What's it doing?

Note the where clause.  First, it ensures that the 'key' field matches.  Otherwise, the query is doing a cross-product of the credential table with itself.  It then asks the question, Is the id from table 1 greater than table 2?  If it is, it returns (deletes) that record.  Otherwise, it does not.

Let's look at that with some data.

Key Id
100 10
100 20
100 30
200 40
200 50

Below is the cross-product it does, and the answer to its question, by row:

Key0 Id0 Key1 Id1 Id0 > Id1?
100 10 100 10 No
100 10 100 20 No
100 10 100 30 No
100 20 100 10 Yes
100 20 100 20 No
100 20 100 30 No
100 30 100 10 Yes
100 30 100 20 Yes
100 30 100 30 No
200 40 200 40 No
200 40 200 50 No
200 50 200 40 Yes
200 50 200 50 No

...which means that, for the delete query, it will remove records as follows.  The Id=30 is repeated and will be "deleted" twice.

Key Id
100 20
100 30
200 50

This leaves, of course, what we wanted, which was to have a single record for each Key, and the Id that is left over is the lowest one for each Key:

Key Id
100 10
200 40

So, not really magic after all.  And very powerful.  I don't know about you, but I love to know what is going on behind the scenes -- that's what makes the creativity and the fun possible!

Published Tuesday, December 11, 2007 2:18 PM by ebuatois

Comments

 

ebuatois said:

Ok, here's one of the benefits of actually thinking about what something is doing.  There's a simpler way to write the above queries.  The whole "exists" thing above isn't needed.  The below queries will behave the same way.

select * from credential c, credential c2
where c.[key]=c2.[key] and c.id > c2.id

delete c from credential c, credential c2
where c.[key]=c2.[key] and c.id > c2.id

I realized that it was the ">" symbol that was doing the "magic."  And then I thought about how the ">" join operator for the id field was used relative to how the far more familiar "=" join operator was used for the key field.  When you think about how the behavior changes depending on whether you use the "=" operator vs. the ">" operator in a join, it becomes obvious that this is the only way it COULD work.  There's no fancy database engine processing going on here, no under-the-covers cross-product hidden temporary table -- just the cranking out of comparisons between one value and another.  That's it.

This gives me a deeper appreciation of the elegant simplicity and power of SQL queries.
December 12, 2007 10:31 AM
Anonymous comments are disabled







This Blog

Post Calendar

<December 2007>
SuMoTuWeThFrSa
2526272829301
2345678
9101112131415
16171819202122
23242526272829
303112345

Syndication



Powered by Community Server, by Telligent Systems