Friday, July 27, 2012

Saucy Jacky - Removing Dupes (Duplicates) from a SQL result (record set) in a Query or Stored Procedure

Hack the Gibson

In my profession (financial) keeping the cancelled orders are just as important as keeping the actual orders that are fulfilled.  However, in your applications you might not necessarily wish to display this information.  For example, in a report that shows month end reports of actual sales.  There are many ways to gut (Saucy Jacky) the dupes from the return set at the SQL level without having to run complicated and time consuming .Net or Linq code.

So take for instance we have an orders database, that has order_archive as a table and we’ll want the order_id, invoice_id, customer_name, customer_id, and for S&Gs we’ll throw in shipping_address.  So in simple terms our query will look like -

Select
order_id
, invoice_id
, customer_name
, customer_id
, shipping_address
From
orders..order_archive


This is fairly simple enough.  However, in the schema an order_id should appear only once unless it has been cancelled (thus it will have a buy and a cancel).  So how do we get rid of the dupes?  Well first we’ll need to employ a temp table -



Create
Table
#temp_results
(
[order_id] Int
, [invoice_id] Int
, [customer_name] Varchar(250)
,
[customer_id] Int
, [shipping_address] Varchar(MAX)
)

This creates our temporary table and assigns column names and Sql Data Types.  Obviously, these would match the types which match the columns in the orders table where we are getting the original data from.

Then from here we need to perform an insert from our select statement -



Insert
Into
#temp_results
(
[order_id]
, [invoice_id]
, [customer_name]
, [customer_id]
, [shipping_address]
)


Here SQL is using ordinal positioning from our SQL Select to populate the columns/records.  It is imperative to have the Temporary Table contain the same number of columns as the select statement.  Additionally, you must populate every column in the temporary table with your insert statement.  Otherwise, SQL will B&M about there being a difference and you’ll end a day smacking your head against the wall.  Just because you load the temporary table with data does not mean you need to display (as will be shown below).  For instance, I only care about the order_id to get the count and remove the dupes.  Other than that it provides no other function in the concept and context that we are using here, thus I will leave it out of my final select statement.  Now we need to get rid of the dupes -



DELETE
FROM
#temp_results
WHERE
order_id IN (
SELECT
[order_id]
FROM
#temp_results
GROUP BY
[order_id]
HAVING
COUNT(*) > 1
)


 



So here we are selecting the order_id from our temp table and anyone that has more then one result (using the COUNT method) we’ll simply remove the record from the temp table.



Now we need to select the columns we wish to use in our report or display on the website -



Select 
[invoice_id]
, [customer_name]
, [customer_id]
, [shipping_address]
From #temp_results

Notice here that I did not select order_id.  As a final clean-up we need to delete the temp table.  This is because we build the temp table in SQL Server memory (thus, it resides on the server) and until we delete it will remain there -




Drop Table #temp_results

So now, we have created a temporary table, inserted records into it from our select statement.  Then deleted (gutted) the duplicate order_id out.  Selected the remaining records from the temporary table and finally deleted the temporary table.  The entire code listing is below -




Create
Table
#temp_results
(
[order_id] Int
, [invoice_id] Int
, [customer_name] Varchar(250)
,
[customer_id] Int
, [shipping_address] Varchar(MAX)
)
Insert
Into
#temp_results
(
[order_id]
, [invoice_id]
, [customer_name]
, [customer_id]
, [shipping_address]
)
Select
order_id
, invoice_id
, customer_name
, customer_id
, shipping_address
From
orders..order_archive

DELETE
FROM
#temp_results
WHERE
order_id IN (
SELECT
[order_id]
FROM
#temp_results
GROUP BY
[order_id]
HAVING
COUNT(*) > 1
)

Select
[invoice_id]
, [customer_name]
, [customer_id]
, [shipping_address]
From #temp_results
Drop Table #temp_results

You utilize this logic anywhere you can use TSQL – for instance in a Stored Procedure (SP), Query, or dynamic query.


Most importantly remember to delete the temporary table.  Failure to do so will result in an error the next time the query is run saying the table has already been created.

Happy .Netting…

Hack the Gibson

No comments:

Post a Comment