Friday, March 30, 2012

remove duplicates, add PK..need help

I am currently cleaning up a "loose" database, by adding primary keys on a
few particular tables that currently have none.
The Primary key will contain 2 fields, but before I add the pk, I need to
delete any duplicates. There should be none or very few that snuck by the
application, so deleting them is not a concern.
Here is a sample of the CURRENT table format:
CREATE TABLE OrderSalesReps(
OrderID int NOT NULL,
SalesRepID int NOT NULL,
Revenue_1 decimal(14,2) NOT NULL,
Revenue_2 decimal(14,2) NOT NULL
)
So the new Primary Key will be on [OrderID and SalesRepID]
but there may be duplicates that exist currently.
What is a clean query to delete dups before I create the Primary Key.
Thanks in advance.
[Please note: OrderID in this table is a Foreign Key which links to
Orders.OrderID - doesn't matter for this case, but thought I'd mention]Hi Chris,
Which version of SQL Server are you working with?
In SQL Server 2005 the solution is pretty simple and fast:
WITH Dups AS
(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY OrderID, SalesRepID
ORDER BY OrderID, SalesRepID) AS RowNum
FROM OrderSalesReps
)
DELETE FROM Dups WHERE RowNum > 1;
BG, SQL Server MVP
www.SolidQualityLearning.com
www.insidetsql.com
Anything written in this message represents my view, my own view, and
nothing but my view (WITH SCHEMABINDING), so help me my T-SQL code.
"Chris" <rooster575@.hotmail.com> wrote in message
news:utBY9xXhGHA.3904@.TK2MSFTNGP02.phx.gbl...
>I am currently cleaning up a "loose" database, by adding primary keys on a
>few particular tables that currently have none.
> The Primary key will contain 2 fields, but before I add the pk, I need to
> delete any duplicates. There should be none or very few that snuck by the
> application, so deleting them is not a concern.
> Here is a sample of the CURRENT table format:
> CREATE TABLE OrderSalesReps(
> OrderID int NOT NULL,
> SalesRepID int NOT NULL,
> Revenue_1 decimal(14,2) NOT NULL,
> Revenue_2 decimal(14,2) NOT NULL
> )
> So the new Primary Key will be on [OrderID and SalesRepID]
> but there may be duplicates that exist currently.
> What is a clean query to delete dups before I create the Primary Key.
> Thanks in advance.
> [Please note: OrderID in this table is a Foreign Key which links to
> Orders.OrderID - doesn't matter for this case, but thought I'd mention]
>
>|||Unfortunately, this solution has to work with SQL Server 2000 and 2005.
Thanks
-Chris
"Itzik Ben-Gan" <itzik@.REMOVETHIS.SolidQualityLearning.com> wrote in message
news:eMVD1ZYhGHA.4144@.TK2MSFTNGP02.phx.gbl...
> Hi Chris,
> Which version of SQL Server are you working with?
> In SQL Server 2005 the solution is pretty simple and fast:
> WITH Dups AS
> (
> SELECT *,
> ROW_NUMBER() OVER(PARTITION BY OrderID, SalesRepID
> ORDER BY OrderID, SalesRepID) AS RowNum
> FROM OrderSalesReps
> )
> DELETE FROM Dups WHERE RowNum > 1;
> --
> BG, SQL Server MVP
> www.SolidQualityLearning.com
> www.insidetsql.com
> Anything written in this message represents my view, my own view, and
> nothing but my view (WITH SCHEMABINDING), so help me my T-SQL code.
>
> "Chris" <rooster575@.hotmail.com> wrote in message
> news:utBY9xXhGHA.3904@.TK2MSFTNGP02.phx.gbl...
>|||First you need to decide on what your rules are for deciding which
duplicate to keep. If you're talking about exact duplicates (i.e., all
column values are identical) then you can select the unique records
into a temporary table, truncate the original table, add the primary
key, then insert the unique records back into the original table (you
could insert back then recreate the PK if you wanted as well). For
example:
SELECT DISTINCT OrderID, SalesRepID, Revenue_1, Revenue_2
INTO tmp_OrderSalesReps
TRUNCATE TABLE OrderSalesReps
ALTER TABLE OrderSalesReps
ADD CONSTRAINT PK_OrderSalesReps PRIMARY KEY CLUSTERED (OrderID,
SalesRepID)
INSERT OrderSalesReps (OrderID, SalesRepID, Revenue_1, Revenue_2)
SELECT OrderID, SalesRepID, Revenue_1, Revenue_2
FROM tmp_OrderSalesReps
--
Of course, if you end up with two rows with identical OrderIDs and
SalesRepIDs, but different Revenues then you need to decide which one
to keep. If it's really only a few rows then you could manually remove
them from your work table before inserting back to the original table.
If you wanted to use the highest revenue values found for an order/rep
(for example) then you might change your SELECT...INTO to something
like:
SELECT OrderID, SalesRepID, MAX(Revenue_1), MAX(Revenue_2)
INTO tmp_OrderSalesReps
FROM OrderSalesReps
GROUP BY OrderID, SalesRepID
HTH,
-Tom.
Chris wrote:
> I am currently cleaning up a "loose" database, by adding primary keys on a
> few particular tables that currently have none.
> The Primary key will contain 2 fields, but before I add the pk, I need to
> delete any duplicates. There should be none or very few that snuck by the
> application, so deleting them is not a concern.
> Here is a sample of the CURRENT table format:
> CREATE TABLE OrderSalesReps(
> OrderID int NOT NULL,
> SalesRepID int NOT NULL,
> Revenue_1 decimal(14,2) NOT NULL,
> Revenue_2 decimal(14,2) NOT NULL
> )
> So the new Primary Key will be on [OrderID and SalesRepID]
> but there may be duplicates that exist currently.
> What is a clean query to delete dups before I create the Primary Key.
> Thanks in advance.
> [Please note: OrderID in this table is a Foreign Key which links to
> Orders.OrderID - doesn't matter for this case, but thought I'd mention]

No comments:

Post a Comment