Wednesday, March 28, 2012

Remove Duplicate Data

I have a query that for one reason or another produces duplicate information in the result set. I have tried using DISTINCT and GROUP BY to remove the duplicates but because of the nature of the data I cannot get this to work, here is an example fo the data I am working with

ID Name Add1 Add2
1 Matt 16 Nowhere St Glasgow
1 Matt 16 Nowhere St Glasgow, Scotland
2 Jim 23 Blue St G65 TX
3 Bill 45 Red St
3 Bill 45 red St London

The problem is that a user can have one or more addresses!! I would like to be able to remove the duplicates by keeping the first duplicate ID that appears and getting rid of the second one. Any ideas?

Cheers1. think about normalizing your data ...
2. add some primary key (identity column primaryid?)
3. then try following

select * from yourtable y
where y.primaryid = (select min(yourtable.primaryid) from yourtable y2 where y2.id = y.id)

or

select * from yourtable y1
join (select min(primaryid) as primaryid from yourtable group by id) y2 on y2.primaryid = y1.primaryid

or something like that...|||Hi Ludenka,

Thanks for that. Unfortunately Normalising the data is not an option, I just have to work with it the way it is...Managed to get it to work using a rather long winded stored procedure but I'll have a look at your suggestion and see if I can do it a better way:

DECLARE @.patno int
DECLARE @.patfname varChar(255)
DECLARE @.patsname varChar(255)
DECLARE @.DOB DATETIME
DECLARE @.patadd1 varChar(255)
DECLARE @.patadd2 varChar(255)
DECLARE @.PreviousPatientID int

CREATE TABLE #tmp_tblPrivDemo (
patno int,
patfname varChar(255),
patsname varChar(255),
DOB DATETIME,
patadd1 varChar(255),
patadd2 varChar(255)
)

--First of all load all Patient details into a cursor
DECLARE PatientIDCursor CURSOR FOR
SELECT AxTbl_Patient.Patient_ID,
AxTbl_Patient.Forename1,
AxTbl_Patient.Surname,
AxTbl_Patient.Date_Of_Birth,
AxTbl_Patient_Address.Address1,
AxTbl_Patient_Address.Address2
FROM
AxTbl_Patient
LEFT JOIN AxTbl_Patient_Address ON AxTbl_Patient_Address.Patient_ID = AxTbl_Patient.Patient_ID
LEFT JOIN AxTbl_Patient_GP ON AxTbl_Patient_GP.Patient_ID = AxTbl_Patient.Patient_ID
LEFT JOIN [SpFrm_Follow_Up_Death_Details_v1-0] ON [SpFrm_Follow_Up_Death_Details_v1-0].Patient_ID = AxTbl_Patient.Patient_ID
INNER JOIN [SpFrm_Diagnostic_Procedures_v1-1] ON [SpFrm_Diagnostic_Procedures_v1-1].Patient_ID = AxTbl_Patient.Patient_ID
WHERE
(([SpFrm_Diagnostic_Procedures_v1-1].[Date of Diagnosis] >= '01-Apr-2003' AND [SpFrm_Diagnostic_Procedures_v1-1].[Date of Diagnosis] <= '30-Sep-2003') AND (AxTbl_Patient_Address.IsCurrentAddress = 'yes' OR AxTbl_Patient_Address.IsCurrentAddress IS NULL) AND (AxTbl_Patient_GP.Date_GP_Changed IS NULL))

OPEN PatientIDCursor
FETCH NEXT FROM PatientIDCursor INTO @.patno, @.patfname, @.patsname, @.DOB, @.patadd1, @.patadd2
SET @.PreviousPatientID = 0

WHILE @.@.FETCH_STATUS = 0
BEGIN
--Then only add to temp table if not already added, so no dups.
IF @.PreviousPatientID <> @.patno
BEGIN
INSERT INTO #tmp_tblPrivDemo
SELECT @.patno, @.patfname, @.patsname, @.DOB, @.patadd1, @.patadd2
END
SET @.PreviousPatientID = @.patno
FETCH NEXT FROM PatientIDCursor INTO @.patno, @.patfname, @.patsname, @.DOB, @.patadd1, @.patadd2

END --end loop.
CLOSE PatientIDCursor
DEALLOCATE PatientIDCursor

SELECT * FROM #tmp_tblPrivDemo ORDER BY patno
DROP TABLE #tmp_tblPrivDemo|||well, normalizing is not important for the task, it was only a hint...

are you able to add the identity column?
if yes, then my solution should work
if not, you could replace all of those

y1.primaryid = y2.primaryid

with comparison off all in table included columns, but ... it's not nice at all|||I have a query that for one reason or another produces duplicate information in the result set. I have tried using DISTINCT and GROUP BY to remove the duplicates but because of the nature of the data I cannot get this to work, here is an example fo the data I am working with

ID Name Add1 Add2
1 Matt 16 Nowhere St Glasgow
1 Matt 16 Nowhere St Glasgow, Scotland
2 Jim 23 Blue St G65 TX
3 Bill 45 Red St
3 Bill 45 red St London

The problem is that a user can have one or more addresses!! I would like to be able to remove the duplicates by keeping the first duplicate ID that appears and getting rid of the second one. Any ideas?

Cheers

perhaps a quicker way of doing it than your cursor would be:

declare @.temp_name_addr TABLE (
id int not null,
name char(n) not null,
add1 char(n) null,
add2 char(n) null
)

insert @.temp_name_addr (id, name)
select distinct id, name
from permanent_table

update T
set T.add1 = P.add1,
T.add2 = P.add2
from @.temp_name_addr T,
permanent_table P
where T.id = P.id

select * from @.temp_name_addr

...of course which address you get will be arbitrary (both in your solution and mine) there's no rigid "first" without an order by clause...sql

No comments:

Post a Comment