Wednesday, March 28, 2012

Remove duplicate records after importing via SSIS

Hello all,
I have some phone logs that I would like to import into table on a daily or
periodic basis. I would like to be able to elliminate any duplicate records
that it imports when it appends it to the table. Is there some T-SQL that I
can run that would help with this situation? I DO have one particular
field that has a unique ID for each record, so it *should* be pretty easy, if
I only knew what I was doing. ;-)
--
SketchySince you have a unique key and existing data, set up the insert using a not
exists clause
insert realtable
select ..
from holdingtable ht
where not exists (select * from realtable rt where rt.keyfield =ht.keyfield)
TheSQLGuru
President
Indicium Resources, Inc.
"sketchy" <sketchy@.discussions.microsoft.com> wrote in message
news:AD538EA5-A55B-45C1-9992-D0AAFD0AD0C6@.microsoft.com...
> Hello all,
> I have some phone logs that I would like to import into table on a daily
> or
> periodic basis. I would like to be able to elliminate any duplicate
> records
> that it imports when it appends it to the table. Is there some T-SQL that
> I
> can run that would help with this situation? I DO have one particular
> field that has a unique ID for each record, so it *should* be pretty easy,
> if
> I only knew what I was doing. ;-)
> --
> Sketchy|||On Thu, 23 Aug 2007 07:22:30 -0700, sketchy
<sketchy@.discussions.microsoft.com> wrote:
>Hello all,
>I have some phone logs that I would like to import into table on a daily or
>periodic basis. I would like to be able to elliminate any duplicate records
>that it imports when it appends it to the table. Is there some T-SQL that I
>can run that would help with this situation? I DO have one particular
>field that has a unique ID for each record, so it *should* be pretty easy, if
>I only knew what I was doing. ;-)
Create a staging table that matches the input data layout. Import the
new data into the staging table. Insert the data from the staging
table into the production table with a WHERE NOT EXISTS test to skip
the duplicates. Truncate the staging table before loading and
processing the next set of data.
The only tricky part left is if there are duplicates in the staging
table itself, as the NOT EXISTS test only prevents inserting them when
they are already there. If the entire row is identical that can be
handled with a DISTINCT. If the row has differences in some column
other than the unique ID you need to provide more information on which
one to choose (as well as raising questions about the entire process.)
Roy Harvey
Beacon Falls, CT|||sketchy,
It would be better if you can avoid inserting the duplicated rows, as
TheSQLGuru stated. You need to work with the group of columns that make the
row unique, let us suppose that they are (c1, c2, c3), then:
delete dbo.t1
where exists (
select *
from dbo.t1 as a
where a.c1 = dbo.t1.c1
and a.c2 = dbo.t1.c2
and a.c3 = dbo.t1.c3
and a.[id] < dbo.t1.[id]
)
-- or
-- 2005
;with cte
as
(
select c1, ..., cn, row_number() over(partition by c1, c2, c3 order by [id])
as rn
from dbo.t1
)
delete cte
where rn > 1;
AMB
"sketchy" wrote:
> Hello all,
> I have some phone logs that I would like to import into table on a daily or
> periodic basis. I would like to be able to elliminate any duplicate records
> that it imports when it appends it to the table. Is there some T-SQL that I
> can run that would help with this situation? I DO have one particular
> field that has a unique ID for each record, so it *should* be pretty easy, if
> I only knew what I was doing. ;-)
> --
> Sketchy|||Wow guys. this is all great information. I never thought of having a
staging table. Let me soak this in a bit and if I have any more questions, I
know who to ask.
--
Sketchy
"Alejandro Mesa" wrote:
> sketchy,
> It would be better if you can avoid inserting the duplicated rows, as
> TheSQLGuru stated. You need to work with the group of columns that make the
> row unique, let us suppose that they are (c1, c2, c3), then:
> delete dbo.t1
> where exists (
> select *
> from dbo.t1 as a
> where a.c1 = dbo.t1.c1
> and a.c2 = dbo.t1.c2
> and a.c3 = dbo.t1.c3
> and a.[id] < dbo.t1.[id]
> )
> -- or
> -- 2005
> ;with cte
> as
> (
> select c1, ..., cn, row_number() over(partition by c1, c2, c3 order by [id])
> as rn
> from dbo.t1
> )
> delete cte
> where rn > 1;
>
> AMB
> "sketchy" wrote:
> > Hello all,
> >
> > I have some phone logs that I would like to import into table on a daily or
> > periodic basis. I would like to be able to elliminate any duplicate records
> > that it imports when it appends it to the table. Is there some T-SQL that I
> > can run that would help with this situation? I DO have one particular
> > field that has a unique ID for each record, so it *should* be pretty easy, if
> > I only knew what I was doing. ;-)
> > --
> > Sketchy|||Okay, so here is what I have done. (this is a SQL 2005 DB by the way...)
1. I have my original table, which will be used for my reporting needs,
called 'phones'. It has a handfull of fields, (e.g. "Field1" "Field2"
"Field3" etc. but the main field that has the unique info in it is called
"GlobalCallID"
2. I have created a new table, which will be used for my staging of data,
called 'phonelogstaging'. This database has the EXACT same field names and
types.
3. I've set up SSIS to import my log files into the staging table
("phonelogstaging"). It wipes out any previous data in this staging table,
so there is no possibility of duplicates in this table. Everything is
working good there.
Both tables have a field called GlobalCallID that has the unique number in
it that I should be able to check against.
So considering the above, how would my statement look?
--
Sketchy
"Roy Harvey" wrote:
> On Thu, 23 Aug 2007 07:22:30 -0700, sketchy
> <sketchy@.discussions.microsoft.com> wrote:
> >Hello all,
> >
> >I have some phone logs that I would like to import into table on a daily or
> >periodic basis. I would like to be able to elliminate any duplicate records
> >that it imports when it appends it to the table. Is there some T-SQL that I
> >can run that would help with this situation? I DO have one particular
> >field that has a unique ID for each record, so it *should* be pretty easy, if
> >I only knew what I was doing. ;-)
> Create a staging table that matches the input data layout. Import the
> new data into the staging table. Insert the data from the staging
> table into the production table with a WHERE NOT EXISTS test to skip
> the duplicates. Truncate the staging table before loading and
> processing the next set of data.
> The only tricky part left is if there are duplicates in the staging
> table itself, as the NOT EXISTS test only prevents inserting them when
> they are already there. If the entire row is identical that can be
> handled with a DISTINCT. If the row has differences in some column
> other than the unique ID you need to provide more information on which
> one to choose (as well as raising questions about the entire process.)
> Roy Harvey
> Beacon Falls, CT
>|||Okay, so here is what I have done. (this is a SQL 2005 DB by the way...)
1. I have my original table, which will be used for my reporting needs,
called 'phones'. It has a handfull of fields, (e.g. "Field1" "Field2"
"Field3" etc. but the main field that has the unique info in it is called
"GlobalCallID"
2. I have created a new table, which will be used for my staging of data,
called 'phonelogstaging'. This database has the EXACT same field names and
types.
3. I've set up SSIS to import my log files into the staging table
("phonelogstaging"). It wipes out any previous data in this staging table,
so there is no possibility of duplicates in this table. Everything is
working good there.
Both tables have a field called GlobalCallID that has the unique number in
it that I should be able to check against.
So considering the above, how would my statement look?
--
Sketchy
"TheSQLGuru" wrote:
> Since you have a unique key and existing data, set up the insert using a not
> exists clause
> insert realtable
> select ..
> from holdingtable ht
> where not exists (select * from realtable rt where rt.keyfield => ht.keyfield)
>
> --
> TheSQLGuru
> President
> Indicium Resources, Inc.
> "sketchy" <sketchy@.discussions.microsoft.com> wrote in message
> news:AD538EA5-A55B-45C1-9992-D0AAFD0AD0C6@.microsoft.com...
> > Hello all,
> >
> > I have some phone logs that I would like to import into table on a daily
> > or
> > periodic basis. I would like to be able to elliminate any duplicate
> > records
> > that it imports when it appends it to the table. Is there some T-SQL that
> > I
> > can run that would help with this situation? I DO have one particular
> > field that has a unique ID for each record, so it *should* be pretty easy,
> > if
> > I only knew what I was doing. ;-)
> > --
> > Sketchy
>
>|||Okay, so here is what I have done. (this is a SQL 2005 DB by the way...)
1. I have my original table, which will be used for my reporting needs,
called 'phones'. It has a handfull of fields, (e.g. "Field1" "Field2"
"Field3" etc. but the main field that has the unique info in it is called
"GlobalCallID"
2. I have created a new table, which will be used for my staging of data,
called 'phonelogstaging'. This database has the EXACT same field names and
types.
3. I've set up SSIS to import my log files into the staging table
("phonelogstaging"). It wipes out any previous data in this staging table,
so there is no possibility of duplicates in this table. Everything is
working good there.
Both tables have a field called GlobalCallID that has the unique number in
it that I should be able to check against.
So considering the above, how would my statement look?
--
Sketchy
"Alejandro Mesa" wrote:
> sketchy,
> It would be better if you can avoid inserting the duplicated rows, as
> TheSQLGuru stated. You need to work with the group of columns that make the
> row unique, let us suppose that they are (c1, c2, c3), then:
> delete dbo.t1
> where exists (
> select *
> from dbo.t1 as a
> where a.c1 = dbo.t1.c1
> and a.c2 = dbo.t1.c2
> and a.c3 = dbo.t1.c3
> and a.[id] < dbo.t1.[id]
> )
> -- or
> -- 2005
> ;with cte
> as
> (
> select c1, ..., cn, row_number() over(partition by c1, c2, c3 order by [id])
> as rn
> from dbo.t1
> )
> delete cte
> where rn > 1;
>
> AMB
> "sketchy" wrote:
> > Hello all,
> >
> > I have some phone logs that I would like to import into table on a daily or
> > periodic basis. I would like to be able to elliminate any duplicate records
> > that it imports when it appends it to the table. Is there some T-SQL that I
> > can run that would help with this situation? I DO have one particular
> > field that has a unique ID for each record, so it *should* be pretty easy, if
> > I only knew what I was doing. ;-)
> > --
> > Sketchy|||On Thu, 23 Aug 2007 10:20:00 -0700, sketchy
<sketchy@.discussions.microsoft.com> wrote:
>Okay, so here is what I have done. (this is a SQL 2005 DB by the way...)
>1. I have my original table, which will be used for my reporting needs,
>called 'phones'. It has a handfull of fields, (e.g. "Field1" "Field2"
>"Field3" etc. but the main field that has the unique info in it is called
>"GlobalCallID"
>2. I have created a new table, which will be used for my staging of data,
>called 'phonelogstaging'. This database has the EXACT same field names and
>types.
>3. I've set up SSIS to import my log files into the staging table
>("phonelogstaging"). It wipes out any previous data in this staging table,
>so there is no possibility of duplicates in this table. Everything is
>working good there.
>Both tables have a field called GlobalCallID that has the unique number in
>it that I should be able to check against.
>So considering the above, how would my statement look?
Assuming that data already in the table does not need to be refreshed,
only new data added:
INSERT phones
SELECT <column list>
FROM phonelogstaging as A
WHERE NOT EXISTS
(SELECT * FROM phones as B
WHERE A.GlobalCallID = B.GlobalCallID)
If the incoming data itself has duplicates, add DISTINCT after the
word SELECT.
If you need to refresh the rest of the columns of matching rows from
the staging data, you would also run the following BEFORE the command
above.
UPDATE phones
SET col1 = A.col1,
col2 = A.col2
FROM phonelogstaging as A
WHERE phones.GlobalCallID = A.GlobalCallID
Roy Harvey
Beacon Falls, CT|||Hi Roy,
Thank you SO MUCH for your quick response.
1. Yes, only new data needs to be added. No data needs to be refreshed, so
that's good.
2. One little wrinkle in the plan is that much to my dismay, it does appear
that the "GlobalCallID" field isn't necessarily a unique number, but I do
have a field adjacent to it ("CallNumber") where no records would ever have
the same combination of the two. How would you ammend your last statement to
accomodate for this? (Ugh, I know... that probably doesn't make things
simpler)
--
Sketchy
"Roy Harvey" wrote:
> On Thu, 23 Aug 2007 10:20:00 -0700, sketchy
> <sketchy@.discussions.microsoft.com> wrote:
> >Okay, so here is what I have done. (this is a SQL 2005 DB by the way...)
> >
> >1. I have my original table, which will be used for my reporting needs,
> >called 'phones'. It has a handfull of fields, (e.g. "Field1" "Field2"
> >"Field3" etc. but the main field that has the unique info in it is called
> >"GlobalCallID"
> >
> >2. I have created a new table, which will be used for my staging of data,
> >called 'phonelogstaging'. This database has the EXACT same field names and
> >types.
> >
> >3. I've set up SSIS to import my log files into the staging table
> >("phonelogstaging"). It wipes out any previous data in this staging table,
> >so there is no possibility of duplicates in this table. Everything is
> >working good there.
> >
> >Both tables have a field called GlobalCallID that has the unique number in
> >it that I should be able to check against.
> >
> >So considering the above, how would my statement look?
> Assuming that data already in the table does not need to be refreshed,
> only new data added:
> INSERT phones
> SELECT <column list>
> FROM phonelogstaging as A
> WHERE NOT EXISTS
> (SELECT * FROM phones as B
> WHERE A.GlobalCallID = B.GlobalCallID)
> If the incoming data itself has duplicates, add DISTINCT after the
> word SELECT.
> If you need to refresh the rest of the columns of matching rows from
> the staging data, you would also run the following BEFORE the command
> above.
> UPDATE phones
> SET col1 = A.col1,
> col2 = A.col2
> FROM phonelogstaging as A
> WHERE phones.GlobalCallID = A.GlobalCallID
> Roy Harvey
> Beacon Falls, CT
>|||Just add the new field that DOES make a row unique to the not exists clause:
> INSERT phones
> SELECT <column list>
> FROM phonelogstaging as A
> WHERE NOT EXISTS
> (SELECT * FROM phones as B
> WHERE A.GlobalCallID = B.GlobalCallID
and a.CallNumber = b.CallNumber)
You can use as many fields as you need to ensure uniqueness.
--
TheSQLGuru
President
Indicium Resources, Inc.
"sketchy" <sketchy@.discussions.microsoft.com> wrote in message
news:CCA5542F-5C15-413A-8B8F-D2200C1BEAD3@.microsoft.com...
> Hi Roy,
> Thank you SO MUCH for your quick response.
> 1. Yes, only new data needs to be added. No data needs to be refreshed,
> so
> that's good.
> 2. One little wrinkle in the plan is that much to my dismay, it does
> appear
> that the "GlobalCallID" field isn't necessarily a unique number, but I do
> have a field adjacent to it ("CallNumber") where no records would ever
> have
> the same combination of the two. How would you ammend your last statement
> to
> accomodate for this? (Ugh, I know... that probably doesn't make things
> simpler)
> --
> Sketchy
>
> "Roy Harvey" wrote:
>> On Thu, 23 Aug 2007 10:20:00 -0700, sketchy
>> <sketchy@.discussions.microsoft.com> wrote:
>> >Okay, so here is what I have done. (this is a SQL 2005 DB by the
>> >way...)
>> >
>> >1. I have my original table, which will be used for my reporting needs,
>> >called 'phones'. It has a handfull of fields, (e.g. "Field1" "Field2"
>> >"Field3" etc. but the main field that has the unique info in it is
>> >called
>> >"GlobalCallID"
>> >
>> >2. I have created a new table, which will be used for my staging of
>> >data,
>> >called 'phonelogstaging'. This database has the EXACT same field names
>> >and
>> >types.
>> >
>> >3. I've set up SSIS to import my log files into the staging table
>> >("phonelogstaging"). It wipes out any previous data in this staging
>> >table,
>> >so there is no possibility of duplicates in this table. Everything is
>> >working good there.
>> >
>> >Both tables have a field called GlobalCallID that has the unique number
>> >in
>> >it that I should be able to check against.
>> >
>> >So considering the above, how would my statement look?
>> Assuming that data already in the table does not need to be refreshed,
>> only new data added:
>> INSERT phones
>> SELECT <column list>
>> FROM phonelogstaging as A
>> WHERE NOT EXISTS
>> (SELECT * FROM phones as B
>> WHERE A.GlobalCallID = B.GlobalCallID)
>> If the incoming data itself has duplicates, add DISTINCT after the
>> word SELECT.
>> If you need to refresh the rest of the columns of matching rows from
>> the staging data, you would also run the following BEFORE the command
>> above.
>> UPDATE phones
>> SET col1 = A.col1,
>> col2 = A.col2
>> FROM phonelogstaging as A
>> WHERE phones.GlobalCallID = A.GlobalCallID
>> Roy Harvey
>> Beacon Falls, CT|||It works!!!! ...Thanks guys for all of your help in this. I hope for some
good computer karma to come your way.
One interesting little tidbit is that I was unable to get it to work by
specifying all of the fields in the Select statement. No matter what I did,
it always came back with the error of:
Insert Error: Column name or number of supplied values does not match table
definition.
It was odd because that table was the exact same as the other one. So I
just changed it to "*" and all worked well.
--
Sketchy
"TheSQLGuru" wrote:
> Just add the new field that DOES make a row unique to the not exists clause:
>
> > INSERT phones
> > SELECT <column list>
> > FROM phonelogstaging as A
> > WHERE NOT EXISTS
> > (SELECT * FROM phones as B
> > WHERE A.GlobalCallID = B.GlobalCallID
> and a.CallNumber = b.CallNumber)
> You can use as many fields as you need to ensure uniqueness.
> --
> TheSQLGuru
> President
> Indicium Resources, Inc.
> "sketchy" <sketchy@.discussions.microsoft.com> wrote in message
> news:CCA5542F-5C15-413A-8B8F-D2200C1BEAD3@.microsoft.com...
> > Hi Roy,
> >
> > Thank you SO MUCH for your quick response.
> >
> > 1. Yes, only new data needs to be added. No data needs to be refreshed,
> > so
> > that's good.
> >
> > 2. One little wrinkle in the plan is that much to my dismay, it does
> > appear
> > that the "GlobalCallID" field isn't necessarily a unique number, but I do
> > have a field adjacent to it ("CallNumber") where no records would ever
> > have
> > the same combination of the two. How would you ammend your last statement
> > to
> > accomodate for this? (Ugh, I know... that probably doesn't make things
> > simpler)
> > --
> > Sketchy
> >
> >
> > "Roy Harvey" wrote:
> >
> >> On Thu, 23 Aug 2007 10:20:00 -0700, sketchy
> >> <sketchy@.discussions.microsoft.com> wrote:
> >>
> >> >Okay, so here is what I have done. (this is a SQL 2005 DB by the
> >> >way...)
> >> >
> >> >1. I have my original table, which will be used for my reporting needs,
> >> >called 'phones'. It has a handfull of fields, (e.g. "Field1" "Field2"
> >> >"Field3" etc. but the main field that has the unique info in it is
> >> >called
> >> >"GlobalCallID"
> >> >
> >> >2. I have created a new table, which will be used for my staging of
> >> >data,
> >> >called 'phonelogstaging'. This database has the EXACT same field names
> >> >and
> >> >types.
> >> >
> >> >3. I've set up SSIS to import my log files into the staging table
> >> >("phonelogstaging"). It wipes out any previous data in this staging
> >> >table,
> >> >so there is no possibility of duplicates in this table. Everything is
> >> >working good there.
> >> >
> >> >Both tables have a field called GlobalCallID that has the unique number
> >> >in
> >> >it that I should be able to check against.
> >> >
> >> >So considering the above, how would my statement look?
> >>
> >> Assuming that data already in the table does not need to be refreshed,
> >> only new data added:
> >>
> >> INSERT phones
> >> SELECT <column list>
> >> FROM phonelogstaging as A
> >> WHERE NOT EXISTS
> >> (SELECT * FROM phones as B
> >> WHERE A.GlobalCallID = B.GlobalCallID)
> >>
> >> If the incoming data itself has duplicates, add DISTINCT after the
> >> word SELECT.
> >>
> >> If you need to refresh the rest of the columns of matching rows from
> >> the staging data, you would also run the following BEFORE the command
> >> above.
> >>
> >> UPDATE phones
> >> SET col1 = A.col1,
> >> col2 = A.col2
> >> FROM phonelogstaging as A
> >> WHERE phones.GlobalCallID = A.GlobalCallID
> >>
> >> Roy Harvey
> >> Beacon Falls, CT
> >>
>
>

No comments:

Post a Comment