Friday, March 30, 2012

Remove duplicates within pipeline

I have a situation where we get XML files sent daily that need uploading into SQL Server tables, but the source system producing these files sometimes generates duplicate records in the file. The tricky part is, that the record isn't entirely duplicated. What I mean, is that if I look for duplicates by grouping the key columns, having count(*) > 1, I find which ones are duplicates, but when I inspect the data on these duplicates, the other details in the remaining columns may differ. So our rule is: pick the first record, toss the rest of the duplicates.

Because we don't sort on any columns during the import, the first record kept of the duplicates is arbitrary. Again, we can't tell at this point which of the duplicated records is more correct. Someday down the road, we will do this research.

Now, I need to know the most efficient way to accomplish this in SSIS. If it makes it easier, I could just discard all the duplicates, since the number of them is so small.

If the source were a relational table, I could use a SQL statement to filter the records to remove the duplicates, but since the source is an XML file, I don't know how to filter these out in the pipeline, since the file has to be aggregated to search for dups.

Thanks

Kory

Never mind... I think I found exactly what I needed: The Sort Transform.

-Kory

|||

The only way I can think is to use the sort or aggregate transformations. Have you explore those? Notice that those are full blcoking transformation, so memory usage and performance are things you may want to check.

Rafael Salas

|||

Yes, I thought the sort transform would do the trick- and it did for small files. Files with < 500,000 rows sorted immediately, within 5-10 seconds. Files > 500,000 or so just hung. Looking at task manager, the DTSDebugHost.exe kept climbing and my overall memory consumption was > 5G and I only have 3G total on the server.

I would have thought the performance was linearly decrease, and not go from 10 seconds to indefinite for just 200K rows more.

I've downloaded and installed the Extrasort component, but get an error when I try to put it on the design surface. It complains that it wasn't installed correctly. I've uninstalled and reinstalled it twice. I know NSort is another option, but I really am not needing sorting functionality, just removing duplicates.

SSIS comes with a sample solution that builds a component for removing duplicates, but as far as I can tell, the fields to pick to determine duplicates are the only fields that it passes through the pipeline. I need to remove dups based on 3 fields, but pass through the rest of the fields, like the sort component does.

Any other ideas out there?

Thanks

Kory

|||

Hi I'm having the same problem

does the ssis have the capabilities of seperating the duplicate records ? or still i use the query? can you give me some advice on this ?

KoryS wrote:

I have a situation where we get XML files sent daily that need uploading into SQL Server tables, but the source system producing these files sometimes generates duplicate records in the file. The tricky part is, that the record isn't entirely duplicated. What I mean, is that if I look for duplicates by grouping the key columns, having count(*) > 1, I find which ones are duplicates, but when I inspect the data on these duplicates, the other details in the remaining columns may differ. So our rule is: pick the first record, toss the rest of the duplicates.

Because we don't sort on any columns during the import, the first record kept of the duplicates is arbitrary. Again, we can't tell at this point which of the duplicated records is more correct. Someday down the road, we will do this research.

Now, I need to know the most efficient way to accomplish this in SSIS. If it makes it easier, I could just discard all the duplicates, since the number of them is so small.

If the source were a relational table, I could use a SQL statement to filter the records to remove the duplicates, but since the source is an XML file, I don't know how to filter these out in the pipeline, since the file has to be aggregated to search for dups.

Thanks

Kory

|||What version of ExtraSort are you using and what platform is it running on?

Had no problems running ExtraSort file version 1.0.0.3 (98,304 bytes) on a 32 bit dev platform Win XP SP2 as well as Win2k3 SP1. The SQL Server Build on both is 2153 , which "everyone" running IS should be on at this point. Have been unable to get ExtraSort to run natively on x64.

By default, the component installs to C:\Program Files\Ivolva Digital\ExtraSort Component\ExtraSort.dll.

sql

No comments:

Post a Comment