SSIS Data Quality Services
DQS for SQL Server 2012
This article describes how to set up a DQS Domain within a Knowledge Base in the Data Quality Client and Add Values to it. Then I will create an SSIS project that will use DQS Cleansing to transform the data from a SQL Server table using this new DQS Domain. The results will be pushed out to an Excel file. The prerequistes are that you have already installed Data Quality Services from your SQL Server media and SSIS and have already used the Data Quality Server Installer to create the 3 DQS databases: DQS_MAIN, DQS_PROJECTS, DQS_STAGING_DATA on your instance of SQL Server 2012. I am using SQL Server 2012 Developer Edition.
Create a Domain in a Knowledge Base in DQS
Once DQS is up and running, you can go to the Start Menu and launch the Data Quality Client.
Click on Open Knowledge Base to view all of the available Knowledge Bases.
Select the Knowledge Base that you want to open and click the Next button at the bottom right side of the screen.
Create the new Domain within this Knowledge Base. This project will standardize a mix of abbreviations for US States and Canadian Provinces as well as a Country Code to demonstrate how DQS works. After you give the Domain a name and a brief description to remind you of what it is, click OK to create it.
Select the Domain Values tab. Click on the button with a Green Plus Sign near the Statistics heading to create a new value or after you create the 1st value in the list, right click on the Value area to get the Context Menu and select Create to enter another Value.
For this project, I have a table in the AdventureWorks database called tblSalesAddresses with an AddressLine1, City, State, Zip, and Country. I am creating a set of values that will replace the long version of the State or Province or Country (in the StateProv Field) witha 2 or 3 letter abbreviation. The picture below shows the data from the table.
As I create the values, I am setting them as Synonyms so that DQS will recogize these values as Valid and replace the longer version of the State or Province with the correct abbreviation. To Set 2 values as Synonyms, simply highlight both and right click and select Set as Synonyms.
The values in this sample show that when DQS encounters the value AB during the Cleansing process, it will leave it alone as it is Correct. If it encounters Alberta during the cleansing process, it will change the Value to AB.
When setting values as Synonyms, DQS might automatically place the wrong value as the Lead Value. In this example, I do not want TN replaced, I want Tennessee replaced. I right click on the value TN and select Set as Leading to correct the Synonyms.
Once my list is complete, I click the Finish button in the corner and get this Pop Up informing me to Publish the data added to the Knowledge Base. Click Publish to finish.
Creating a DQS Cleansing Transform Task in SSIS
Open SSIS and create a new Integrations Services Project. Add a Data Flow task and Right click in the Connection Manager section and select New Connection. Select DQS from the options in the Pop Up box.
Select your DQS Server Name from the drop down and click OK to continue.
Add an OLE DB source and Connection to the AdventureWorks database to your table with the AddressLine1 and City, StateProv, ZipCode, and Country fields. Add a DQS Cleansing Transform to the Project. Add an Excel Destination and new Excel Connection to the project. I created an Excel file with the same field names as my table plus some additional fields that will capture the DQS Cleansing output: SalesPersonID, SalesName, Address1, City, StateProv, ZipCode, CountryRegionName, StateProvOutput, StateStatus, StateConfidence, Reason. The Data Flow tab should look similar to this:
Click on and Edit the DQS task to set up the Connection in Connection Manager. Then Map the StateProv field from the OLE DB table to the StateProvNames Domain in the Knowledge Base.
Click on the Advanced tab to add the Confidence and Reason fields to the output. Click OK to close.
On the Excel Destination, edit the Mappings to the output fields to the corresponding Excel fields and click OK when complete.
Run the task. DQS will transform the table data and return it to the Excel file destination. Once it is Successfully complete, you can verify your results in the Excel file. DQS will provide you with the corrected output, tell you the Status of the field, and provide a Reason that the data was changed to the Leading Value of the Synonyms in the Domain.
A note about DQS: I was looking for a solution to standardize some data and found that DQS can only transform EXACT matches. I was trying use a Domain in the Knowledge Base called StreetNames which would take Road and correct it to Rd., Street and correct it to St., Parkway and convert it to Pkwy. I wanted to run the data through the DQS Cleansing Task and have it perform a Fuzzy Match to the data to find a portion of the data within the column and correct it for me. It failed. If my Domain contained complete addresses to correct to like use 10203 Acorn Avenue and Correct to: 10203 Acorn Ave., then it would work. I do not wish to create a Domain with this much data in it, however, there are services through Windows Azure Marketplace where you can sign up for Free or Paid datasets, data dimensions, and quality services from companies like Melissa Data, D & B, and Infogroup. These companies maintain data points on many items such as addresses, company names, phone numbers, email addresses, and more. You simply access it from the Data Quality Client by pressing the Configuration button and clicking on Create a Data Market ID to launch a browser window to the Azure Marketplace for Data Quality Services.