SSIS - FuzzyLookup
Fuzzy Lookup Data Transformation
This article explains how to use Fuzzy Lookup vs Lookup Data Flow task. An Excel file will be used as the data source and a SQL Server table will be used as the reference data source to perform the Fuzzy Matching. An additional SQL Server table will be created and used to receive the matching data from the task as inserted rows.
Fuzzy Lookup Background
SSIS Lookup Transform tasks retrieve only exact matches (Lookup Match) or non matches (Lookup No Match). To match items that are similar, but not exact, SSIS provides a different transform task called Fuzzy Lookup.
For example, if I have a list of items that I want to lookup in another list but they don't exactly match, I can use a Fuzzy Lookup for my Source table. The Fuzzy Lookup will connect to a 2nd table (called a Reference table) to retrieve and attempt to match values based on a percentage of "similarity" that you will provide in the task. I will have produced a list of potential matches for my Source Data that I can review.
Using Fuzzy Lookup
Open a new Visual Studio Project: Business Intelligence Projects, Integration Services Project.
Add a Data Flow task to the Control Flow tab.
Right click in the Connection Manager section and click on New Connection. This is the point where a connection is made to the Source Data (the data that you want to attempt to match). This project uses an Excel file with Places in Hawaii that I want to match to another file.
Fuzzy Lookup Connections - Excel & OLEDB
Select Excel as the Connection Manager type.
Browse to locate the file to use as the data source for the lookup of the matching data (a list of places in the counties of Hawaii).
Drag and drop an Excel Data Source to the Data Flow and right click to Edit it.
Select the Excel Connection Manager created as the connection and select the Excel sheet with the data.
Select the columns to use in the lookup process.
Right click on the Connection Manager and select New OLE DB Connection to add. This will be the source of the reference table.
Select the Server Name from the drop down box and select the database name to connect.
The Fuzzy Lookup Task
Drag and drop a Fuzzy Lookup transformation to the Data Flow and right click to edit it.
Select the Connection and the table name from the drop down boxes. Select Store new index and allow the default name to be populated.
Click on the columns tab to join the columns to be matched. In this case, Place from the Source Data to be "fuzzy" matched to the Reference Table in SQL Server.
Click on the Advanced tab to select the Maximum number of matches you want to see and the level of similarity between the data source and reference data. I am using a threshold of 80% to restrict the number of matches. Going with a lower similarity threshold will increase the number of matches depending on the Maximum number of matches setting.
Fuzzy Lookup Data Output Storage
Create a table in SQL Server to store the matching data and related similarity, confidence fields.
CREATE TABLE [dbo].[FuzzyLookup](
ID int identity(1,1) not null primary key clustered,
[Place] [nvarchar](255) NULL,
[PlaceNames] [nvarchar](255) NULL,
[Similarity] [real] NULL,
[Confidence] [real] NULL,
[SimilarityPlace] [real] NULL
)
Drag and drop and OLE DB Destination on the data flow and drag the green arrow from Fuzzy Lookup on to it. Right click on the OLE DB destination to connect it and map the paths.
Set the connection manager and newly created table name in the drop down boxes.
Map the paths in the Mappings section by clicking on the Available Input Columns and dragging them to the Available Destination Columns. Make sure to match up the similarity and confidence columns to see how closely the data matches.
Fuzzy Lookup Data Viewers
A preview of the data before it is processed would be a nice addition to the SSIS task. A Data Viewer would accomplish this easily.
Right click on the green arrow between the Fuzzy Lookup and the OLE DB Destination to add a Data Viewer.
Click Add and select Grid from the Pop up. Click OK to accept the default name for the data viewer.
Click OK to close out.
Execute SQL Task - Clearing the Fuzzy Lookup Data Output Table
As this task could be repeated (perhaps with adjustments made to the Similarity in the Fuzzy Lookup), we do not want to continuously append the data to the table. To clear the data from the table, an addition to the Control Flow to remove the data should be added.
Click in the Control Flow tab and add an Execute SQL task. Join the task by dragging the green arrow to the Data flow task.
Right click to edit it and select the Connection and click on the ellipsis (...) box under SQL Statement.
Type in the command to remove all data from the table. Truncate uses less system resources than a delete statement.
Execute the SSIS package. Note that the Data Viewer pops up after the fuzzy lookup prior to the transfer to the database table.
In SSMS, execute a SELECT statement to verify the data was transferred into the table.
SELECT * FROM dbo.FuzzyLookup
The data results show that any match that was at least 80% similar or above is a potential match. Fuzzy Match does a pretty nice job of checking out potential duplicates in your data.
NOTES: Fuzzy Lookup and Excel Data Source on 64 bit machines SSIS SQL Server 2008R2* - Two interesting problems happened when I attempted to perform a Fuzzy Lookup on SSIS on a 64 bit Windows 7 PC that had Office 2010 (32 bit version) installed. In 64 bit mode, I would execute the task and it would "hang" during the Fuzzy Lookup and never make it to the final destination with the data. The reason was that it crashed because the 32 bit Excel Source file could not transform in the 64 bit environment. Problem #2 happened when I went to Project, Properties, Debugging and changed the SSIS project to 32 bit mode (Run64BitRuntime = false). The Fuzzy Lookup would no longer execute. The solution I used was to separate the Excel file into it's own package in a separate project and have it transform to an OLE DB Destination and in the other project, perform the Fuzzy Lookup between the 2 OLE DB Connections and push the final data into the Fuzzy Lookup table in SQL Server.
*I did not have this trouble on Windows XP with Office 2003 in SSIS Visual Studio 2008 and I currently do not have any trouble with Fuzzy Lookup and Excel in the same package on Windows 7 with Office 2007 in SSIS 2012 SQL Server Data Tools (both 64 bit and 32 bit mode worked).