Programming Samples

Click here to go to:



Excel VBA

Word VBA

MS Access

T-SQL

SSIS

SSRS

Power BI

Crystal Reports

SSAS

SQL Replication

C# Code

ASP .NET Code

Oracle PL/SQL

Database Diagramming


Back to Home Page


SSIS - FuzzyLookup

Fuzzy Lookup Data Transformation

This article explains how to use Fuzzy Lookup vs Lookup Data Flow task. An Excel file will be used as the data source and a SQL Server table will be used as the reference data source to perform the Fuzzy Matching. An additional SQL Server table will be created and used to receive the matching data from the task as inserted rows.

Fuzzy Lookup Background

SSIS Lookup Transform tasks retrieve only exact matches (Lookup Match) or non matches (Lookup No Match). To match items that are similar, but not exact, SSIS provides a different transform task called Fuzzy Lookup.

For example, if I have a list of items that I want to lookup in another list but they don't exactly match, I can use a Fuzzy Lookup for my Source table. The Fuzzy Lookup will connect to a 2nd table (called a Reference table) to retrieve and attempt to match values based on a percentage of "similarity" that you will provide in the task. I will have produced a list of potential matches for my Source Data that I can review.

Using Fuzzy Lookup

Open a new Visual Studio Project: Business Intelligence Projects, Integration Services Project.

SSIS Bus Intelligence Project

Add a Data Flow task to the Control Flow tab.

Data Flow

Right click in the Connection Manager section and click on New Connection. This is the point where a connection is made to the Source Data (the data that you want to attempt to match). This project uses an Excel file with Places in Hawaii that I want to match to another file.

Fuzzy Lookup Connections - Excel & OLEDB

New Connection

Select Excel as the Connection Manager type.

Excel Connection

Browse to locate the file to use as the data source for the lookup of the matching data (a list of places in the counties of Hawaii).

Excel File

Drag and drop an Excel Data Source to the Data Flow and right click to Edit it.

Excel Source

Select the Excel Connection Manager created as the connection and select the Excel sheet with the data.

Excel Source Connection

Select the columns to use in the lookup process.

Column Select

Right click on the Connection Manager and select New OLE DB Connection to add. This will be the source of the reference table.

OLE DB Connection

Select the Server Name from the drop down box and select the database name to connect.

Connection Manager

The Fuzzy Lookup Task

Drag and drop a Fuzzy Lookup transformation to the Data Flow and right click to edit it.

Fuzzy Lookup

Select the Connection and the table name from the drop down boxes. Select Store new index and allow the default name to be populated.

Reference Table

Click on the columns tab to join the columns to be matched. In this case, Place from the Source Data to be "fuzzy" matched to the Reference Table in SQL Server.

Column Join

Click on the Advanced tab to select the Maximum number of matches you want to see and the level of similarity between the data source and reference data. I am using a threshold of 80% to restrict the number of matches. Going with a lower similarity threshold will increase the number of matches depending on the Maximum number of matches setting.

Advanced Fuzzy Lookup Transformation

Fuzzy Lookup Data Output Storage

Create a table in SQL Server to store the matching data and related similarity, confidence fields.

CREATE TABLE [dbo].[FuzzyLookup](
ID int identity(1,1) not null primary key clustered,
[Place] [nvarchar](255) NULL,
[PlaceNames] [nvarchar](255) NULL,
[Similarity] [real] NULL,
[Confidence] [real] NULL,
[SimilarityPlace] [real] NULL )

Drag and drop and OLE DB Destination on the data flow and drag the green arrow from Fuzzy Lookup on to it. Right click on the OLE DB destination to connect it and map the paths.

OLE DB Connection

Set the connection manager and newly created table name in the drop down boxes.

Connection OLE DB Destination

Map the paths in the Mappings section by clicking on the Available Input Columns and dragging them to the Available Destination Columns. Make sure to match up the similarity and confidence columns to see how closely the data matches.

OLE DB join

Fuzzy Lookup Data Viewers

A preview of the data before it is processed would be a nice addition to the SSIS task. A Data Viewer would accomplish this easily.

Right click on the green arrow between the Fuzzy Lookup and the OLE DB Destination to add a Data Viewer.

Data Viewer

Click Add and select Grid from the Pop up. Click OK to accept the default name for the data viewer.

Grid Viewer

Click OK to close out.

Grid Viewer Data Path

Execute SQL Task - Clearing the Fuzzy Lookup Data Output Table

As this task could be repeated (perhaps with adjustments made to the Similarity in the Fuzzy Lookup), we do not want to continuously append the data to the table. To clear the data from the table, an addition to the Control Flow to remove the data should be added.

Click in the Control Flow tab and add an Execute SQL task. Join the task by dragging the green arrow to the Data flow task.

Execute SQL Task

Right click to edit it and select the Connection and click on the ellipsis (...) box under SQL Statement.

Execute SQL Task Editor

Type in the command to remove all data from the table. Truncate uses less system resources than a delete statement.

SQL Command

Execute the SSIS package. Note that the Data Viewer pops up after the fuzzy lookup prior to the transfer to the database table.

Execute SSIS

In SSMS, execute a SELECT statement to verify the data was transferred into the table.

SELECT * FROM dbo.FuzzyLookup

Data Results

The data results show that any match that was at least 80% similar or above is a potential match. Fuzzy Match does a pretty nice job of checking out potential duplicates in your data.

NOTES: Fuzzy Lookup and Excel Data Source on 64 bit machines SSIS SQL Server 2008R2* - Two interesting problems happened when I attempted to perform a Fuzzy Lookup on SSIS on a 64 bit Windows 7 PC that had Office 2010 (32 bit version) installed. In 64 bit mode, I would execute the task and it would "hang" during the Fuzzy Lookup and never make it to the final destination with the data. The reason was that it crashed because the 32 bit Excel Source file could not transform in the 64 bit environment. Problem #2 happened when I went to Project, Properties, Debugging and changed the SSIS project to 32 bit mode (Run64BitRuntime = false). The Fuzzy Lookup would no longer execute. The solution I used was to separate the Excel file into it's own package in a separate project and have it transform to an OLE DB Destination and in the other project, perform the Fuzzy Lookup between the 2 OLE DB Connections and push the final data into the Fuzzy Lookup table in SQL Server.

*I did not have this trouble on Windows XP with Office 2003 in SSIS Visual Studio 2008 and I currently do not have any trouble with Fuzzy Lookup and Excel in the same package on Windows 7 with Office 2007 in SSIS 2012 SQL Server Data Tools (both 64 bit and 32 bit mode worked).