Programming Samples

Click here to go to:



Excel VBA

Word VBA

MS Access

Python

T-SQL

SSIS

SSRS

Power BI

Crystal Reports

SSAS

SQL Replication

C# Code

ASP .NET Code

Oracle PL/SQL

Database Diagramming


Back to Home Page


Python DataSet Comparison

Python Dataset Comparison in Jupyter Notebooks

This article describes how to use Pandas in Python to compare 2 CSV files in Python using DataFrames to find matching and non-matching records. The CSV dataset files were created and downloaded from the convertcsv.com website using the Online Generate Test Data page: https://www.convertcsv.com/generate-test-data.htm. The code will import this data as DataFrames.

 

Load the datasets into DataFrames

Write code to use pandas to read the Names and Address CSV data file into a DataFrame (df1) and read the Names to match CSV file into DataFrame df2. An additional unique integer ID is added to each dataframe.

Python DataSets Import into DataFrames

Code the dataframes to drop the NaN values, if any.

Python dataframe DropNA for NaN values

Add columns for full name - Create code to add a name field to the dataframes for the concatenated First and Last Name columns and strip additional white space on the left of each.

dataframe name fields

Add columns for full address field - Create code to add a combined address field to the dataframes for the concatenated address, city, state, and zip columns and strip additional white space on the left of each.

dataframe address fields

Clean up any extra spaces in the dataframe columns using Regex replace in both dataframes.

dataframe Regex Replacement

Code for Comparing DataFrames write code to match the names from dataframe df2 to the dataframe df1.

Write code to match the names of df2 to dataframe df1. In this sample, names in 3355 rows from df1 match dataframe df2.

DataFrame comparison matching names

Write code to find names from dataframe df2 Not in the dataframe df1. Names in 6645 rows from df1 do not match dataframe df2.

DataFrame comparison to extract non matching names ------------continue with matching 4051 records and outputting tab delimited file with matches to show cartesian product between dataframes due to matching. -------------------------------

Merge the 2 DataFrames by matching names Write code to use an inner merge of the names from dataframe df2 to the dataframe df1.

Merge of both dataframes with matching names

Write code to get the count of the matches by name for the 2 datasets.

Count of the dataframe with matched names

Create code to export the new merge dataframe to a file to review the matched names.

Dataframe with matched names in Tab delimited file for review

The exported data shows that some df2 names match multiple df1 names in the merged dataframe. If we are not concerned with the addresses from df1, then the duplicate names can be removed from this dataframe.

df1 and df2 multiple matched names

Write code to drop the duplicate matches by name for the 2 datasets.

Code to keep only the 1st value of each name and drop the duplicate values with matched names

Write code to get the count of the updated matches by name for the 2 datasets.

Count of the dataframe with duplicate matched names removed

Create code to export the updated merge dataframe to a file to review the removed "matched" names.

Code to generate file with removed matched names

Review the exported data showing that the duplicate names have been removed from the merged dataframe.

Dataframe with removed matched names in Tab delimited file