+1 514-978-9422 •

Repository

Python FuzzyWuzzy

Python FuzzyWuzzy Aug 28, 2020

FuzzyWuzzy is a Python package used for approximate string matching. In the following case it is used for data cleaning where a field would be typed in manually by a user instead of using a dropdown combobox.

Import

Import the required Pandas and FuzzyWuzzy packages.

# Python 3
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Import the Excel data

Import the Excel Speardsheet into a Pandas DataFrame. This speadsheet would contain a column with manually typed categorical data that you want to clean. There can be a number of issues with data typed manually instead of being selected form a predefined list of values in a ComboBox. There can be typos and inconsistent categorization.

# Import the Excel file containing the data
df = pd.read_excel(r'C:\Users\Grant\Documents\Sales_Report.xlsx')

Proper categorization

Create a clean list containing the standardized values which will replace the manually typed values in the data column from the imported spreadsheet. In this case we will imagine that the names of sales people where typed manually. For instance, "Mary Johnson" could have therefore been entered properly or as "M. Johnson" or "Mary Johnnson".

# Clean list of sales people
SalesPeopleList= ['Mary Johnson', 'Robert Williams', 'James Miller', 'Patricia Brown', 'Charles Lewis']

FuzzyWuzzy matching process

Now let's match the standardized list of sales people (SalesPeopleList) with the data column (SalesPerson) in the Dataframe. The script loops through is row of the SalesPerson column in the dataframe and determines the closest match to the list of Sales People. For each row, a new item is appended to the MatchColumn array.

# Match the data in the dataframe column with the clean values
i = 0
match_col = []
for m in df['SalesPerson']:
    h = 0
    ans = ''
    if m == '':
        match_col.append('')
        continue
    for a in SalesPeopleList:
        k = fuzz.ratio(a, m)
        if h < k:
            h = k
            ans = a
    MtachColumn.append(ans)

Add the matched column to the dataframe

Once the matching process is completed, we can then add the resulting new data from the MatchCol array to the DataFrame as SalesPersonMatch.

# Add the new column containing the matched SUV models
df['SalesPersonMatch'] = MatchColumn

Caveats

You might encounter issues where the standardized list contains very similar values. This might require a workaround like a first pass to split the data and matching the resulting data against seperate standardized lists.



Tags: Python FuzzyWuzzy