+1 514-978-9422 •

Repository

Working with Pandas dataframes

Index Python Pandas Aug 28, 2020

Pandas is a Python library that allows you to work on two-dimensional datasets (dataframes), notably on imported Excel and CSV files. It will not only give you the power to analyze large datasets easily but also the ability to manipulate dataframes in a systematic and repeatable way, allowing you to automate complex tasks performed in Excel. You could imagine automating your monthly reporting, consisting of merging datasets, performing vlookups, cleaning data and so forth...

Importing Excel and CSV files

Before working with dataframes, you might want to check out the following post that will guide you through importing datasets from an Excel or CSV file :

Import data from Excel & CSV files
Import data from Google Sheets

Customer dataframe (df)

Let us now work with a basic customer dataset that we will have imported from Excel.

CUST_ID CUST_NAME INDUSTRY SINCE STATE EMPLOYEES
00014 Jabberbean Technical 1995-10-14 Arizona 120
00028 Twiyo Computer Software 1996-01-22 Texas 67
00003 Twiyo Computer Software 1997-06-15 California 42

Selecting Columns

Many times when end up working on dataframes which more columns of data than we actually require. Selecting or removing columns can add some clarity to our work and might also be necessary before merging or grouping dataframes.

import pandas as pd

# Let's work on a CUSTOMERS dataframe

# Keep specific columns : 'CUST_ID' & 'CUST_NAME'
df = df[['CUST_ID', 'CUST_NAME']] 

# Drop a single column 'POSTAL_CODE'
df = df.drop('POSTAL_CODE', axis = 1)

# Drop mutliple columns seperated by a comma
df = df.drop(['POSTAL_CODE', 'STATE'], axis = 1) #or
df = df.drop(['POSTAL_CODE', 'STATE'], axis = 'columns')

Renaming column headers

Column headers in a dataframe can sometimes be pretty messy. You might want to clean them up and give them more meaningful names.

import pandas as pd

# Let's continue working a CUSTOMERS dataframe

# Renaming a single columns
df = df.rename(columns={'CUST_ID':'ID'}, inplace=True)

# Renaming multiple columns seperated by a comma
df = df.rename(columns={'CUST_ID':'ID', 'CUST_NAME': 'NAME'}, inplace=True)

# Replacing characters in all columns : underscore "_" by a space " "
df.columns = df.columns.str.replace('_', ' ')
# 'CUST_CD' becomes 'CUST CD'

# Converting all characters to lowercase.
df.columns = df.columns.str.lower()
# 'CUST NAME' becomes 'cust name'

# Converting all characters to uppercase.
df.columns = df.columns.str.upper()
# 'cust name' becomes 'CUST NAME'

# Converting first character of each word to uppercase and remaining to lowercase.
df.columns = df.columns.str.title()
# 'cust name' becomes 'Cust Name'

# Converting first character to uppercase and remaining to lowercase.
df.columns = df.columns.str.capitalize()
# 'cust name' becomes 'Cust name'

Converting data types in a column

You can convert values in a column to strings, numeric values, dates, etc...

import pandas as pd

# Printing the data types contained in the dataframe
print(df.dtypes)

# Converting a column to a string
df['CUST_ID'] = df['CUST_ID'].astype(str)

# Converting a column to floats (decimal)
df['EMPLOYEES'] = df['EMPLOYEES'].astype(float)

# Converting a column to integers
df['EMPLOYEES'] = df['EMPLOYEES'].astype(int)

# Converting a column to numeric values
df['EMPLOYEES'] = pd.to_numeric(df['EMPLOYEES'], errors = 'coerce')

# Converting a column to date and time
from pandas import datetime
df['SINCE'] = pd.to_datetime(df['SINCE'])

N/A values

Here is a little basic help working around N/A values.

import pandas as pd

# Exclude rows from a dataframe with N/A values in a specific column
df = df[df['STATE'].notna()]

# Replace N/A values with a fixed number
df = df.fillna(0)

# Replace N/A values with the median value
import math
median_employees = math.floor(df.EMPLOYEES.median())  #floor for integer
df.EMPLOYEES.fillna(median_employees)

# df.employees is the same as df['EMPLOYEES']

Filtering data columns

When analyzing data you will usually be required to filter data.

import pandas as pd

# Filtering on a single column

# Exclude 'N/A' values
df = df[df['STATE'].notna()]

# State is equal to 'Arizona'
df = df.loc(df['STATE'] == 'Arizona')

# All states excluding 'California'
df = df.loc(df['STATE'] != 'California')

# Companies with 100 employees or more
df = df.loc(df['EMPLOYEES'] >= 100)

# Filtering on multiple columns

# State is equal to 'California' or 'Texas'
df = df.loc[(df['STATE'] = 'California') | (df['STATE'] = 'Texas')]

# State is equal to 'California' and has less than 100 employees
df = df.loc[(df['STATE'] = 'California') & (df['EMPLOYEES'] < 100)]


Tags: Index Python Pandas