Handling Missing Data with Python Pandas - Jacky Yuan | Digital Marketing Consultant

Handling Missing Data with Python Pandas

Missing values are a fact of life in real-world data analysis. Production data almost always has gaps, and deciding how to deal with them is a critical step.

I once messed up a project at work because I didn’t know how to deal with missing data. Although my manager eventually found another data source to fill in the null values, I was still blamed for not being able to perform the analysis. Now that my skills have improved, I would like to share what I have learned so that you can avoid the same embarrassing situation.

The demo data source I use is the ‘Titanic’ dataset. I will upload it here.

In the dataset, the Age column has 177 missing values.

We will use three different methods to handle them.

First, let’s import the libraries.

# data analysis and wrangling
import pandas as pd
import numpy as np

# machine learning
from sklearn.ensemble import RandomForestRegressor
df = pd.read_csv('/...../train.csv')
#get the number of null values in each column
df.isnull().sum()

Method 1: Treat missing values as a category of their own, for example, use ‘none’ for categorical types.

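#First method, fill 'Age' and 'Cabin' with 'none'
cols = ['Age','Cabin']
for col in cols:
    df[col] = df[col].fillna('none')
#if you only want to fill one column, pretty straightforward, right?
#(note: a string placeholder turns the numeric Age column into text)
df['Age'] = df['Age'].fillna('none')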

If the column with missing values is numeric, fill it with 0 instead.
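
As a minimal sketch of Method 1, assuming a small made-up DataFrame named toy (not the Titanic file), fillna also accepts a dictionary, so each column can get its own placeholder in a single call:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two Titanic-style columns
toy = pd.DataFrame({
    'Cabin': ['C85', np.nan, 'E46', np.nan],
    'Fare': [71.28, np.nan, 51.86, 8.05],
})

# One placeholder per column: a 'none' category for text, 0 for numbers
filled = toy.fillna({'Cabin': 'none', 'Fare': 0})

print(filled)
print(filled.isnull().sum())  # every column should now report 0 nulls
```

Keeping the numeric column numeric matters here: writing a string such as 'none' into a numeric column changes it to an object column, and numeric operations like a median may no longer work on it.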

Method 2: Fill in missing values with specific statistical values such as average, median, and mode.

#let's fill the missing ages with the overall median age
df['Age'] = df['Age'].fillna(df['Age'].median())
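
To compare the three statistics named in Method 2, here is a small sketch on a made-up Series of ages (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps
ages = pd.Series([22.0, 38.0, 26.0, np.nan, 35.0, 26.0, np.nan], name='Age')

# Each statistic ignores NaN by default when it is computed
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())
# mode() returns a Series because ties are possible; take the first entry
mode_filled = ages.fillna(ages.mode().iloc[0])

print(pd.DataFrame({'mean': mean_filled,
                    'median': median_filled,
                    'mode': mode_filled}))
```

The mean is pulled toward outliers, while the median and mode are more robust, which is one reason the median is a common choice for a skewed column such as age.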

Admittedly, a single global median is a crude choice for feature engineering, because every passenger with a missing age gets the same value.

We want a better way here: based on the Name column, we extract the different titles and fill each null value with the median age of its title.

We will extract the title from the ‘Name’ column. For example, from ‘Heikkinen, Miss. Laina’ we only want to extract ‘Miss’.

We will use string splitting to extract it.

#convert Name to Title
df['Title'] = [i.split(',')[1].split('.')[0].strip() for i in df['Name']]
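
The extraction above relies on split. A regex-based variant is sketched below, assuming sample strings in the same ‘Surname, Title. Given names’ format; the names Series is only for illustration:

```python
import pandas as pd

# Sample strings in the 'Surname, Title. Given names' format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

print(titles.tolist())  # ['Mr', 'Miss', 'the Countess']
```

Note that a multi-word title comes out with its prefix (‘the Countess’), with either approach, so the keys of any later mapping should match the strings that are actually extracted.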

Then we are going to map some rare titles to more frequent ones like ‘Mr’ and ‘Mrs’. Finally, we fill the null ages in each title group with that title’s median age.

# Replacing rare titles with more common ones
mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr', 'Don': 'Mr', 'Mme': 'Miss',
          'Jonkheer': 'Mr', 'Lady': 'Mrs', 'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}
df.replace({'Title': mapping}, inplace=True)
titles = ['Dr', 'Master', 'Miss', 'Mr', 'Mrs', 'Rev']
# median age per title, looked up by label rather than by position
median_age_by_title = df.groupby('Title')['Age'].median()
for title in titles:
    age_to_impute = median_age_by_title[title]
    df.loc[(df['Age'].isnull()) & (df['Title'] == title), 'Age'] = age_to_impute

An alternative is to use a lambda function.

#alternative way to write it using a lambda function
#transform keeps the original row index, so the result lines up with df
df['Age1']=df.groupby('Title')['Age'].transform(lambda x: x.fillna(x.median()))
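
To check the group-wise fill on something small, here is a sketch with a made-up frame (the toy name and its values are assumptions for illustration). It uses transform, which returns a result aligned with the original rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age per title group
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age': [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# Within each title group, replace NaN with that group's median
toy['AgeFilled'] = (
    toy.groupby('Title')['Age']
       .transform(lambda s: s.fillna(s.median()))
)

print(toy)
# The 'Mr' gap becomes 30.0 (median of 20 and 40),
# the 'Miss' gap becomes 20.0 (median of 10 and 30).
```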

Method 3: Use a predictive model or similar methods to fill in missing values.

We use other variables to predict Age. Let’s import the Random Forest module here.

from sklearn.ensemble import RandomForestRegressor

def set_missing_ages(df):

    #  Take the existing numerical features and throw them into Random Forest Regressor
    age_df = df[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]

    # Split passengers into rows with known age and rows with unknown age
    # (boolean masks act like a WHERE clause), then convert to NumPy arrays
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values

    # y is the target age
    y = known_age[:, 0]

    # X holds the feature values (Fare, Parch, SibSp, Pclass)
    X = known_age[:, 1:]

    # fit into RandomForestRegressor, -1 means use all processors
    rfr = RandomForestRegressor(random_state=10,n_estimators=1900, n_jobs=-1)
    rfr.fit(X, y)

    #  Use the obtained model to predict unknown age results
    predictedAges = rfr.predict(unknown_age[:, 1:])

    # Fill the original missing data with the obtained prediction results
    df.loc[ (df.Age.isnull()), 'Age' ] = predictedAges 

    return df, rfr
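
The same idea can be tried end to end on synthetic data. The sketch below is not the function above but a compact, hypothetical variant (impute_with_forest, the toy frame, and the smaller n_estimators are all assumptions chosen to keep the example quick):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_forest(df, target, features, n_estimators=50):
    """Fill missing values of `target` with predictions from `features`."""
    known = df[df[target].notnull()]
    unknown = df[df[target].isnull()]
    if unknown.empty:
        return df  # nothing to impute
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=10)
    model.fit(known[features], known[target])
    out = df.copy()  # leave the caller's DataFrame untouched
    out.loc[unknown.index, target] = model.predict(unknown[features])
    return out

# Synthetic data: age loosely depends on fare, plus some noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Fare': rng.uniform(5, 100, size=60),
    'Pclass': rng.integers(1, 4, size=60),
})
toy['Age'] = 20 + 0.2 * toy['Fare'] + rng.normal(0, 2, size=60)
toy.loc[::6, 'Age'] = np.nan  # hide every sixth age

result = impute_with_forest(toy, 'Age', ['Fare', 'Pclass'])
print(toy['Age'].isnull().sum(), '->', result['Age'].isnull().sum())
```

With the function defined in this post, the equivalent call would be df, rfr = set_missing_ages(df), run on a freshly loaded DataFrame in which Age still contains nulls; if Age has already been filled by one of the earlier methods, there is nothing left to predict.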

Thanks for following the tutorial. I hope it helps you. If you have any comments, please leave them below to let me know.