Python Pandas must know | detect the outliers in the dataset - Jacky Yuan | Digital Marketing Consultant

Python Pandas must know | detect the outliers in the dataset

In a dataset, an observation that is unusually large or small compared with the rest of the data is called a suspected outlier. Suspected outliers can have an undue influence on subsequent calculations, so they need to be detected and handled properly.

A classic method for finding suspected outliers in a dataset is the Tukey method. It first computes the first quartile (Q1) and the third quartile (Q3) of the data, then takes their difference as the interquartile range (IQR = Q3 - Q1). Any data point less than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR is considered a suspected outlier. We can use this method to detect outliers in a DataFrame. The code is shown below:

import numpy as np
from collections import Counter

def detect_outliers(df, n, features):
    outlier_indices = []  # row indices flagged as outliers, collected across all features
    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col].dropna(), 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col].dropna(), 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)
    # count how many features flagged each row, then keep rows flagged more than n times
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [k for k, v in outlier_indices.items() if v > n]
    return multiple_outliers
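
As a quick check of the fences computed inside detect_outliers, here is a minimal sketch that applies the same Tukey rule to a single array. The sample values are made up for illustration, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical sample values; 40 is the suspicious one.
values = np.array([1, 2, 3, 4, 5, 6, 7, 40], dtype=float)

q1 = np.percentile(values, 25)   # first quartile
q3 = np.percentile(values, 75)   # third quartile
iqr = q3 - q1                    # interquartile range

lower = q1 - 1.5 * iqr           # lower Tukey fence
upper = q3 + 1.5 * iqr           # upper Tukey fence

# Values outside the fences are suspected outliers.
suspected = values[(values < lower) | (values > upper)]
print(q1, q3, lower, upper, suspected)
```

Here Q1 is 2.75 and Q3 is 6.25, so the fences are -2.5 and 11.5, and only 40 falls outside them.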

I made two small improvements:

  1. When calculating Q1 and Q3, np.percentile returns NaN if the column contains NaN values, so drop the null values first.
  2. multiple_outliers should be built with [] (a list comprehension) rather than wrapped in list(), because the result is already a list.
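
The first point is easy to verify on a toy column. The sketch below uses a made-up Series. It also shows np.nanpercentile, a NumPy function that ignores NaN values; this is an alternative to dropna() and not something the function above uses.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
col = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

# With the NaN left in, the percentile itself is NaN.
with_nan = np.percentile(col.to_numpy(), 25)

# Dropping nulls first gives a usable quartile.
after_dropna = np.percentile(col.dropna(), 25)

# np.nanpercentile skips NaN values and gives the same result.
nan_aware = np.nanpercentile(col.to_numpy(), 25)

print(with_nan, after_dropna, nan_aware)
```

If Q1 or Q3 is NaN, both comparisons against the fences are False for every row, so the column silently reports no outliers at all.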

I used this technique in the Kaggle Titanic competition to remove outliers.
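
The removal step can be sketched as follows. This is not the code from the competition: the DataFrame, its column names, and the threshold n = 1 are all assumptions for illustration, and it uses a vectorized pandas variant of the same Tukey rule instead of calling detect_outliers, so the block stays self-contained.

```python
import pandas as pd

# Hypothetical data: row 9 is extreme in both columns, row 8 only in "b".
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 90, 200],
})
features = ["a", "b"]
n = 1  # drop rows that are outliers in more than n features

# Per-column quartiles; DataFrame.quantile skips NaN values by default.
q1 = df[features].quantile(0.25)
q3 = df[features].quantile(0.75)
step = 1.5 * (q3 - q1)

# Boolean table: True where a cell lies outside its column's Tukey fences.
is_outlier = (df[features] < q1 - step) | (df[features] > q3 + step)

# Count flagged features per row and collect the rows to drop.
outlier_counts = is_outlier.sum(axis=1)
to_drop = outlier_counts[outlier_counts > n].index

cleaned = df.drop(to_drop).reset_index(drop=True)
print(list(to_drop), len(cleaned))
```

With this data only row 9 is dropped. Row 8 is flagged in a single feature, so it stays when n = 1.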
