
08 Apr Python Pandas must-know | detecting outliers in a dataset
If an observation in a dataset is unusually larger or smaller than the rest of the data, we call it a suspected outlier. Suspected outliers can have an undue influence on subsequent calculations, so it is necessary to detect them and handle them properly.
A classic method for identifying suspected outliers in a dataset is the Tukey method. It first computes the first quartile (Q1) and the third quartile (Q3) of the data, then the interquartile range IQR = Q3 - Q1. Any data point less than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR is considered a suspected outlier. We can use this method to detect outliers in a DataFrame. The code is shown below:
import numpy as np
from collections import Counter

def detect_outliers(df, n, features):
    outlier_indices = []  # create an empty list
    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%); drop NaN first, otherwise np.percentile returns NaN
        Q1 = np.percentile(df[col].dropna(), 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col].dropna(), 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR
        # Determine the indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) |
                              (df[col] > Q3 + outlier_step)].index
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)
    # select observations flagged as outliers in more than n features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [k for k, v in outlier_indices.items() if v > n]
    return multiple_outliers
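As a quick sanity check, here is a minimal, self-contained sketch of how the function above might be called. The DataFrame and its values are made up for illustration: one row (index 4) is deliberately extreme in both columns, so it is the only row flagged when we ask for observations that are outliers in more than one feature.

```python
import numpy as np
import pandas as pd
from collections import Counter

# Hypothetical sample data: row 4 is extreme in both columns
df = pd.DataFrame({
    "Age":  [22, 25, 27, 24, 95, 26, 23],
    "Fare": [7.3, 8.1, 7.9, 8.4, 512.0, 7.8, 8.0],
})

def detect_outliers(df, n, features):
    # Tukey method: flag rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per column,
    # then keep rows flagged in more than n columns
    outlier_indices = []
    for col in features:
        Q1 = np.percentile(df[col].dropna(), 25)
        Q3 = np.percentile(df[col].dropna(), 75)
        outlier_step = 1.5 * (Q3 - Q1)
        outlier_list_col = df[(df[col] < Q1 - outlier_step) |
                              (df[col] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)
    counts = Counter(outlier_indices)
    return [k for k, v in counts.items() if v > n]

# Rows that are outliers in more than 1 of the listed features
print(detect_outliers(df, 1, ["Age", "Fare"]))  # → [4]
```

Flagged rows can then be removed with `df.drop(outliers, axis=0)`.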
I made two tiny improvements:
- When calculating Q1 and Q3, if the column contains NaN values, np.percentile returns NaN. So, drop the null values first with dropna().
- multiple_outliers should be built with a list comprehension ([...]) rather than wrapping the generator in list(), because the result is simply a list.
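The first point is easy to demonstrate with a tiny made-up Series (the values here are purely illustrative): np.percentile propagates NaN, while dropping the nulls first yields a valid quartile.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, np.nan, 100.0])

# Without dropna(): the NaN propagates and the quartile is NaN
q1_bad = np.percentile(s, 25)

# With dropna(): NaN entries are excluded first, giving a usable quartile
q1_good = np.percentile(s.dropna(), 25)

print(q1_bad, q1_good)  # → nan 1.75
```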
I used this technique in the Kaggle Titanic competition to remove outliers.