Applied Data Science Notebook in Python for Beginners to Professionals

Data Science Project – A Guide to detect and remove outliers for Machine Learning in Python

Machine Learning for Beginners - A Guide to detect and remove outliers for Machine Learning in Python

For more projects visit: https://setscholars.net

  • More than 5,000 free end-to-end applied machine learning and data science projects are available to download at SETScholars, a Science, Engineering and Technology Scholars community.
In [5]:
# Suppress warnings in Jupyter Notebooks
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

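In this notebook, we will learn how to detect and remove outliers for Machine Learning in Python using two univariate methods: the standard deviation method and the interquartile range (IQR) method.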

Python Code

Create a simulated dataset

In [6]:
# generate gaussian data
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std

# seed the random number generator
seed(1)

# generate univariate observations
data = 5 * randn(10000) + 50

# summarize
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))
mean=50.049 stdv=4.994

Detect and Remove Outliers for Machine Learning in Python

Standard Deviation Method

In [7]:
# identify outliers with standard deviation
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std

# seed the random number generator
seed(1)

# generate univariate observations
data = 5 * randn(10000) + 50

# calculate summary statistics
data_mean, data_std = mean(data), std(data)

# define the outlier cutoff: 3 standard deviations either side of the mean
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off

# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print(); print('Identified outliers: %d' % len(outliers))

# remove outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print(); print('Non-outlier observations: %d' % len(outliers_removed))
Identified outliers: 29

Non-outlier observations: 9971
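
The list comprehensions above loop over all 10,000 values in Python. As a minimal sketch of an alternative, the same cut can be written with a NumPy boolean mask. The helper name `split_by_std` and its `n_std` parameter are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

def split_by_std(values, n_std=3.0):
    """Split values into (outliers, inliers) using mean +/- n_std * std.

    Hypothetical helper for illustration; not from the original notebook.
    """
    values = np.asarray(values, dtype=float)
    center, spread = values.mean(), values.std()
    lower, upper = center - n_std * spread, center + n_std * spread
    # True where the observation lies inside the closed interval [lower, upper]
    inlier_mask = (values >= lower) & (values <= upper)
    return values[~inlier_mask], values[inlier_mask]

# Same simulated data as in the cell above
np.random.seed(1)
data = 5 * np.random.randn(10000) + 50

outliers, inliers = split_by_std(data)
print('Identified outliers: %d' % outliers.size)
print('Non-outlier observations: %d' % inliers.size)
```

Because the seed and the bounds are the same, this should report the same counts as the loop version. The mask form also keeps the result as a NumPy array, which is usually more convenient for later modelling steps.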

Interquartile Range Method

In [8]:
# identify outliers with interquartile range
from numpy.random import seed
from numpy.random import randn
from numpy import percentile

# seed the random number generator
seed(1)

# generate univariate observations
data = 5 * randn(10000) + 50

# calculate interquartile range
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25

print(); print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))

# calculate the outlier cutoff
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print(); print('Identified outliers: %d' % len(outliers))

# remove outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print(); print('Non-outlier observations: %d' % len(outliers_removed))
Percentiles: 25th=46.685, 75th=53.359, IQR=6.674

Identified outliers: 81

Non-outlier observations: 9919
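
The IQR rule flags more points (81) than the three-standard-deviation rule (29) on the same data. A short calculation, assuming the data are exactly Gaussian, suggests why: the 1.5 × IQR fences sit at roughly 2.7 standard deviations from the mean, which is tighter than 3. The sketch below uses only the standard library's `statistics.NormalDist`; the variable and function names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Upper quartile of a standard normal, and the resulting IQR
q75 = std_normal.inv_cdf(0.75)
iqr = 2 * q75  # symmetric distribution, so q25 == -q75

# Upper fence of the 1.5 * IQR rule, in units of standard deviations
iqr_fence = q75 + 1.5 * iqr

def two_sided_tail(z):
    """Probability that a standard normal value falls outside [-z, z]."""
    return 2 * (1 - std_normal.cdf(z))

n = 10000
print('IQR fence in standard deviations: %.3f' % iqr_fence)
print('Expected outliers, 3-sigma rule: %.1f' % (n * two_sided_tail(3.0)))
print('Expected outliers, 1.5*IQR rule: %.1f' % (n * two_sided_tail(iqr_fence)))
```

For 10,000 Gaussian observations this gives expected counts of roughly 27 and 70, broadly in line with the 29 and 81 observed above once sampling variation is taken into account.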

Summary

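In this coding recipe, we discussed how to detect and remove outliers for Machine Learning in Python.

Specifically, we have learned the following:

  • How to detect and remove outliers with the standard deviation method, treating values more than 3 standard deviations from the mean as outliers.
  • How to detect and remove outliers with the interquartile range method, treating values more than 1.5 × IQR below the 25th percentile or above the 75th percentile as outliers.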