# Applied Data Science Notebook in Python for Beginners to Professionals

## Data Science Project – A Guide to detect and remove outliers for Machine Learning in Python

### Machine Learning for Beginners - A Guide to detect and remove outliers for Machine Learning in Python

For more projects visit: https://setscholars.net

• There are 5000+ free end-to-end applied machine learning and data science projects available to download at SETScholars. SETScholars is a Science, Engineering and Technology Scholars community.
In [5]:
# Suppress warnings in Jupyter Notebooks
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')


In this notebook, we will learn how to detect and remove outliers for Machine Learning in Python, using two methods: the standard deviation method and the interquartile range (IQR) method.

## Python Codes

### Create a simulated dataset

In [6]:
# generate gaussian data
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std

# seed the random number generator
seed(1)

# generate univariate observations
data = 5 * randn(10000) + 50

# summarize
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))

mean=50.049 stdv=4.994
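
The cell above uses NumPy's legacy `seed`/`randn` functions. NumPy's documentation recommends the `Generator` interface for new code, so here is a minimal sketch of a comparable dataset built that way. The names `rng` and `data_alt` are illustrative and not part of the original notebook. Because the generator draws from a different random stream, the printed mean and standard deviation will be close to, but not identical to, the output above.

```python
import numpy as np

# Assumption: a Generator seeded with 1. This does NOT reproduce the
# legacy seed(1)/randn stream used elsewhere in this notebook.
rng = np.random.default_rng(1)

# draw 10,000 observations from a Gaussian with mean 50 and standard deviation 5
data_alt = rng.normal(loc=50, scale=5, size=10000)

# summarize
print('mean=%.3f stdv=%.3f' % (data_alt.mean(), data_alt.std()))
```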


## Detect and Remove Outliers for Machine Learning in Python

### Standard Deviation Method

In [7]:
# identify outliers with standard deviation
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std

# seed the random number generator
seed(1)

# generate univariate observations
data = 5 * randn(10000) + 50

# calculate summary statistics
data_mean, data_std = mean(data), std(data)

# calculate the outlier cutoff (3 standard deviations either side of the mean)
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off

# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print(); print('Identified outliers: %d' % len(outliers))

# remove outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print(); print('Non-outlier observations: %d' % len(outliers_removed))

Identified outliers: 29

Non-outlier observations: 9971
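
The list comprehensions above loop over the array one element at a time in Python. The same rule can be sketched with a NumPy boolean mask, which keeps the result as an array and avoids the explicit loop. The helper name `split_outliers_std` and its parameter `n_std` are hypothetical, introduced here only for illustration, and the small `sample` array is made up so the result is easy to check by hand.

```python
import numpy as np

def split_outliers_std(values, n_std=3.0):
    """Split values into (kept, outliers) using mean +/- n_std * std."""
    values = np.asarray(values, dtype=float)
    cut_off = values.std() * n_std
    lower, upper = values.mean() - cut_off, values.mean() + cut_off
    # True where the observation lies inside the accepted range
    inside = (values >= lower) & (values <= upper)
    return values[inside], values[~inside]

# illustrative sample (not the notebook's data): twenty 10s and a single 100
sample = np.array([10.0] * 20 + [100.0])
kept, outliers = split_outliers_std(sample)

print('Identified outliers: %d' % len(outliers))
print('Non-outlier observations: %d' % len(kept))
```

Calling `split_outliers_std(data)` on the simulated dataset applies the same bounds as the list-comprehension version in the cell above.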


### Interquartile Range Method

In [8]:
# identify outliers with interquartile range
from numpy.random import seed
from numpy.random import randn
from numpy import percentile

# seed the random number generator
seed(1)

# generate univariate observations
data = 5 * randn(10000) + 50

# calculate interquartile range
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25

print(); print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))

# calculate the outlier cutoff
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print(); print('Identified outliers: %d' % len(outliers))

# remove outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print(); print('Non-outlier observations: %d' % len(outliers_removed))

Percentiles: 25th=46.685, 75th=53.359, IQR=6.674

Identified outliers: 81

Non-outlier observations: 9919
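
As with the standard deviation method, the IQR rule can be sketched in vectorized form. The helper name `split_outliers_iqr` and its parameter `k` (the IQR multiplier, 1.5 in the cell above) are hypothetical names chosen for this example, and the `sample` array is a made-up toy input rather than the notebook's data.

```python
import numpy as np

def split_outliers_iqr(values, k=1.5):
    """Split values into (kept, outliers) using the k * IQR fences."""
    values = np.asarray(values, dtype=float)
    # both percentiles in one call
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = (q75 - q25) * k
    lower, upper = q25 - cut_off, q75 + cut_off
    # True where the observation lies between the two fences
    inside = (values >= lower) & (values <= upper)
    return values[inside], values[~inside]

# illustrative sample (not the notebook's data): 1..9 plus one extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
kept, outliers = split_outliers_iqr(sample)

print('Identified outliers: %d' % len(outliers))
print('Non-outlier observations: %d' % len(kept))
```

Because the fences are built from percentiles rather than the mean and standard deviation, a single extreme value has less influence on them than it does on the 3-standard-deviation limits.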


## Summary

In this coding recipe, we discussed how to detect and remove outliers for Machine Learning in Python.

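Specifically, we have learned the following:

• How to detect and remove outliers with the standard deviation method (values more than 3 standard deviations from the mean).
• How to detect and remove outliers with the interquartile range method (values more than 1.5 × IQR below the 25th percentile or above the 75th percentile).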
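
As a rough sanity check on the outlier counts reported in the two method sections, the fraction of a perfectly Gaussian population that each rule would flag can be computed directly. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+). The function and variable names are illustrative, and the calculation assumes exactly Gaussian data, which a finite sample only approximates.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def tail_fraction(z):
    """Fraction of a Gaussian lying more than z standard deviations from the mean."""
    return 2.0 * (1.0 - std_normal.cdf(z))

# standard deviation method: cutoff at 3 standard deviations
frac_std = tail_fraction(3.0)

# IQR method: fences sit 1.5 * IQR beyond the quartiles
q75 = std_normal.inv_cdf(0.75)   # 75th percentile in standard units
iqr = 2.0 * q75                  # the quartiles are symmetric about 0
fence = q75 + 1.5 * iqr
frac_iqr = tail_fraction(fence)

n = 10000
print('3-sigma rule: %.2f%% expected, about %d of %d' % (100 * frac_std, round(n * frac_std), n))
print('1.5*IQR rule: %.2f%% expected, about %d of %d' % (100 * frac_iqr, round(n * frac_iqr), n))
```

Under these assumptions the 3-standard-deviation rule flags roughly 0.27% of observations and the 1.5 × IQR rule roughly 0.70%, which is broadly in line with the 29 and 81 outliers found in the 10,000 simulated observations.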