Applied Data Science Notebook in Python for Beginners to Professionals

Data Science Project – A Guide to Calculate Correlation Between Variables for Machine Learning in Python

For more projects visit: https://setscholars.net

  • More than 5,000 free end-to-end applied machine learning and data science projects are available to download at SETScholars, a Science, Engineering and Technology Scholars community.
In [8]:
# Suppress warnings in Jupyter Notebooks
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In this notebook, we will learn how to calculate the correlation between variables for machine learning in Python, using covariance, Pearson's correlation, and Spearman's correlation.

Python Code

Create a simulated dataset

In [9]:
# generate related variables
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot

# seed random number generator
seed(412)

# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
print()

# plot
pyplot.figure(figsize=(12,8))
pyplot.scatter(data1, data2)
pyplot.show()
data1: mean=98.807 stdv=18.773
data2: mean=148.807 stdv=21.149
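
As a rough sanity check on the simulated data, the population values implied by the construction above can be worked out by hand. This is a minimal sketch, assuming the noise term is independent of data1; the variable names (sigma1, sigma_noise, and so on) are illustrative and do not appear in the notebook. Sample estimates from 1000 draws should land near these values but will not match them exactly.

```python
import math

# Parameters taken from the data-generation step above
sigma1 = 20.0       # standard deviation of data1
sigma_noise = 10.0  # standard deviation of the noise added to form data2

# Assumption: the noise is independent of data1, so variances add
var1 = sigma1 ** 2
var2 = var1 + sigma_noise ** 2

# Cov(X, X + N) = Var(X) when N is independent of X
cov12 = var1

# Population Pearson correlation implied by the construction
corr = cov12 / math.sqrt(var1 * var2)

print('expected stdv of data2: %.3f' % math.sqrt(var2))
print('expected covariance:    %.3f' % cov12)
print('expected correlation:   %.3f' % corr)
```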

Calculate Correlation Between Variables for Machine Learning in Python

Covariance

In [10]:
# calculate the covariance between two variables
from numpy.random import randn
from numpy.random import seed
from numpy import cov

# seed random number generator
seed(412)

# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# calculate covariance matrix
covariance = cov(data1, data2)
print(covariance)
[[352.78813306 348.56882819]
 [348.56882819 447.74847497]]
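
To make the matrix easier to read, the sketch below recomputes the off-diagonal entry by hand as the sum of products of deviations divided by n - 1, which is the default normalization of NumPy's cov(). The diagonal entries are the sample variances of each variable. The names manual_cov and cov_matrix are illustrative only.

```python
import numpy as np
from numpy.random import randn, seed

# Same data-generation recipe as the cell above
seed(412)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# Sample covariance by hand: mean-centre both variables,
# multiply element-wise, and divide by (n - 1)
n = len(data1)
manual_cov = np.sum((data1 - data1.mean()) * (data2 - data2.mean())) / (n - 1)

cov_matrix = np.cov(data1, data2)

print('manual covariance: %.3f' % manual_cov)
print('numpy covariance:  %.3f' % cov_matrix[0, 1])
# Diagonal entries are the sample variances (ddof=1)
print('var(data1): %.3f' % np.var(data1, ddof=1))
print('var(data2): %.3f' % np.var(data2, ddof=1))
```

Covariance is expressed in the product of the two variables' units, so its magnitude is hard to interpret on its own; the correlation coefficients below rescale it to a unit-free value.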

Pearson’s Correlation

In [11]:
# calculate the Pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr

# seed random number generator
seed(412)

# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)
Pearsons correlation: 0.877
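
Pearson's correlation is the covariance divided by the product of the two standard deviations. The following sketch checks that relationship against pearsonr() on the same simulated data; r_manual and r_scipy are illustrative names, not part of the original notebook.

```python
import numpy as np
from numpy.random import randn, seed
from scipy.stats import pearsonr

# Same data-generation recipe as the cell above
seed(412)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# Pearson's r = cov(X, Y) / (std(X) * std(Y)),
# using the same (n - 1) normalization for all three terms
cov_xy = np.cov(data1, data2)[0, 1]
r_manual = cov_xy / (np.std(data1, ddof=1) * np.std(data2, ddof=1))

r_scipy, p_value = pearsonr(data1, data2)

print('manual Pearson r: %.3f' % r_manual)
print('scipy Pearson r:  %.3f' % r_scipy)
```

The second value returned by pearsonr() is a p-value, which the notebook discards with an underscore.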

Spearman’s Correlation

In [12]:
# calculate the Spearman's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import spearmanr

# seed random number generator
seed(412)

# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# calculate Spearman's correlation
corr, _ = spearmanr(data1, data2)
print('Spearmans correlation: %.3f' % corr)
Spearmans correlation: 0.861
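
Spearman's correlation can be understood as Pearson's correlation applied to the ranks of the data, which is why it captures monotonic relationships that are not linear. The sketch below illustrates both points; the exponential transform and the names rho_manual, rho_scipy, and curved are assumptions made for the illustration and are not part of the original notebook.

```python
import numpy as np
from numpy.random import randn, seed
from scipy.stats import pearsonr, rankdata, spearmanr

# Same data-generation recipe as the cell above
seed(412)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# Spearman's rho is Pearson's r computed on the ranks
rho_manual, _ = pearsonr(rankdata(data1), rankdata(data2))
rho_scipy, _ = spearmanr(data1, data2)
print('rank-based Pearson: %.3f' % rho_manual)
print('scipy Spearman:     %.3f' % rho_scipy)

# Hypothetical example: a monotonic but nonlinear function of data1
curved = np.exp(data1 / 20.0)
rho_curved, _ = spearmanr(data1, curved)
r_curved, _ = pearsonr(data1, curved)
print('Spearman on curved data: %.3f' % rho_curved)
print('Pearson on curved data:  %.3f' % r_curved)
```

Because the transform preserves the ordering of the values, Spearman's coefficient stays at the maximum while Pearson's coefficient drops.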

Summary

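In this coding recipe, we discussed how to calculate the correlation between variables for machine learning in Python.

Specifically, we learned the following:

  • How to calculate the covariance matrix of two variables with NumPy's cov().
  • How to calculate Pearson's correlation with SciPy's pearsonr().
  • How to calculate Spearman's correlation with SciPy's spearmanr().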