The k-nearest neighbors algorithm, or kNN, is one of the simplest machine learning algorithms. Usually, k is a small, odd number - sometimes only 1. Increasing k makes the classification more robust to noise in the training data, but it also blurs class boundaries and makes each classification take longer, so a larger k does not automatically mean higher accuracy.
Let’s say you want to classify an object into one of several classes -- for example, "pictures containing a face" and "pictures not containing a face". You do this by looking at the k elements of the training set that are closest to the one you want to classify, and letting them vote by majority on what that object’s class should be. If k = 3 and two of your three closest elements are in class A while one is in class B, you would conclude that the element you are trying to classify belongs in class A. "Closest" here means distance in n-dimensional feature space, most commonly the Euclidean distance.
There's also something called weighted kNN, which is like kNN except that closer neighbors count as stronger votes. If there is one nearby example of class A, and two examples of class B that are farther away, the algorithm might still classify the input as class A. Both voting schemes are sketched below.
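Here is a minimal sketch of both voting schemes in base R; the points, the labels and the query are invented purely for illustration:
# A tiny kNN by hand (toy data)
train_x <- matrix(c(1, 1, 1, 2, 5, 5, 6, 5), ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "B", "B")
query <- c(2, 2)
k <- 3
# Euclidean distance from the query to every training point
d <- sqrt(rowSums((train_x - matrix(query, nrow(train_x), 2, byrow = TRUE))^2))
nearest <- order(d)[1:k] # indices of the k closest points
names(which.max(table(train_y[nearest]))) # plain majority vote -> "A"
# Weighted variant: each neighbor votes with weight 1/distance
w <- tapply(1 / d[nearest], train_y[nearest], sum)
names(which.max(w)) # -> "A"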
Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares discriminant analysis (PLS-DA) is a variant used when Y is categorical.
PLS is used to find the fundamental relations between two matrices (X and Y), i.e. a latent variable approach to modeling the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among X values. By contrast, standard regression will fail in these cases (unless it is regularized).
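As a quick illustration of the "more variables than observations" setting, here is a small sketch using the pls package on simulated data (the sizes, the seed and the number of components are arbitrary choices made for this example):
library(pls) # provides plsr()
set.seed(1)
n <- 20; p <- 50 # far more predictors than observations
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.1) # response driven by two directions in X
fit <- plsr(y ~ X, ncomp = 3, validation = "CV")
summary(fit) # variance explained per component, plus cross-validated error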
Partial least squares was introduced by the Swedish statistician Herman O. A. Wold, who then developed it with his son, Svante Wold. An alternative term for PLS (and more correct according to Svante Wold) is projection to latent structures, but the term partial least squares is still dominant in many areas. Although the original applications were in the social sciences, PLS regression is today most widely used in chemometrics and related areas. It is also used in bioinformatics, sensometrics, neuroscience, and anthropology.
Here we are going to apply kNN (in its weighted form, via the kknn package), PLS, and penalized discriminant analysis (PDA) to the Telecom Churn Dataset.
library(DBI)       # database interface (not used directly in this CSV-based walkthrough)
library(corrgram)  # correlation diagrams
library(caret)     # model training, tuning and evaluation
library(gridExtra) # arranging multiple grid-based plots
library(ggpubr)    # publication-ready ggplot2 figures
Today is also a good time to start parallelizing your code. The common motivation behind parallel computing is that something is taking too long. For some people that means any computation taking more than 3 minutes – worth fixing, because parallelization is incredibly simple to set up, and most long-running tasks, including the repeated cross-validation we run below, are embarrassingly parallel. The next cells register a parallel backend for caret:
# process in parallel on Windows
library(doParallel)
cl <- makeCluster(detectCores(), type='PSOCK') # one worker per detected CPU core
registerDoParallel(cl)
# process in parallel on Mac OSX and UNIX like systems
library(doMC)
registerDoMC(cores = 4)
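Once all model training is finished, the PSOCK workers from the Windows branch should be released; a one-liner for the very end of the script:
# Shut down the parallel workers started with makeCluster() above
# stopCluster(cl)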
#Set working directory where CSV is located
#getwd()
#setwd("...YOUR WORKING DIRECTORY WITH A DATASET...")
#getwd()
# Load the DataSets:
dataSet <- read.csv("TelcoCustomerChurnDataset.csv", header = TRUE, sep = ',')
colnames(dataSet) #Check the dataframe column names
# Print top 10 rows in the dataSet
head(dataSet, 10)
# Print last 10 rows in the dataSet
tail(dataSet, 10)
# Dimension of Dataset
dim(dataSet)
# Check Data types of each column
table(unlist(lapply(dataSet, class)))
# Check Data types of individual column
data.class(dataSet$Account_Length)
data.class(dataSet$Vmail_Message)
data.class(dataSet$Day_Mins)
data.class(dataSet$Eve_Mins)
data.class(dataSet$Night_Mins)
data.class(dataSet$Intl_Mins)
data.class(dataSet$CustServ_Calls)
data.class(dataSet$Intl_Plan)
data.class(dataSet$Vmail_Plan)
data.class(dataSet$Day_Calls)
data.class(dataSet$Day_Charge)
data.class(dataSet$Eve_Calls)
data.class(dataSet$Eve_Charge)
data.class(dataSet$Night_Calls)
data.class(dataSet$Night_Charge)
data.class(dataSet$Intl_Calls)
data.class(dataSet$Intl_Charge)
data.class(dataSet$State)
data.class(dataSet$Phone)
data.class(dataSet$Churn)
# Encode the categorical columns as numeric level codes; going through
# as.factor() works whether read.csv() returned factors or (as in R >= 4.0,
# where stringsAsFactors defaults to FALSE) character columns
dataSet$Intl_Plan <- as.numeric(as.factor(dataSet$Intl_Plan))
dataSet$Vmail_Plan <- as.numeric(as.factor(dataSet$Vmail_Plan))
dataSet$State <- as.numeric(as.factor(dataSet$State))
# Check Data types of each column
table(unlist(lapply(dataSet, class)))
# Find out if there are missing values in rows
rowSums(is.na(dataSet))
# Find out if there are missing values in columns
colSums(is.na(dataSet))
#Checking missing values with the mice package
library(mice)
md.pattern(dataSet)
#Checking missing values with the VIM package
library(VIM)
mice_plot <- aggr(dataSet, col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(dataSet[1:21]), cex.axis=.9,
gap=3, ylab=c("Missing data","Pattern"))
From these plots we can conclude that the dataset contains no missing values.
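The same conclusion is available as a one-line check in base R:
# TRUE if any value anywhere in the data frame is NA
anyNA(dataSet)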
# Selecting just columns with numeric data type
numericalCols <- colnames(dataSet[c(1:7,9:20)])
Difference between the lapply and sapply functions (we will use both in the next 2 cells):
We use lapply when we want to apply a function to each element of a list in turn and get a list back.
We use sapply when we want to apply a function to each element of a list in turn, but we want the result simplified to a vector (or matrix) rather than a list.
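A toy contrast of the two (the list here is invented purely for demonstration):
# Same computation, different return types
x <- list(a = 1:3, b = 4:6)
lapply(x, sum) # returns a list: $a = 6, $b = 15
sapply(x, sum) # returns a named vector: a = 6, b = 15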
#Sum
lapply(dataSet[numericalCols], FUN = sum)
#Mean
lapply(dataSet[numericalCols], FUN = mean)
#median
lapply(dataSet[numericalCols], FUN = median)
#Min
lapply(dataSet[numericalCols], FUN = min)
#Max
lapply(dataSet[numericalCols], FUN = max)
#Length
lapply(dataSet[numericalCols], FUN = length)
# Sum
sapply(dataSet[numericalCols], FUN = sum)
# Mean
sapply(dataSet[numericalCols], FUN = mean)
# Median
sapply(dataSet[numericalCols], FUN = median)
# Min
sapply(dataSet[numericalCols], FUN = min)
# Max
sapply(dataSet[numericalCols], FUN = max)
# Length
sapply(dataSet[numericalCols], FUN = length)
In the next few cells, you will find three different options for aggregating the data.
# OPTION 1: (Using Aggregate FUNCTION - all variables together)
aggregate(dataSet[numericalCols], list(dataSet$Churn), summary)
# OPTION 2: (Using Aggregate FUNCTION - variables separately)
aggregate(dataSet$Intl_Mins, list(dataSet$Churn), summary)
aggregate(dataSet$Day_Mins, list(dataSet$Churn), summary)
aggregate(dataSet$Night_Mins, list(dataSet$Churn), summary)
# OPTION 3: (Using "by" FUNCTION instead of "Aggregate" FUNCTION)
by(dataSet$Intl_Mins, dataSet[8], FUN = summary) # column 8 is Churn
by(dataSet$Day_Mins, dataSet[8], FUN = summary)
by(dataSet$Night_Mins, dataSet[8], FUN = summary)
# Correlations/covariances among numeric variables
library(Hmisc)
cor(dataSet[c(2,5,11,13,16,18)], use="complete.obs", method="kendall")
cov(dataSet[c(2,5,11,13,16,18)], use="complete.obs")
# Correlations with significance levels
rcorr(as.matrix(dataSet[c(2,5,11,13,16,18)]), type="pearson")
# Pie Chart from data
mytable <- table(dataSet$Churn)
lbls <- paste(names(mytable), "\n", mytable, sep="")
pie(mytable, labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Classes\n (with sample sizes)")
# Barplot of categorical data
par(mfrow=c(1,1))
barplot(table(dataSet$Churn), ylab = "Count",
col=c("darkblue","red"))
barplot(prop.table(table(dataSet$Churn)), ylab = "Proportion",
col=c("darkblue","red"))
barplot(table(dataSet$Churn), xlab = "Count", horiz = TRUE,
col=c("darkblue","red"))
barplot(prop.table(table(dataSet$Churn)), xlab = "Proportion", horiz = TRUE,
col=c("darkblue","red"))
# Scatterplot Matrices from the gclus Package
library(gclus)
dta <- dataSet[c(2,5,11,13,16,18)] # get data
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Variables Ordered and Colored by Correlation" )
corrgram(dataSet[c(2,5,11,13,16,18)], order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt, main=" ")
# More graphs on correlations among data
# Using "Hmisc"
res2 <- rcorr(as.matrix(dataSet[,c(2,5,11,13,16,18)]))
# Extract the correlation coefficients
res2$r
# Extract p-values
res2$P
# Using "corrplot"
library(corrplot)
library(RColorBrewer)
corrplot(res2$r, type = "upper", order = "hclust", col=brewer.pal(n=8, name="RdYlBu"),
tl.col = "black", tl.srt = 45)
corrplot(res2$r, type = "lower", order = "hclust", col=brewer.pal(n=8, name="RdYlBu"),
tl.col = "black", tl.srt = 45)
# Using PerformanceAnalytics
library(PerformanceAnalytics)
data <- dataSet[, c(2,5,11,13,16,18)]
chart.Correlation(data, histogram=TRUE, pch=19)
# Using Colored Heatmap
col <- colorRampPalette(c("blue", "white", "red"))(20)
heatmap(x = res2$r, col = col, symm = TRUE)
We should notice that Night_Mins and Night_Charge have a strong, positive linear relationship, which is exactly what we would expect if the charge is computed directly from the minutes.
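A quick check of that single pair (illustrative; a correlation near 1 would mean one of the two columns is redundant as a predictor):
# Correlation between night minutes and night charge
cor(dataSet$Night_Mins, dataSet$Night_Charge)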
set.seed(7) # make the random 75/25 split reproducible
train_test_index <- createDataPartition(dataSet$Churn, p=0.75, list=FALSE)
training_dataset <- dataSet[, c(1:20)][train_test_index,] # keep columns 1:20 only
testing_dataset <- dataSet[, c(1:20)][-train_test_index,]
dim(training_dataset)
dim(testing_dataset)
control <- trainControl(method="repeatedcv", # repeatedcv / adaptive_cv
                        number=2, repeats = 2, # 2-fold CV repeated twice = 4 resamples
                        verbose = TRUE, search = "grid",
                        allowParallel = TRUE) # use the parallel backend registered above
metric <- "Accuracy"
tuneLength = 2
names(getModelInfo())
getModelInfo("pls"); getModelInfo("kknn"); getModelInfo("pda");
# PLS
fit.pls <- caret::train(Churn~., data=training_dataset, method="pls",
metric=metric,
trControl=control,
verbose = TRUE
)
print(fit.pls)
# KKNN
fit.kknn <- caret::train(Churn~., data=training_dataset, method="kknn",
metric=metric,
trControl=control,
verbose = TRUE
)
print(fit.kknn)
# PDA
fit.pda <- caret::train(Churn~., data=training_dataset, method="pda",
metric=metric,
trControl=control,
verbose = TRUE
)
print(fit.pda)
# PLS
fit.pls_preProc <- caret::train(Churn~., data=training_dataset, method="pls",
metric=metric,
trControl=control,
preProc=c("center", "scale"),
verbose = TRUE
)
print(fit.pls_preProc)
# KKNN
fit.kknn_preProc <- caret::train(Churn~., data=training_dataset, method="kknn",
metric=metric,
trControl=control,
preProc=c("center", "scale", "pca"),
verbose = TRUE
)
print(fit.kknn_preProc)
# PDA
fit.pda_preProc <- caret::train(Churn~., data=training_dataset, method="pda",
metric=metric,
trControl=control,
preProc=c("center", "scale", "pca"),
verbose = TRUE
)
print(fit.pda_preProc)
# PLS
fit.pls_automaticGrid <- caret::train(Churn~., data=training_dataset, method="pls",
metric=metric,
trControl=control,
preProc=c("center", "scale"),
tuneLength = tuneLength,
verbose = TRUE
)
print(fit.pls_automaticGrid)
# KKNN
fit.kknn_automaticGrid <- caret::train(Churn~., data=training_dataset, method="kknn",
metric=metric,
trControl=control,
preProc=c("center", "scale", "pca"),
tuneLength = tuneLength,
verbose = TRUE
)
print(fit.kknn_automaticGrid)
# PDA
fit.pda_automaticGrid <- caret::train(Churn~., data=training_dataset, method="pda",
metric=metric,
trControl=control,
preProc=c("center", "scale", "pca"),
tuneLength = tuneLength,
verbose = TRUE
)
print(fit.pda_automaticGrid)
The tuning grid has to be parameterised manually for each particular algorithm:
# PLS
grid <- expand.grid(ncomp=c(seq(from = 1, to = 4, by = 1))) # ncomp must be a whole number of PLS components
fit.pls_manualGrid <- caret::train(Churn~., data=training_dataset, method="pls",
metric=metric,
trControl=control,
preProc=c("center", "scale"),
tuneGrid = grid,
verbose = TRUE
)
print(fit.pls_manualGrid)
plot(fit.pls_manualGrid)
# KKNN
grid <- expand.grid(kmax = c(seq(from = 1, to = 10, by = 1)),
distance = c(seq(from = 1, to = 10, by = 2)),
kernel = c("rectangular", "triangular","epanechnikov")
)
fit.kknn_manualGrid <- caret::train(Churn~., data=training_dataset, method="kknn",
metric=metric,
trControl=control,
preProc=c("center", "scale", "pca"),
tuneGrid = grid,
verbose = TRUE
)
print(fit.kknn_manualGrid)
plot(fit.kknn_manualGrid)
# PDA
grid <- expand.grid(lambda = c(seq(from = 0.1, to = 1.0, by = 0.2)))
fit.pda_manualGrid <- caret::train(Churn~., data=training_dataset, method="pda",
metric=metric,
trControl=control,
preProc=c("center", "scale", "pca"),
tuneGrid = grid,
verbose = TRUE
)
print(fit.pda_manualGrid)
plot(fit.pda_manualGrid)
results <- resamples(list( trained_Model_1 = fit.pls
, trained_Model_2 = fit.kknn
, trained_Model_3 = fit.pda
, trained_Model_4 = fit.pls_preProc
, trained_Model_5 = fit.kknn_preProc
, trained_Model_6 = fit.pda_preProc
, trained_Model_7 = fit.pls_automaticGrid
, trained_Model_8 = fit.kknn_automaticGrid
, trained_Model_9 = fit.pda_automaticGrid
, trained_Model_10 = fit.pls_manualGrid
, trained_Model_11 = fit.kknn_manualGrid
, trained_Model_12 = fit.pda_manualGrid
))
summary(results)
dotplot(results)
bwplot(results)
# Pick the model with the best resampling results above
best_trained_model <- fit.pda_automaticGrid
predictions <- predict(best_trained_model, newdata=testing_dataset)
res_ <- caret::confusionMatrix(table(predictions, testing_dataset$Churn))
print("Results from the BEST trained model ... ...\n");
print(round(res_$overall, digits = 3))
#getwd()
saveRDS(best_trained_model, "./best_trained_model.rds")
# load the model
#getwd()
saved_model <- readRDS("./best_trained_model.rds")
print(saved_model)
# make predictions on "new data" using the final model
# (note: this "new data" includes the training rows, so the estimate is optimistic)
final_predictions <- predict(saved_model, dataSet[1:20])
res_ <- confusionMatrix(table(final_predictions, dataSet$Churn))
cat("Results from the BEST trained model:\n")
print(round(res_$overall, digits = 3))
print(res_$table)
fourfoldplot(res_$table, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "Confusion Matrix")