How to set up a Machine Learning Classification problem in R

In [3]:
# -----------------------------------------------------------------
# How to set up a Machine Learning Classification problem in R
# -----------------------------------------------------------------
# load libraries
library(mlbench)
library(caret)

# load data
data(PimaIndiansDiabetes)

# rename dataset to keep code below generic
dataset <- PimaIndiansDiabetes
dim(dataset)
sapply(dataset, class)

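# Split the dataset into training (67%) and testing (33%) sets;
# createDataPartition() samples within each class, so the split is stratified by outcome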
train_test_index <- createDataPartition(dataset$diabetes, p=0.67, list=FALSE)
training_dataset <- dataset[train_test_index,]
testing_dataset <- dataset[-train_test_index,]

# set up resampling and control parameters: 10-fold cross-validation, repeated 10 times
control <- trainControl(method="repeatedcv", number=10, repeats=10, verboseIter=FALSE, search="grid")
metric <- "Accuracy"

# Training process 
# Fit / train a Linear Discriminant Analysis model to the training dataset
fit.lda <- caret::train(diabetes~., data=training_dataset, method="lda", metric=metric, 
                        preProc=c("center", "scale"), trControl=control)

# Fit / train a Logistic Regression model to the training dataset
fit.glm <- caret::train(diabetes~., data=training_dataset, method="glm", metric=metric, 
                        preProc=c("center", "scale"), trControl=control)

# collect the results of trained models
results <- resamples(list(LDA = fit.lda, GLM = fit.glm))

# Summarize the fitted models
summary(results)

# Plot and rank the fitted models
dotplot(results)
bwplot(results)

# Test the selected model on the held-out testing dataset
# (LDA is used here; its mean resampled accuracy was marginally higher than GLM's)
predictions_LDA <- predict(fit.lda, newdata=testing_dataset)

# Evaluate the BEST trained model and print results
res_  <- caret::confusionMatrix(predictions_LDA, testing_dataset$diabetes)

# cat() interprets "\n" as a newline; print() would show it literally
cat("Results from the BEST trained model ...\n")
print(round(res_$overall, digits = 3))

[1] 768   9
 pregnant   glucose  pressure   triceps   insulin      mass  pedigree       age 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 diabetes 
 "factor" 

Call:
summary.resamples(object = results)

Models: LDA, GLM 
Number of resamples: 100 

Accuracy 
         Min.   1st Qu.   Median      Mean   3rd Qu.      Max. NA's
LDA 0.6078431 0.7115385 0.747549 0.7481523 0.7884615 0.9019608    0
GLM 0.6153846 0.7254902 0.750000 0.7472210 0.7843137 0.8627451    0

Kappa 
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
LDA 0.07103825 0.3286609 0.4227761 0.4158794 0.5034014 0.7769029    0
GLM 0.10344828 0.3175837 0.4090909 0.4127461 0.4881979 0.6956522    0

Results from the BEST trained model ...
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
         0.814          0.566          0.761          0.860          0.652 
AccuracyPValue  McnemarPValue 
         0.000          0.004 
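
The overall statistics above summarize test-set performance in a single accuracy figure, and AccuracyNull (0.652) is the accuracy obtained by always predicting the majority class. For a diagnostic dataset it can also help to look at per-class behaviour. The following is a minimal sketch, not part of the original run: the seed, the 5-fold resampling, and the variable names (idx, train_set, test_set, ctrl, fit, pred, cm, nir) are illustrative assumptions, so its numbers will not match the output above. It shows the confusion table plus sensitivity and specificity with "pos" set as the class of interest, since confusionMatrix() otherwise treats the first factor level ("neg" in this dataset) as the positive class.

```r
# Sketch only: the seed, the 5-fold CV setting, and the variable names
# below are illustrative assumptions, not the settings used above.
library(mlbench)
library(caret)

data(PimaIndiansDiabetes)
dataset <- PimaIndiansDiabetes

set.seed(42)  # assumed seed, so the split and the folds are repeatable
idx <- createDataPartition(dataset$diabetes, p = 0.67, list = FALSE)
train_set <- dataset[idx, ]
test_set  <- dataset[-idx, ]

# A lighter resampling scheme than the one above, to keep the example quick
ctrl <- trainControl(method = "cv", number = 5)
fit <- train(diabetes ~ ., data = train_set, method = "lda",
             preProcess = c("center", "scale"), trControl = ctrl)

pred <- predict(fit, newdata = test_set)

# Treat "pos" (diabetic) as the class of interest; by default
# confusionMatrix() uses the first factor level, which is "neg" here
cm <- confusionMatrix(pred, test_set$diabetes, positive = "pos")

print(cm$table)  # predicted classes (rows) vs. reference classes (columns)
print(round(cm$byClass[c("Sensitivity", "Specificity")], 3))

# No-information rate: accuracy of always predicting the majority class
nir <- max(prop.table(table(test_set$diabetes)))
print(round(nir, 3))
```

With positive = "pos", Sensitivity is the share of diabetic cases the model catches and Specificity is the share of non-diabetic cases it correctly clears; nir should agree with the AccuracyNull entry of cm$overall.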