Abalone Prediction in R

In [8]:
# Abalone

# Background
# Dataset description: http://archive.ics.uci.edu/ml/datasets/Abalone

# 0. Load Libraries
library(AppliedPredictiveModeling)
library(caret)
library(doMC)
library(corrplot)
registerDoMC(cores=4) # parallel backend for caret resampling (doMC is Unix/macOS only)

# 1. Load Dataset
data(abalone)
dataset <- abalone

# 2. Summarize Dataset
dim(dataset)
# list types for each attribute
sapply(dataset, class)
# split input and output
x <- dataset[,1:8]
y <- dataset[,9]
# summarize attribute distributions
summary(dataset)
# summarize correlations between input variables
cor(x[,2:8])

# 3. Visualize Dataset
# a) Univariate
# boxplots for each attribute
par(mfrow=c(3,3))
for(i in 2:9) {
	boxplot(dataset[,i], main=names(dataset)[i])
}
# histograms for each attribute
par(mfrow=c(3,3))
for(i in 2:9) {
	hist(dataset[,i], main=names(dataset)[i])
}
# density plot for each attribute
par(mfrow=c(3,3))
for(i in 2:9) {
	plot(density(dataset[,i]), main=names(dataset)[i])
}

# reset the plotting layout
par(mfrow=c(1,1))
# b) Multivariate
# scatterplot matrix
pairs(dataset)

# correlation plot
correlations <- cor(dataset[,2:9])
corrplot(correlations, method="circle")

# 4. Feature Selection
# a) remove redundant features
# b) remove highly correlated features (see the sketch below)
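
# The correlation matrix above shows many inputs correlated above 0.9
# (e.g. LongestShell vs. Diameter at 0.987). A minimal sketch using caret's
# findCorrelation() with an assumed cutoff of 0.90 (the cutoff is
# illustrative, not part of the original analysis):
highCorr <- findCorrelation(cor(dataset[,2:8]), cutoff=0.90)
print(names(dataset[,2:8])[highCorr])
# to drop them: dataset[, -(highCorr + 1)]  (+1 skips the Type column)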

# 5. Data Transforms

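# No transforms were applied in this analysis. A minimal sketch, assuming
# centering and scaling are wanted: estimate the transform with caret's
# preProcess() and apply it with predict() (or pass
# preProc=c("center", "scale") directly to train()).
preproc <- preProcess(dataset[,2:8], method=c("center", "scale"))
transformed <- predict(preproc, dataset[,2:8])
summary(transformed)
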
# 6. Evaluate Algorithms
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
seed <- 6

# a) linear algorithms
# GLM
set.seed(seed)
fit.glm <- train(Rings~., data=dataset, method="glm", metric=metric, trControl=control)
print(fit.glm)

# lm
set.seed(seed)
fit.lm <- train(Rings~., data=dataset, method="lm", metric=metric, trControl=control)
print(fit.lm)

# b) nonlinear algorithms
# SVM
set.seed(seed)
fit.svm <- train(Rings~., data=dataset, method="svmRadial", metric=metric, trControl=control)
print(fit.svm)

# CART
set.seed(seed)
fit.cart <- train(Rings~., data=dataset, method="rpart", metric=metric, trControl=control)
print(fit.cart)

# kNN
set.seed(seed)
fit.knn <- train(Rings~., data=dataset, method="knn", metric=metric, trControl=control)
print(fit.knn)

# c) advanced algorithms
# Bagging
# (no trControl is given here, so caret falls back to its default
# bootstrap resampling of 25 reps, as the output below reports)
set.seed(seed)
fit.bagging <- train(Rings~., data=dataset, method="treebag", metric=metric)
print(fit.bagging)

# Gradient Boosting
set.seed(seed)
fit.boosting <- train(Rings~., data=dataset, method="gbm", metric=metric, trControl=control, verbose=FALSE)
print(fit.boosting)
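
# d) Compare algorithms
# A sketch using caret's resamples() to compare the models that share the
# repeated-CV scheme (fit.bagging is left out: it used bootstrap
# resampling, so its resamples don't line up with the others).
results <- resamples(list(GLM=fit.glm, LM=fit.lm, SVM=fit.svm,
                          CART=fit.cart, KNN=fit.knn, GBM=fit.boosting))
summary(results)
dotplot(results)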
[1] 4177    9

         Type  LongestShell      Diameter        Height   WholeWeight 
     "factor"     "numeric"     "numeric"     "numeric"     "numeric" 
ShuckedWeight VisceraWeight   ShellWeight         Rings 
    "numeric"     "numeric"     "numeric"     "integer" 

 Type      LongestShell      Diameter          Height        WholeWeight    
 F:1307   Min.   :0.075   Min.   :0.0550   Min.   :0.0000   Min.   :0.0020  
 I:1342   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4415  
 M:1528   Median :0.545   Median :0.4250   Median :0.1400   Median :0.7995  
          Mean   :0.524   Mean   :0.4079   Mean   :0.1395   Mean   :0.8287  
          3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1530  
          Max.   :0.815   Max.   :0.6500   Max.   :1.1300   Max.   :2.8255  
 ShuckedWeight    VisceraWeight     ShellWeight         Rings       
 Min.   :0.0010   Min.   :0.0005   Min.   :0.0015   Min.   : 1.000  
 1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300   1st Qu.: 8.000  
 Median :0.3360   Median :0.1710   Median :0.2340   Median : 9.000  
 Mean   :0.3594   Mean   :0.1806   Mean   :0.2388   Mean   : 9.934  
 3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290   3rd Qu.:11.000  
 Max.   :1.4880   Max.   :0.7600   Max.   :1.0050   Max.   :29.000  

              LongestShell  Diameter    Height WholeWeight ShuckedWeight VisceraWeight ShellWeight
LongestShell     1.0000000 0.9868116 0.8275536   0.9252612     0.8979137     0.9030177   0.8977056
Diameter         0.9868116 1.0000000 0.8336837   0.9254521     0.8931625     0.8997244   0.9053298
Height           0.8275536 0.8336837 1.0000000   0.8192208     0.7749723     0.7983193   0.8173380
WholeWeight      0.9252612 0.9254521 0.8192208   1.0000000     0.9694055     0.9663751   0.9553554
ShuckedWeight    0.8979137 0.8931625 0.7749723   0.9694055     1.0000000     0.9319613   0.8826171
VisceraWeight    0.9030177 0.8997244 0.7983193   0.9663751     0.9319613     1.0000000   0.9076563
ShellWeight      0.8977056 0.9053298 0.8173380   0.9553554     0.8826171     0.9076563   1.0000000

Generalized Linear Model 

4177 samples
   8 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 3759, 3759, 3759, 3759, 3760, 3760, ... 
Resampling results:

  RMSE      Rsquared  MAE     
  2.213138  0.530665  1.585958

Linear Regression 

4177 samples
   8 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 3759, 3759, 3759, 3759, 3760, 3760, ... 
Resampling results:

  RMSE      Rsquared  MAE     
  2.213138  0.530665  1.585958

Tuning parameter 'intercept' was held constant at a value of TRUE

Support Vector Machines with Radial Basis Function Kernel 

4177 samples
   8 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 3759, 3759, 3759, 3759, 3760, 3760, ... 
Resampling results across tuning parameters:

  C     RMSE      Rsquared   MAE     
  0.25  2.181121  0.5634193  1.482162
  0.50  2.153219  0.5696081  1.470037
  1.00  2.142313  0.5711541  1.466745

Tuning parameter 'sigma' was held constant at a value of 0.259324
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.259324 and C = 1.
Warning message in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
“There were missing values in resampled performance measures.”

CART 

4177 samples
   8 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 3759, 3759, 3759, 3759, 3760, 3760, ... 
Resampling results across tuning parameters:

  cp          RMSE      Rsquared   MAE     
  0.03890424  2.626421  0.3370765  1.949066
  0.05432313  2.701929  0.2974648  2.004954
  0.28217437  2.996754  0.2543435  2.205786

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.03890424.

k-Nearest Neighbors 

4177 samples
   8 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 3759, 3759, 3759, 3759, 3760, 3760, ... 
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE     
  5  2.239805  0.5211464  1.575896
  7  2.197834  0.5379101  1.534888
  9  2.188602  0.5420856  1.526652

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 9.

Bagged CART 

4177 samples
   8 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 4177, 4177, 4177, 4177, 4177, 4177, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  2.272856  0.4995728  1.622254

Stochastic Gradient Boosting 

4177 samples
   8 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 3759, 3759, 3759, 3759, 3760, 3760, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE      Rsquared   MAE     
  1                   50      2.438257  0.4390650  1.760440
  1                  100      2.342626  0.4810819  1.686715
  1                  150      2.288431  0.5026718  1.646601
  2                   50      2.280865  0.5085633  1.636033
  2                  100      2.186541  0.5425262  1.552927
  2                  150      2.166855  0.5492125  1.533952
  3                   50      2.218790  0.5317232  1.578110
  3                  100      2.166761  0.5493806  1.531263
  3                  150      2.156756  0.5534389  1.521503

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 150, interaction.depth =
 3, shrinkage = 0.1 and n.minobsinnode = 10.