To begin, we register 3 of the machine's 4 virtual cores for parallel processing and load the data, converting strings to factors.
set.seed(12345)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoParallel(cores=3)  # use 3 of the 4 virtual cores
dat <- read.table("/SAFE/PII2/EDU_COURSES/MachineLearning/pml-training.csv",
                  sep=",", header=TRUE, row.names=1, stringsAsFactors=TRUE)
We now partition the data, using 70% for training and the rest for validation. (We initially built models with only 1% and then 10% of the data, to iterate quickly while developing the model.)
We also use only a subset of 32 of the columns; the other predictors had too many NA or invalid values, and it was unclear how well we could impute them. We identified suitable columns by examining summary() output, along with some trial and error (a screening sketch follows the code below).
trainIndex <- createDataPartition(y=dat$classe, times=1, p=0.7, list=FALSE)
trainDat <- dat[trainIndex,]
# keep classe plus the magnet/accel/yaw/total predictor groups (32 predictors)
trainSub <- trainDat[,grep('^classe|^magnet|^accel|^yaw|^total',names(trainDat))]
validDat <- dat[-trainIndex,]
validSub <- validDat[,grep('^classe|^magnet|^accel|^yaw|^total',names(validDat))]
testDat <- read.table("/SAFE/PII2/EDU_COURSES/MachineLearning/pml-testing.csv",
                      sep=",", header=TRUE, row.names=1)
# the test set has no classe column, so the same pattern keeps just the predictors
testSub <- testDat[,grep('^classe|^magnet|^accel|^yaw|^total',names(testDat))]
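A quick programmatic check can flag the NA-heavy columns. This is a minimal sketch of such a screen, not the exact procedure we followed (our actual selection relied on summary() and trial and error):
naFrac <- colMeans(is.na(dat))   # fraction of NA values in each column
table(naFrac > 0.9)              # how many columns are almost entirely NA
names(dat)[naFrac == 0][1:10]    # a peek at the fully populated columns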
We then tried multiple model types, each evaluated with 10-fold cross-validation, and compared their estimated accuracies.
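Such a comparison can be run with the same trainControl settings for every candidate. The rpart and lda methods below are illustrative stand-ins, since the exact alternatives we compared are not reproduced here:
ctrl <- trainControl(method="cv", number=10)
# illustrative candidates: a single decision tree and linear discriminant analysis
modTree <- train(classe ~ ., method="rpart", data=trainSub, trControl=ctrl)
modLda <- train(classe ~ ., method="lda", data=trainSub, trControl=ctrl)
max(modTree$results$Accuracy)   # best cross-validated accuracy per candidate
max(modLda$results$Accuracy)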
Random forests produced better accuracy than any of the alternatives, so we selected a random forest for our final model, caching it to disk to avoid retraining on every run:
fn <- "/SAFE/PII2/EDU_COURSES/MachineLearning/project_mod.rds"
if (file.exists(fn)) {
  mod <- readRDS(fn)   # reuse the previously trained model
} else {
  mod <- train(classe ~ ., method="rf", data=trainSub,
               trControl=trainControl(method="cv", number=10))
  saveRDS(mod, file=fn)   # cache the fitted model for later runs
}
Inspecting the cross-validation results:
mod$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.9827475 0.9781712 0.002971340 0.003761625
## 2 17 0.9839854 0.9797349 0.003394459 0.004297851
## 3 32 0.9810732 0.9760508 0.003413001 0.004321374
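Accuracy peaks at mtry = 17, at roughly 98.4% with a standard deviation of about 0.3 percentage points across folds. The tuning value caret selected can be confirmed directly:
mod$bestTune   # the mtry value caret chose (here, 17)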
Evaluating confusion matrices on the training and validation sets:
confusionMatrix( trainSub$classe, predict(mod,trainSub))
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3906 0 0 0 0
## B 0 2658 0 0 0
## C 0 0 2396 0 0
## D 0 0 0 2252 0
## E 0 0 0 0 2525
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
confusionMatrix( validSub$classe, predict(mod,validSub))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 1 0 1 1
## B 15 1115 7 2 0
## C 4 16 1003 3 0
## D 0 0 21 942 1
## E 1 1 5 4 1071
##
## Overall Statistics
##
## Accuracy : 0.9859
## 95% CI : (0.9825, 0.9888)
## No Information Rate : 0.2873
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9822
## Mcnemar's Test P-Value : 3.195e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9882 0.9841 0.9681 0.9895 0.9981
## Specificity 0.9993 0.9949 0.9953 0.9955 0.9977
## Pos Pred Value 0.9982 0.9789 0.9776 0.9772 0.9898
## Neg Pred Value 0.9953 0.9962 0.9932 0.9980 0.9996
## Prevalence 0.2873 0.1925 0.1760 0.1618 0.1823
## Detection Rate 0.2839 0.1895 0.1704 0.1601 0.1820
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9937 0.9895 0.9817 0.9925 0.9979
The model fits the training set perfectly, which on its own says little; the held-out validation set gives a more realistic accuracy of 0.9859, i.e. an estimated out-of-sample error of about 1.4%.
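That error estimate can also be extracted programmatically from the validation confusion matrix:
cmValid <- confusionMatrix(validSub$classe, predict(mod, validSub))
1 - as.numeric(cmValid$overall["Accuracy"])   # estimated out-of-sample error, ~0.014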
Finally, we predict classe for the 20 test cases using the model:
predict(mod,testSub)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E