To begin, we register 3 of the machine's 4 virtual cores for parallel processing and load the data, converting strings to factors.
set.seed(12345)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoParallel(cores=3)  # use 3 of the 4 virtual cores
dat <- read.table("/SAFE/PII2/EDU_COURSES/MachineLearning/pml-training.csv",
                  sep=",", header=TRUE, row.names=1, stringsAsFactors=TRUE)
We now partition the data, using 70% for training and the rest for validation. (We initially built models with only 1% and then 10% of the data, to iterate quickly while developing the model.)
We also use only a subset of 32 of the columns; the other predictors had too many NA or invalid values, and it was unclear how well we could impute them. We identified suitable columns by examining summary() output, along with some trial and error (a screening sketch follows the code below).
trainIndex <- createDataPartition(y=dat$classe, times=1, p=0.7, list=FALSE)
trainDat <- dat[trainIndex,]
# keep classe plus the magnet/accel/yaw/total predictor groups (32 predictors)
trainSub <- trainDat[,grep('^classe|^magnet|^accel|^yaw|^total',names(trainDat))]
validDat <- dat[-trainIndex,]
validSub <- validDat[,grep('^classe|^magnet|^accel|^yaw|^total',names(validDat))]
testDat <- read.table("/SAFE/PII2/EDU_COURSES/MachineLearning/pml-testing.csv",
                      sep=",", header=TRUE, row.names=1)
# the test set has no classe column, so the same pattern keeps just the predictors
testSub <- testDat[,grep('^classe|^magnet|^accel|^yaw|^total',names(testDat))]
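A quick programmatic check can flag the NA-heavy columns. This is a minimal sketch of such a screen, not the exact procedure we followed (our actual selection relied on summary() and trial and error):
naFrac <- colMeans(is.na(dat))   # fraction of NA values in each column
table(naFrac > 0.9)              # how many columns are almost entirely NA
names(dat)[naFrac == 0][1:10]    # a peek at the fully populated columns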
We then tried multiple model types, each evaluated with 10-fold cross-validation, and compared their estimated accuracies.
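Such a comparison can be run with the same trainControl settings for every candidate. The rpart and lda methods below are illustrative stand-ins, since the exact alternatives we compared are not reproduced here:
ctrl <- trainControl(method="cv", number=10)
# illustrative candidates: a single decision tree and linear discriminant analysis
modTree <- train(classe ~ ., method="rpart", data=trainSub, trControl=ctrl)
modLda <- train(classe ~ ., method="lda", data=trainSub, trControl=ctrl)
max(modTree$results$Accuracy)   # best cross-validated accuracy per candidate
max(modLda$results$Accuracy)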
Random forests produced better accuracy than any of the alternatives, so we selected a random forest for our final model, caching it to disk to avoid retraining on every run:
fn <- "/SAFE/PII2/EDU_COURSES/MachineLearning/project_mod.rds"
if (file.exists(fn)) {
  mod <- readRDS(fn)   # reuse the previously trained model
} else {
  mod <- train(classe ~ ., method="rf", data=trainSub,
               trControl=trainControl(method="cv", number=10))
  saveRDS(mod, file=fn)   # cache the fitted model for later runs
}
Inspecting the cross-validation results:
mod$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.9827475 0.9781712 0.002971340 0.003761625
## 2 17 0.9839854 0.9797349 0.003394459 0.004297851
## 3 32 0.9810732 0.9760508 0.003413001 0.004321374
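Accuracy peaks at mtry = 17, at roughly 98.4% with a standard deviation of about 0.3 percentage points across folds. The tuning value caret selected can be confirmed directly:
mod$bestTune   # the mtry value caret chose (here, 17)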
Evaluating confusion matrices on the training and validation sets:
confusionMatrix( trainSub$classe, predict(mod,trainSub))
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3906 0 0 0 0
## B 0 2658 0 0 0
## C 0 0 2396 0 0
## D 0 0 0 2252 0
## E 0 0 0 0 2525
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
confusionMatrix( validSub$classe, predict(mod,validSub))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 1 0 1 1
## B 15 1115 7 2 0
## C 4 16 1003 3 0
## D 0 0 21 942 1
## E 1 1 5 4 1071
##
## Overall Statistics
##
## Accuracy : 0.9859
## 95% CI : (0.9825, 0.9888)
## No Information Rate : 0.2873
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9822
## Mcnemar's Test P-Value : 3.195e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9882 0.9841 0.9681 0.9895 0.9981
## Specificity 0.9993 0.9949 0.9953 0.9955 0.9977
## Pos Pred Value 0.9982 0.9789 0.9776 0.9772 0.9898
## Neg Pred Value 0.9953 0.9962 0.9932 0.9980 0.9996
## Prevalence 0.2873 0.1925 0.1760 0.1618 0.1823
## Detection Rate 0.2839 0.1895 0.1704 0.1601 0.1820
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9937 0.9895 0.9817 0.9925 0.9979
The model fits the training set perfectly, which on its own says little; the held-out validation set gives a more realistic accuracy of 0.9859, i.e. an estimated out-of-sample error of about 1.4%.
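That error estimate can also be extracted programmatically from the validation confusion matrix:
cmValid <- confusionMatrix(validSub$classe, predict(mod, validSub))
1 - as.numeric(cmValid$overall["Accuracy"])   # estimated out-of-sample error, ~0.014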
Finally, we predict classe for the 20 test cases using the model:
predict(mod,testSub)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E