-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
5 changed files
with
482 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,127 @@ | ||
--- | ||
title: "Practical Machine Learning Project" | ||
author: "Marcus A. Streips" | ||
date: "February 13, 2015" | ||
output: html_document | ||
--- | ||
|
||
##Background## | ||
Six participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways (A,B,C,D and E). Accelerometer measurements were taken on the belt, forearm, arm, and dumbell. The goal of this project is to use the data gathered from the experiment to predict the manner in which the six participants did the exercise. A random forest model is proposed, tested and used to predict the activities using a testing data set with the activity ("classe" variable) redacted. The results are presented. | ||
|
||
More information is available from the website <http://groupware.les.inf.puc-rio.br/har>. | ||
|
||
###Load libraries, set seed, and activate parallel processing### | ||
```{r, warning=FALSE, message=FALSE} | ||
library(caret) | ||
library(RANN) | ||
library(dplyr) | ||
setwd("~/Google Drive/Coursera/Machine Learning/Project") | ||
set.seed(383838) | ||
#parallel processing | ||
library(doMC) | ||
registerDoMC(cores = 4) | ||
``` | ||
|
||
###Import Testing and Training Data### | ||
```{r, cache=TRUE, warning=FALSE, message=FALSE} | ||
training <- read.csv("pml-training.csv", header=TRUE, | ||
na.strings=c("", " ","#DIV/0!")) | ||
testing <- read.csv("pml-testing.csv", header=TRUE, na.strings=c("", " ", "#DIV/0!")) | ||
``` | ||
|
||
##Pre-Processing## | ||
An extensive effort was made to preprocess the data to optimize the model to include: | ||
|
||
* Imputation | ||
* Eliminating Near Zero Value Variables | ||
* PCA | ||
* Removing Highly Correlated Predictors | ||
* Standardizing Data with Center and Scale | ||
|
||
It was determined that these pre-processing steps were unneccesary to develop a highly accurate model and so they are not presented here. | ||
|
||
Data that was not continuous was removed as well as all variables lacking data in the testing data set. | ||
|
||
```{r, cache=TRUE, warning=FALSE, message=FALSE} | ||
#Remove Unnessary fields and fields with all NAs | ||
new_names <- sapply(testing[,1:160], mean) | ||
na_fields <- as.data.frame(new_names, row.names=NULL) | ||
na_fields$names <- rownames(na_fields) | ||
fields <- filter(na_fields, new_names != "NA") | ||
new_fields <- as.vector(fields[,2]) | ||
new_fields_sm <- new_fields[5:56] | ||
new_testing <- testing[new_fields_sm] | ||
new_training <- training[new_fields_sm] | ||
#format all data as numeric | ||
new_testing <- as.data.frame(lapply(new_testing, as.numeric)) | ||
new_training <- as.data.frame(lapply(new_training, as.numeric)) | ||
#add back classe to training | ||
new_training$classe <- training$classe | ||
``` | ||
|
||
##Partitioning## | ||
The training data was partitioned so it could be used to build a prediction model. | ||
```{r, cache=TRUE} | ||
inTrain <- createDataPartition(y=new_training$classe, p=0.75, list=FALSE) | ||
train <- new_training[inTrain,] | ||
test <- new_training[-inTrain,] | ||
``` | ||
|
||
##Cross-Validation## | ||
A k-fold cross validation with k=3 was chosen over bootstrapping because it was less computationally demanding. Three folds were sufficient do cross validate a highly accurate rf model. A smaller k has more bias, but less variance. | ||
```{r, cache=TRUE} | ||
#crossValidation | ||
train_control <- trainControl(method="cv", number=3) | ||
``` | ||
|
||
##Training the Model## | ||
A random forest was chosen for the model because it is one of the most accurate models in Kaggle competitions and incorporates its own internal cross-validation. It performed better than the following models which are not presented here: | ||
|
||
* rpart | ||
* naive bayes | ||
* gradient boosing machine | ||
|
||
```{r, cache=TRUE, message=FALSE} | ||
# train the model | ||
model <- train(classe~., data=train, trControl=train_control, method="rf") | ||
``` | ||
|
||
##Predict from the Model## | ||
The random forest model is used to run predictions on the training partition and testing partition. Using the `confusionMatrix` function from the `caret` package we are able to determine our model's accuracy. | ||
```{r, cache=TRUE} | ||
#predict on training partion | ||
predict <- predict(model,train) | ||
confusionMatrix(predict, train$classe) | ||
#predict on testing partition | ||
predict2 <- predict(model, test) | ||
confusionMatrix(predict2, test$classe) | ||
``` | ||
|
||
##Expectation for Out-of-Sample Error## | ||
```{r} | ||
#error rate | ||
missClass = function(values, prediction) { | ||
sum(prediction != values)/length(values) | ||
} | ||
errRate = missClass(test$classe, predict2) #same as 1-0.9929 | ||
errRate | ||
``` | ||
|
||
##Estimating Error with Cross-Validation## | ||
Using the K(3)-folds cross validation we see the resampling results across the tuning parameters and the corresponding accuracies, the inverse of which would be the errors. | ||
```{r} | ||
model | ||
``` | ||
|
||
##Analyzing Model## | ||
Using the `varImp` function of the `caret` package we can review our model to determine which variable were the most important in making our predicitions with an accuracy of 99.29%. | ||
```{r, fig.height=8, message=FALSE} | ||
#plotting importance of variables | ||
varImpObj <- varImp(model) | ||
plot(varImpObj) | ||
``` |
Large diffs are not rendered by default.
Oops, something went wrong.
Binary file added
BIN
+240 KB
Machine_Learning/Project_streips_files/figure-html/unnamed-chunk-10-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+57.8 KB
Machine_Learning/Project_streips_files/figure-html/unnamed-chunk-3-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
library(caret) | ||
library(RANN) | ||
library(dplyr) | ||
setwd("~/Google Drive/Coursera/Machine Learning/Project") | ||
set.seed(383838) | ||
|
||
#parallel processing | ||
library(doMC) | ||
registerDoMC(cores = 4) | ||
|
||
#import data | ||
training <- read.csv("pml-training.csv", header=TRUE, | ||
na.strings=c("", " ","#DIV/0!")) #not sure about stringasfactors | ||
testing <- read.csv("pml-testing.csv", header=TRUE, na.strings=c("", " ", "#DIV/0!")) | ||
|
||
#Remove Unnessary fields and fields with all NAs | ||
new_names <- sapply(testing[,1:160], mean) | ||
na_fields <- as.data.frame(new_names, row.names=NULL) | ||
na_fields$names <- rownames(na_fields) | ||
fields <- filter(na_fields, new_names != "NA") | ||
new_fields <- as.vector(fields[,2]) | ||
new_fields_sm <- new_fields[5:56] | ||
|
||
new_testing <- testing[new_fields_sm] | ||
new_training <- training[new_fields_sm] | ||
|
||
#new transformation | ||
new_testing <- as.data.frame(lapply(new_testing, as.numeric)) | ||
new_training <- as.data.frame(lapply(new_training, as.numeric)) | ||
|
||
table(is.na(new_training)) #no values are NA | ||
|
||
#add back classe to training | ||
new_training$classe <- training$classe | ||
|
||
# partition data | ||
inTrain <- createDataPartition(y=new_training$classe, p=0.75, list=FALSE) | ||
train <- new_training[inTrain,] | ||
test <- new_training[-inTrain,] | ||
|
||
#crossValidation | ||
train_control <- trainControl(method="cv", number=3) | ||
|
||
# train the model | ||
model <- train(classe~., data=train, trControl=train_control, method="rf") | ||
|
||
#predict from model | ||
predict <- predict(model,train) | ||
confusionMatrix(predict, train$classe) | ||
|
||
predict2 <- predict(model, test) | ||
confusionMatrix(predict2, test$classe) | ||
|
||
#plotting importance of variables | ||
varImpObj <- varImp(model) | ||
plot(varImpObj) | ||
|
||
#error rate | ||
missClass = function(values, prediction) { | ||
sum(prediction != values)/length(values) | ||
} | ||
errRate = missClass(test$classe, predict2) #same as 1-accuracy | ||
|
||
#submission code | ||
pml_write_files = function(x){ | ||
n = length(x) | ||
for(i in 1:n){ | ||
filename = paste0("problem_id_",i,".txt") | ||
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE) | ||
} | ||
} | ||
|
||
x <- new_testing | ||
answers <- predict(model, newdata=x) | ||
answers | ||
|
||
pml_write_files(answers) |