--- title: "Math 189 Homework 5" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ### Hien Bui, Cole Clisby, Cheolmin Hwang, Andrew Li, James Yu ## Intro In this report, we utilize an automobile dataset to create a predictive model that determines if a given car’s gas mileage is high or low, relative to the sample median. We accomplish this by first creating a new binary feature, then by visualizing the relationship it has with other variables. Lastly, we split the dataset into training and test data and perform Linear Discriminant Analysis (LDA) to create our model and test its accuracy. ## Body ### Data ```{r} library(ISLR) auto <- ISLR::Auto head(auto) ``` Source: The Auto dataset is presented in An Introduction to Statistical Learning with applications in R by James, Witten, Hastie, and Tibshirani (2013). URL: https://rdrr.io/cran/ISLR/man/Auto.html Description: Contains 392 observations of automobiles. There are 9 variables containing information on the specifications and name of each car. 1. mpg: in miles per gallon 2. cylinders: in number of cylinders the engine has 3. displacement: engine displacement in cubic inches 4. horsepower: in horsepower 5. weight: in pounds 6. acceleration: in number of seconds to accelerate from 0 to 60mph 7. year: in last two digits of model year (e.g. 1970 -> 70) 8. origin: 1 if American, 2 if European, 3 if Japanese 9. name: vehicle name ### Task 1 ```{r} mpg_med <- median(auto$mpg) mpg_med mgp01 <- c() for (val in auto$mpg) { if (val < mpg_med) { mgp01 <- append(mgp01, 0) } else if (val > mpg_med) { mgp01 <- append(mgp01, 1) } } auto <- cbind(auto, mgp01) head(auto) ``` The median is an average of the two middlemost elements because we have an even number of observations. So no values will be exactly equal to the median. ### Task 2 ```{r} boxplot(auto$displacement ~ auto$mgp01, data=auto, xlab='mgp01', ylab='Displacement') title('Boxplots of Displacement by mgp01 class') ``` ```{r} boxplot(auto$horsepower ~ auto$mgp01, data=auto, xlab='mgp01', ylab='Horsepower') title('Boxplots of Horsepower by mgp01 class') ``` ```{r} boxplot(auto$weight ~ auto$mgp01, data=auto, xlab='mgp01', ylab='Weight') title('Boxplots of Weight by mgp01 class') ``` ```{r} boxplot(auto$acceleration ~ auto$mgp01, data=auto, xlab='mgp01', ylab='Acceleration') title('Boxplots of Acceleration by mgp01 class') ``` ```{r} boxplot(auto$year ~ auto$mgp01, data=auto, xlab='mgp01', ylab='Year') title('Boxplots of Year by mgp01 class') ``` When mgp01 is 1 (mpg above the median), the boxplots for displacement, horsepower, and weight are lower than the boxplots for those variables when mgp01 is 0. In contrast, the year and acceleration are higher for when mgp01 is 1, compared to when mgp01 is 0. This makes sense since cars with a smaller engine displacement, less horsepower and weight will be more efficient and have a higher miles per gallon, since the car is less powerful and weighs less. So we can get more miles per gallon as the car doesn’t have to drive around the extra components that make it faster. It also makes sense that the year on average would be higher (i.e. manufactured later) since cars can get more efficient over time as new manufacturing processes and optimizations can be made. The boxplots for acceleration between the two groups of mgp01 are pretty similar, but the 1 group is slightly higher and has some high value outliers, meaning that it takes longer to accelerate from 0 to 60mph, which matches with the other information we visualized and discussed above. We can still visualize the relationship between mgp01 and some of the categorical variables like cylinders and origin, even though we won't be able to use them in our LDA. ```{r} require(gridExtra) library(ggplot2) temp_auto <- auto temp_auto$mgp01 <- as.factor(temp_auto$mgp01) gp1 <- ggplot(temp_auto, aes(x=origin, fill=mgp01, color=mgp01)) + geom_histogram(position='dodge', bins=6) + ggtitle('Histogram of Origin by mgp01 class') gp2 <- ggplot(temp_auto, aes(x=cylinders, fill=mgp01, color=mgp01)) + geom_histogram(position='dodge', bins=6) + ggtitle('Histogram of Cylinders by mgp01 class') grid.arrange(gp1, gp2, ncol=2) ``` We can see that the 0 class is more likely to have origin 1, which is American made. The 1 class is more likely to have origins 2 or 3, which are European or Japanese made. This shows how the American made cars in the sample are less fuel efficient, since more of them are in the 0 class, meaning their miles per gallon is below the median. The 1 class is more likely to have 4 cylinders, with the 0 class having a much higher proportion of 6 and 8 cylinder engines. This indicates that the more cylinders the engine has, the less efficient it can be, causing the miles per gallon to decrease. ### Task 3 Ultimately, we chose to use displacement, horsepower, weight, and year since they showed the strongest relationship to mgp01 in our visualizations. We didn't use any of the categorical/ordinal variables like origin, cylinders, or name since we are performing LDA and it assumes that the variables used will be continuous. ```{r} auto.train <- subset(auto[sample(nrow(auto), 300), ], select=c('displacement', 'horsepower', 'weight', 'year', 'mgp01')) auto.test <- subset(auto[sample(nrow(auto), 92), ], select=c('displacement', 'horsepower', 'weight', 'year', 'mgp01')) nrow(auto.train) nrow(auto.test) ``` We split the data into a training set of size 300 and test set of size 92, using random sampling. ### Task 4 Assumptions made: 1. The data from each group has a common mean vector that is equal to the population mean vector 2. Homoskedasticity: the data from all groups has a common covariance matrix that does not depend on the group 3. Independence: the observations are independently sampled 4. Normality: the data are multivariate normally distributed ```{r} # Prior probabilities - based on relative sample size in training data n_0 <- sum(auto.train$mgp01 == 0) n_1 <- sum(auto.train$mgp01 == 1) p_0 <- n_0 / 300 p_1 <- n_1 / 300 p_0 p_1 ``` We chose to use the training data prior probabilities since we believe that the relative sample sizes in the training data are close to the relative population sizes. ```{r} # Sample mean vectors for each group num_feat <- 4 mean_0 <- colMeans(auto.train[auto.train$mgp01 == 0, 1:num_feat]) mean_1 <- colMeans(auto.train[auto.train$mgp01 == 1, 1:num_feat]) mean_0 mean_1 # Pooled sample covariance for each group S_0 <- cov(auto.train[auto.train$mgp01 == 0, 1:num_feat]) S_1 <- cov(auto.train[auto.train$mgp01 == 1, 1:num_feat]) S_pooled <- ((n_0-1)*S_0 + (n_1-1)*S_1) / (n_0 + n_1 - 2) S_pooled # Intercepts of Linear Discriminant Function S_inv <- solve(S_pooled) alpha_0 <- -0.5 * t(mean_0) %*% S_inv %*% mean_0 + log(p_0) alpha_1 <- -0.5 * t(mean_1) %*% S_inv %*% mean_1 + log(p_1) alpha_auto <- c(alpha_0, alpha_1) alpha_auto # Slope intercepts of LD function beta_0 <- S_inv %*% mean_0 beta_1 <- S_inv %*% mean_1 beta_auto <- cbind(beta_0, beta_1) beta_auto ``` ### Task 5 ```{r} prediction <- c() d_0_vec <- c() d_1_vec <- c() label <- c(0, 1) for (i in 1:nrow(auto.test)) { # Gets the features for individual test observation x <- t(auto.test[i, 1:num_feat]) # Calculates LD functions for each group d_0 <- alpha_0 + t(beta_0) %*% x d_1 <- alpha_1 + t(beta_1) %*% x # Classify the observation to the class w/ highest val d_vec <- c(d_0, d_1) prediction <- append(prediction, label[which.max(d_vec)]) d_0_vec <- append(d_0_vec, d_0) d_1_vec <- append(d_1_vec, d_1) } auto.test$prediction <- prediction # Calculating accuracy for each class zero_true <- sum((auto.test$mgp01 == 0) & (auto.test$prediction == 0)) one_true <- sum((auto.test$mgp01 == 1) & (auto.test$prediction == 1)) # Creates table of results n_0_full <- sum(auto$mgp01 == 0) n_1_full <- sum(auto$mgp01 == 1) class_tab <- c(sum(auto.test$mgp01==0), sum(auto.test$mgp01==1)) class_tab <- rbind(class_tab, c(zero_true, one_true)) class_tab <- rbind(class_tab, class_tab[1,] - class_tab[2,]) colnames(class_tab) <- c('Class 0', 'Class 1') rownames(class_tab) <- c('Num Observations', 'Num Correct', 'Num Wrong') class_tab acc <- (zero_true + one_true) / nrow(auto.test) acc ``` We get around 0.85 to 0.90 accuracy on the test set. Because our test set size is only 92 and we are using a random sample to get our training and test sets, the accuracy can vary somewhat. However, generally the LDA classification performs well. Since we have two classes, if we randomly guess the class each time we would get roughly 0.5 accuracy. So our LDA classification is much better than chance. ## Conclusion We were able to perform LDA on the Auto dataset in order to classify if a given car was above or below the median miles per gallon. We created the binary variable mgp01 and graphed its relationship with the other variables in order to determine which of the variables would be useful in our LDA. We chose to use displacement, horsepower, weight, and year since they looked to have the strongest relationship with mgp01. We used random sampling to create a training and test set, and performed LDA on the training set. Using our test set, we got roughly 85-90% accuracy, with some variance due to our sets being random samples.