Install these packages first:
- caret (wrapper for various ML algorithms)
- tidyverse (meta package)
- ranger (for running the Random Forest method)
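
The packages listed above can be installed in one call. This is a minimal sketch, assuming a CRAN mirror is reachable; the mirror URL and the "install only what is missing" check are choices made for illustration.

```r
# Packages named in the list above.
pkgs <- c("caret", "tidyverse", "ranger")

# Find the ones that are not installed yet (requireNamespace returns FALSE for those).
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install only the missing ones; the repos value is an assumed CRAN mirror.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```
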
Credit goes to Max Kuhn and Hadley Wickham, without whom the world of ML and data science in R wouldn't be the way it is now.
- Check out Max Kuhn's tutorial on caret (The caret Package).
- Check out Hadley's book, R for Data Science (R4DS).
There are three major approaches to using caret. The steps below walk through a typical workflow:
- Split the data into train and test datasets. An 80-20 or 75-25 split is usually good enough. If there is a class imbalance, a different splitting workflow applies. Read Max's tutorial to find out more.
- If there are NA values in the datasets, impute them. For categorical values, it may be a good idea to impute the missing values with the mode. For numeric values, there are a few options: median imputation, kNN imputation, etc. Read Max's tutorial for more details.
- After imputation, it is a good idea to encode your categorical values as 1 and 0, because most models expect the data in a numerical format. Again, refer to Max's tutorial for more details.
- After encoding, depending on your dataset, normalize the data. There are many normalization options. See the code comments for more information.
- Apply the same preprocessing steps to both the train and test datasets, using parameters (modes, medians, scaling factors) estimated from the train data only.
- Fit the model on the train data.
- Validate the model against the test data.
- Calculate the accuracy of the model.
- Repeat the steps for another model. You may also perform cross-validation.
- Pick the model with the best accuracy and performance.
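
The steps above can be sketched end to end as follows. This is an illustrative sketch, not a definitive implementation: the toy data frame `df`, its columns (`age`, `income`, `city`, `label`), the helper `test_accuracy`, the tuning grid, and the choice of `glm` as the second model are all assumptions made for the example. Only caret functions (`createDataPartition`, `dummyVars`, `preProcess`, `trainControl`, `train`, `confusionMatrix`) are used, with ranger as the Random Forest backend.

```r
library(caret)
set.seed(42)

# Hypothetical toy data: two numeric predictors, one categorical, binary outcome.
n <- 200
df <- data.frame(
  age    = round(rnorm(n, 40, 10)),
  income = rnorm(n, 50000, 12000),
  city   = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
df$label <- factor(ifelse(df$age + rnorm(n, 0, 8) > 40, "yes", "no"))
df$income[sample(n, 15)] <- NA  # inject missing numeric values
df$city[sample(n, 10)]   <- NA  # inject missing categorical values

# 1. Split 80-20; createDataPartition samples within each class of the outcome.
idx   <- createDataPartition(df$label, p = 0.8, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]

# 2. Impute the categorical column with the mode, estimated on the train data only.
mode_city <- names(which.max(table(train$city)))
train$city[is.na(train$city)] <- mode_city
test$city[is.na(test$city)]   <- mode_city

# 3. Encode the categorical column as 0/1 dummy variables.
#    fullRank = TRUE drops one level so a linear model is not rank deficient.
dummies <- dummyVars(~ age + income + city, data = train, fullRank = TRUE)
x_train <- as.data.frame(predict(dummies, newdata = train))
x_test  <- as.data.frame(predict(dummies, newdata = test))

# 4. Median-impute the numeric NAs, then center and scale.
#    Parameters are learned on train and reused unchanged on test.
pp      <- preProcess(x_train, method = c("medianImpute", "center", "scale"))
x_train <- predict(pp, x_train)
x_test  <- predict(pp, x_test)

# 5. Fit a Random Forest (ranger) with 5-fold cross-validation.
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(2, 3), splitrule = "gini", min.node.size = 1)
set.seed(1)
fit_rf <- train(x = x_train, y = train$label, method = "ranger",
                trControl = ctrl, tuneGrid = grid, num.trees = 100)

# 6. Repeat for another model (logistic regression here, as an example).
set.seed(1)
fit_glm <- train(x = x_train, y = train$label, method = "glm", trControl = ctrl)

# 7. Validate on the held-out test data and compute accuracy.
test_accuracy <- function(fit) {
  pred <- predict(fit, newdata = x_test)
  unname(confusionMatrix(pred, test$label)$overall["Accuracy"])
}
acc <- c(ranger = test_accuracy(fit_rf), glm = test_accuracy(fit_glm))
print(acc)

# 8. Pick the model with the best test accuracy.
best_model <- names(which.max(acc))
print(best_model)
```

Accuracy is used here because the list above names it as the selection criterion; with imbalanced classes, other entries of `confusionMatrix(...)$byClass` may be more informative.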