Nelson-Gon/manymodelr

Development Version 0.2.5.9000

  • Installing the package

  • From CRAN(0.2.4)

install.packages("manymodelr")

  • From GitHub
# Development version from the default branch (more stable).
# Any one of the following commands works:
remotes::install_github("Nelson-Gon/manymodelr")
devtools::install_github("Nelson-Gon/manymodelr")
devtools::install_github("Nelson-Gon/manymodelr", build_vignettes = TRUE) # also builds the vignettes

For the current (unstable) development version, please see the develop branch.

To install this less stable version:

devtools::install_github("Nelson-Gon/manymodelr@develop")

For previous users, please see the NEWS.md file for a list of changes and/or additions.

  • Loading the package

library(manymodelr)

Example usage of major functions

  1. multi_model_1
suppressMessages(library(caret))
set.seed(520)
train_set<-createDataPartition(iris$Species,p=0.8,list=FALSE)
valid_set<-iris[-train_set,]
train_set<-iris[train_set,]
ctrl<-trainControl(method="cv",number=5)

m <- multi_model_1(train_set, "Species", ".", c("knn", "rpart"), "Accuracy", ctrl, newdata = valid_set)

The call above trains both models on train_set and validates them by predicting on valid_set.

Results:

To get the metrics for all our models, we can proceed as follows:

m$Metrics
# A tibble: 1 x 2
    knn rpart
  <dbl> <dbl>
1     1 0.933

To obtain the predicted values (on the validation set in this case):

head(m$Predictions)
# A tibble: 6 x 2
  knn    rpart 
  <fct>  <fct> 
1 setosa setosa
2 setosa setosa
3 setosa setosa
4 setosa setosa
5 setosa setosa
6 setosa setosa

One can also get all the corresponding model statistics as follows:

m$modelInfo

  2. A related function is multi_model_2, which fits a model and generates predictions in a single call. It builds on top of other modeling functions, so it can work with any model from any package. For demonstration purposes, only linear models are shown.
# fit a linear model and get predictions
head(multi_model_2(iris[1:50,],iris[50:99,],"Sepal.Length","Petal.Length","lm"))
 
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species predicted
1          5.1         3.5          1.4         0.2  setosa  4.972378
2          4.9         3.0          1.4         0.2  setosa  6.761943
3          4.7         3.2          1.3         0.2  setosa  6.653485
4          4.6         3.1          1.5         0.2  setosa  6.870402
5          5.0         3.6          1.4         0.2  setosa  6.382339
6          5.4         3.9          1.7         0.4  setosa  6.707714

We can also fit a linear model with several predictors, as shown below:

head(multi_model_2(iris[1:50,],iris[50:99,],"Sepal.Length","Petal.Length + Sepal.Width","lm"))
    
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species predicted
1          5.1         3.5          1.4         0.2  setosa  4.902999
2          4.9         3.0          1.4         0.2  setosa  5.771541
3          4.7         3.2          1.3         0.2  setosa  5.714857
4          4.6         3.1          1.5         0.2  setosa  5.761483
5          5.0         3.6          1.4         0.2  setosa  4.972473
6          5.4         3.9          1.7         0.4  setosa  5.476232

As stated, the function can work with any model type from any package. However, it relies on a few other functions that have only been tested on models built with lm, glm, aov, and lme4.
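
The fit-then-predict pattern described above can be sketched in base R. This is an illustration only, not manymodelr's implementation: the helper name fit_and_predict is hypothetical, and the sketch assumes that the modeling function accepts formula and data arguments and has a predict method.

```r
# Hypothetical helper (not part of manymodelr): fit a model on old_data,
# then append predictions for new_data as a "predicted" column.
fit_and_predict <- function(old_data, new_data, yname, xname, model_type) {
  # Build the formula from strings, e.g. "Sepal.Length ~ Petal.Length"
  model_formula <- as.formula(paste(yname, "~", xname))
  # Call the modeling function by name (e.g. "lm" or "glm")
  fitted_model <- do.call(model_type,
                          list(formula = model_formula, data = old_data))
  new_data$predicted <- as.numeric(predict(fitted_model, newdata = new_data))
  new_data
}

sketch_preds <- fit_and_predict(iris[1:50, ], iris[51:100, ],
                                "Sepal.Length", "Petal.Length + Sepal.Width",
                                "lm")
head(sketch_preds)
```

Passing the model type as a string keeps the helper independent of any one modeling package, which is the design idea the text describes.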

  3. fit_model

To fit a model without predicting, use fit_model. It offers a user-friendly, readable syntax and, again, can build any model type. Example usage is shown below:


# Make some data
iris1 <- iris[1:50,]
iris2 <-iris[51:100,]
lm_model <- fit_model(iris1,"Sepal.Width","Sepal.Length", "lm")
aov_model <- fit_model(iris,"Sepal.Width","Species", "aov")

# Multilinear

mlm_model <- fit_model(iris1,"Sepal.Width","Sepal.Length + Petal.Length", "lm")

# glm
glm_model <- fit_model(iris1,"Sepal.Width","Sepal.Length", "glm")


To extract model attributes from the above models, we can use extract_model_info as shown below:

# extract coefficients
extract_model_info(lm_model, "coeffs")

               Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  -0.5694327  0.5217119 -1.091470 2.805148e-01
Sepal.Length  0.7985283  0.1039651  7.680738 6.709843e-10

extract_model_info(lm_model, "p_value")

(Intercept) Sepal.Length 
2.805148e-01 6.709843e-10 

# get predictors
extract_model_info(lm_model,"predictors")

Sepal.Length

# get response variable
extract_model_info(lm_model,"response")
Sepal.Width

# glm
extract_model_info(glm_model, "aic")

[1] 9.800332

To get several values:

extract_model_info(lm_model,c("p_value","coeffs","predictors"))

$p_value
 (Intercept) Sepal.Length 
2.805148e-01 6.709843e-10 

$coeffs
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  -0.5694327  0.5217119 -1.091470 2.805148e-01
Sepal.Length  0.7985283  0.1039651  7.680738 6.709843e-10

$predictors
Sepal.Length

To see the currently supported model types, please see help(extract_model_info). To request support for a given model, please file an issue on the repository's GitHub issue tracker. Exploration of the other available values is left to the user.
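
For comparison, the same attributes can be pulled out of a fitted model with base R alone. The sketch below is not how extract_model_info is implemented; it only shows base-R calls that yield the same kinds of values for lm and glm fits. The object names (lm_fit, glm_fit, and so on) are chosen here for illustration.

```r
# Fit the same models as above using base R directly
lm_fit  <- lm(Sepal.Width ~ Sepal.Length, data = iris[1:50, ])
glm_fit <- glm(Sepal.Width ~ Sepal.Length, data = iris[1:50, ])

# Coefficient table (Estimate, Std. Error, t value, p value)
coeffs <- coef(summary(lm_fit))

# p values are one column of that table
p_values <- coeffs[, "Pr(>|t|)"]

# Predictor and response names recovered from the model formula
predictors <- attr(terms(lm_fit), "term.labels")
response   <- all.vars(formula(lm_fit))[1]

# Akaike information criterion of the glm fit
model_aic <- AIC(glm_fit)

coeffs
predictors
response
```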

  4. add_model_residuals and add_model_predictions

To add predictions or residuals to a data set, we can use add_model_predictions and add_model_residuals respectively.


head(add_model_predictions(lm_model, iris1, iris2))

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species predicted
1          5.1         3.5          1.4         0.2  setosa  5.020265
2          4.9         3.0          1.4         0.2  setosa  4.541148
3          4.7         3.2          1.3         0.2  setosa  4.940413
4          4.6         3.1          1.5         0.2  setosa  3.822473
5          5.0         3.6          1.4         0.2  setosa  4.621001
6          5.4         3.9          1.7         0.4  setosa  3.982179


head(add_model_residuals(lm_model, iris1))

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   residuals
1          5.1         3.5          1.4         0.2  setosa -0.00306166
2          4.9         3.0          1.4         0.2  setosa -0.34335600
3          4.7         3.2          1.3         0.2  setosa  0.01634966
4          4.6         3.1          1.5         0.2  setosa -0.00379751
5          5.0         3.6          1.4         0.2  setosa  0.17679117
6          5.4         3.9          1.7         0.4  setosa  0.15737985

# dplyr compatible (the pipe requires dplyr or magrittr to be loaded)
library(dplyr)
iris1 %>% 
  add_model_predictions(model = lm_model, new_data = iris2) %>% 
  head()

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species predicted
1          5.1         3.5          1.4         0.2  setosa  5.020265
2          4.9         3.0          1.4         0.2  setosa  4.541148
3          4.7         3.2          1.3         0.2  setosa  4.940413
4          4.6         3.1          1.5         0.2  setosa  3.822473
5          5.0         3.6          1.4         0.2  setosa  4.621001
6          5.4         3.9          1.7         0.4  setosa  3.982179
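
As a rough base-R equivalent, residuals and predictions can be bound to a data set as new columns. This sketch is not the package's implementation; it mirrors the output above, where predictions computed on the second data set are attached to the rows of the first (both have 50 rows). The object names are illustrative.

```r
iris1 <- iris[1:50, ]
iris2 <- iris[51:100, ]
lm_fit <- lm(Sepal.Width ~ Sepal.Length, data = iris1)

# Residuals of the fit, attached to the data the model was fitted on
with_residuals <- cbind(iris1, residuals = unname(residuals(lm_fit)))

# Predictions for iris2, attached to iris1 (same number of rows)
with_predictions <- cbind(iris1,
                          predicted = unname(predict(lm_fit, newdata = iris2)))

head(with_residuals)
head(with_predictions)
```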

  5. get_var_corr

As the name suggests, this function provides a convenient way to get variable correlations. It computes the correlation between one variable and all other variables in the data set.

Previously, one would set get_all to TRUE if they wanted to get correlations between all variables. This argument has been dropped in favor of simply supplying an optional other_vars vector if one does not want to get all correlations.

Sample usage:


# get all correlations
corrs <- get_var_corr(mtcars,comparison_var="mpg")


The result is as follows (Pearson correlation by default):


head(corrs)

    comparison_Var other_Var      p_value  correlation    lower_ci   upper_ci
1            mpg       cyl   6.112687e-10  -0.8521620 -0.92576936 -0.7163171
2            mpg      disp   9.380327e-10  -0.8475514 -0.92335937 -0.7081376
3            mpg        hp   1.787835e-07  -0.7761684 -0.88526861 -0.5860994
4            mpg      drat   1.776240e-05   0.6811719  0.43604838  0.8322010
5            mpg        wt   1.293959e-10  -0.8676594 -0.93382641 -0.7440872
6            mpg      qsec   1.708199e-02   0.4186840  0.08195487  0.6696186


Previously, one would also set drop_columns to TRUE to drop factor columns. Now, the user simply provides a character vector specifying which column types (classes) should be dropped. It defaults to c("character","factor").

# purely demonstrative
get_var_corr(iris,"Sepal.Length",other_vars="Petal.Length",drop_columns= c("factor","character"), method="spearman", exact=FALSE)

#  comparison_Var    other_var      p.value   correlation
# 1   Sepal.Length Petal.Length 3.443087e-50   0.8818981
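
To make the idea concrete, the one-against-all correlations can be approximated in base R with cor.test. This is a sketch of the concept, not the package's code; the column names simply follow the output shown above, and corrs_base is an illustrative name.

```r
comparison_var <- "mpg"
other_vars <- setdiff(names(mtcars), comparison_var)

# One cor.test per remaining column, collected into a data frame
corr_rows <- lapply(other_vars, function(v) {
  ct <- cor.test(mtcars[[comparison_var]], mtcars[[v]], method = "pearson")
  data.frame(comparison_var = comparison_var,
             other_var      = v,
             p.value        = ct$p.value,
             correlation    = unname(ct$estimate),
             lower_ci       = ct$conf.int[1],
             upper_ci       = ct$conf.int[2])
})
corrs_base <- do.call(rbind, corr_rows)

head(corrs_base)
```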

  6. A closely related function is get_var_corr_ (note the trailing underscore), which gives finer control over which correlations to obtain and can compute correlations for every pairwise combination of columns.
head(get_var_corr_(mtcars, method="spearman", exact=FALSE))

   comparison_var  other_var      p.value correlation
1            mpg       cyl 4.690287e-13  -0.9108013
2            mpg      disp 6.370336e-13  -0.9088824
3            mpg        hp 5.085969e-12  -0.8946646
4            mpg      drat 5.381347e-05   0.6514555
5            mpg        wt 1.487595e-11  -0.8864220
6            mpg      qsec 7.055765e-03   0.4669358

To use only a few columns, we specify a list of columns in subset_cols:


head(get_var_corr_(mtcars, method="spearman", exact=FALSE, subset_cols=list(c("mpg","disp"),  c("wt","drat"))))
  
   comparison_Var other_var      p.value correlation
4             mpg      drat 5.381347e-05   0.6514555
5             mpg        wt 1.487595e-11  -0.8864220
21           disp      drat 1.613884e-05  -0.6835921
22           disp        wt 3.346362e-12   0.8977064
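
Combination-wise correlations of this kind can be sketched in base R with combn and cor.test. The code below is an illustration under that assumption, not the package's implementation; the filtering step at the end imitates the effect of subset_cols as it appears in the output above.

```r
# Every unordered pair of columns
col_pairs <- combn(names(mtcars), 2, simplify = FALSE)

# Spearman correlation and p value for each pair
pair_rows <- lapply(col_pairs, function(p) {
  ct <- cor.test(mtcars[[p[1]]], mtcars[[p[2]]],
                 method = "spearman", exact = FALSE)
  data.frame(comparison_var = p[1],
             other_var      = p[2],
             p.value        = ct$p.value,
             correlation    = unname(ct$estimate))
})
pairwise_corrs <- do.call(rbind, pair_rows)

# Keep only the pairs of interest
subset_rows <- subset(pairwise_corrs,
                      comparison_var %in% c("mpg", "disp") &
                        other_var %in% c("wt", "drat"))

head(pairwise_corrs)
subset_rows
```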


  7. To plot the above, one can use plot_corr as shown below:

plot_corr(mtcars, show_which = "corr",
          round_which = "correlation", decimals = 2,
          x = "other_var", y = "comparison_var",
          plot_style = "squares", width = 1.1,
          custom_cols = c("green", "blue", "red"),
          colour_by = "correlation")

Correlations Plot

To show significance instead (i.e. based on p values), set show_which to "signif". The default, show_which="corr", displays correlations.

plot_corr(mtcars, x="other_var", y="comparison_var",
          plot_style = "squares",show_which = "signif",
          colour_by = "p.value", custom_cols = c("indianred4","skyblue","red"))


Signif plot

You can explore more options via help(plot_corr) or ?plot_corr. Because the function uses a ggplot2 backend, themes can be changed by adding theme components to the plot.

  8. rowdiff

If one needs to obtain differences between rows, rowdiff is designed to do exactly that.


head(rowdiff(iris,direction="reverse", exclude="factor"))


This gives us the following result:


  Sepal.Length Sepal.Width Petal.Length Petal.Width
1           NA          NA           NA          NA
2         -0.2        -0.5          0.0         0.0
3         -0.2         0.2         -0.1         0.0
4         -0.1        -0.1          0.2         0.0
5          0.4         0.5         -0.1         0.0
6          0.4         0.3          0.3         0.2


To replace the NAs introduced by the calculation, we can set na.rm to TRUE and specify na_action (which uses na_replace).


# since reverse, first value is replaced with 0.
head(rowdiff(mtcars,direction="reverse", na.rm=TRUE, na_action="value", value=0))

   mpg  cyl disp  hp  drat     wt  qsec vs am gear carb
1  0.0   0    0   0  0.00  0.000  0.00  0  0    0    0
2  0.0   0    0   0  0.00  0.255  0.56  0  0    0    0
3  1.8  -2  -52 -17 -0.05 -0.555  1.59  1  0    0   -3
4 -1.4   2  150  17 -0.77  0.895  0.83  0 -1   -1    0
5 -2.7   2  102  65  0.07  0.225 -2.42 -1  0    0    1
6 -0.6  -2 -135 -70 -0.39  0.020  3.20  1  0    0   -1
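
The outputs above are consistent with each row being subtracted from the one after it, leaving the first row undefined. A base-R sketch of that calculation, which is not the package's own code and uses illustrative object names, looks like this:

```r
# Keep only numeric columns (comparable to exclude = "factor")
numeric_iris <- iris[vapply(iris, is.numeric, logical(1))]

# Current row minus previous row; the first row has no predecessor
row_diffs <- as.data.frame(lapply(numeric_iris, function(x) c(NA, diff(x))))

# Replace the NAs introduced by the calculation with a fixed value
filled_diffs <- row_diffs
filled_diffs[is.na(filled_diffs)] <- 0

head(row_diffs)
head(filled_diffs)
```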


na_replace used above works as shown below.


test_data <- data.frame(A=c(1,2,NA,NA), B= c(1,3,4,NA))
# replace NAs with the mode of the non-NA values
na_replace(test_data, how="get_mode")
  A B
1 1 1
2 2 3
3 1 4
4 1 1

The above is less useful when values should be replaced within groups. na_replace_grouped does just that.


test_groups <- data.frame(groups = c(1,1,1,2,2,2), values = c(2,NA,2,3,NA,3))

na_replace_grouped(test_groups,group_by="groups",how="get_mode")

  groups values
1      1      2
2      1      2
3      1      2
4      2      3
5      2      3
6      2      3
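
Mode-based replacement, overall and by group, can be sketched in base R as follows. The helper names mode_value and fill_with_mode are hypothetical and this is not the package's implementation. Breaking ties in favor of the value that occurs first is a choice made for this sketch; it happens to agree with the outputs shown above.

```r
# Hypothetical helper: most frequent non-NA value (first one wins ties)
mode_value <- function(x) {
  non_missing <- x[!is.na(x)]
  candidates <- unique(non_missing)
  candidates[which.max(tabulate(match(non_missing, candidates)))]
}

# Hypothetical helper: replace NAs in a vector with its mode
fill_with_mode <- function(x) {
  x[is.na(x)] <- mode_value(x)
  x
}

# Whole-column replacement
fill_with_mode(c(1, 2, NA, NA))

# Group-wise replacement: ave() applies the helper within each group
sketch_groups <- data.frame(groups = c(1, 1, 1, 2, 2, 2),
                            values = c(2, NA, 2, 3, NA, 3))
sketch_groups$values <- ave(sketch_groups$values, sketch_groups$groups,
                            FUN = fill_with_mode)
sketch_groups
```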

Replacing values with mean, sd, etc. is no longer supported. Use mde instead, which focuses on missingness.

Space constraints prevent a detailed exploration of the package here. A more thorough walkthrough is provided in the vignettes, which can be opened as shown below:


browseVignettes("manymodelr")


For a complete list of available functions, please use:

 
help(package="manymodelr")

 

Please note that the 'manymodelr' project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Thank You and Happy Coding!