# Introduction to R
<div style="text-align:center;">
```{r, echo = F}
knitr::include_graphics("img/abacus.png")
```
</div>
By the end of this chapter, you will know how to read and write data, explore it, and (optionally)
visualise it.
## Reading in data with R
Your first job is to actually get the following datasets into an R session.
First install the `{rio}` package (if you don't have it already), then download the following datasets:
- [mtcars.csv](https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/mtcars.csv)
- [mtcars.dta](https://github.com/b-rodrigues/modern_R/raw/master/datasets/mtcars.dta)
- [mtcars.sas7bdat](https://github.com/b-rodrigues/modern_R/raw/master/datasets/mtcars.sas7bdat)
- [multi.xlsx](https://github.com/b-rodrigues/modern_R/raw/master/datasets/multi.xlsx)
Also download the following 4 `csv` files and put them in a directory called `unemployment`:
- [unemp_2013.csv](https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/unemp_2013.csv)
- [unemp_2014.csv](https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/unemp_2014.csv)
- [unemp_2015.csv](https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/unemp_2015.csv)
- [unemp_2016.csv](https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/unemp_2016.csv)
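If you prefer not to click through each link, the four unemployment files can also be fetched from R. This is only a minimal sketch: the URLs are built from the links above, while `unemp_files`, `unemp_urls` and the helper `download_unemployment()` are hypothetical names introduced for this illustration.

```{r}
# Base URL taken from the links listed above
base_url <- "https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment"

# File names follow the pattern unemp_<year>.csv for 2013 to 2016
unemp_files <- paste0("unemp_", 2013:2016, ".csv")
unemp_urls <- paste(base_url, unemp_files, sep = "/")

# Hypothetical helper: creates the target directory if needed,
# then downloads each file into it
download_unemployment <- function(dest_dir = "unemployment") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, unemp_files)
  for (i in seq_along(unemp_urls)) {
    download.file(unemp_urls[i], destfile = dest[i], mode = "wb")
  }
  invisible(dest)
}
```

Calling `download_unemployment()` requires an internet connection and writes into the current working directory by default.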
Finally, download this one as well, but put it in a folder called `problem`:
- [mtcars.csv](https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/problems/mtcars.csv)
Then take a look at chapter 3 of my other book, [Modern R with the {tidyverse}](https://b-rodrigues.github.io/modern_R/reading-and-writing-data.html),
and follow along. It will teach you how to import and export data.
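To give a rough idea of what importing and exporting looks like, here is a minimal sketch that assumes `{rio}` is installed. It uses the built-in `mtcars` data frame and temporary files rather than the downloaded datasets, so the paths and object names (`csv_path`, `rds_path`, `mtcars_csv`, `mtcars_rds`) are made up for illustration.

```{r}
library(rio)

# Write the built-in mtcars data frame to a temporary csv file;
# export() picks the format from the file extension
csv_path <- tempfile(fileext = ".csv")
export(mtcars, csv_path)

# Read it back; import() also infers the format from the extension
mtcars_csv <- import(csv_path)

# Converting between formats works the same way, for example csv to rds
rds_path <- tempfile(fileext = ".rds")
export(mtcars_csv, rds_path)
mtcars_rds <- import(rds_path)

dim(mtcars_csv)
```

The same two functions should work for the `.dta`, `.sas7bdat` and `.xlsx` files listed above, as long as you point them at the right path.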
`{rio}` is a wrapper around many other packages. You can keep using `{rio}`, but it is also a good
idea to know which packages it uses under the hood. For this, take a look at this
[vignette](https://cran.r-project.org/web/packages/rio/vignettes/rio.html).
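As one illustration of what "under the hood" means: according to the `{rio}` documentation, `csv` files are read with `data.table::fread()`. The sketch below assumes both `{rio}` and `{data.table}` are installed; it reads the same temporary file both ways and compares the results. The object names `via_rio` and `via_fread` are made up for this example.

```{r}
library(rio)

# Create a small csv file to read back (temporary file, built-in data)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Import through {rio} ...
via_rio <- import(csv_path)

# ... and directly through {data.table}, asking for a plain data frame
via_fread <- data.table::fread(csv_path, data.table = FALSE)

# Compare column names and values of the two results
all.equal(via_rio, via_fread, check.attributes = FALSE)
```

Knowing the underlying function is handy when you need an option that `import()` does not document itself.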
If you need to import very large datasets (potentially several GBs), you might want to look at
packages like `{vroom}` ([this
benchmark](https://vroom.r-lib.org/articles/benchmarks.html#reading-delimited-files) shows a 1.5 GB
csv file being imported in seconds by `{vroom}`). For even larger files, take a look at
[`{arrow}`](https://arrow.apache.org/docs/r/). This package can efficiently read very large files
(`csv`, `json`, `parquet` and `feather` formats).
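The basic calls look roughly like this. It is a sketch only: a small temporary csv file stands in for a large dataset, each part runs only if the corresponding package is installed, and the object names (`cars_vroom`, `cars_arrow`, `cars_parquet`) are hypothetical.

```{r}
# A temporary csv file stands in for a large dataset here
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# {vroom}: reads delimited files and guesses the delimiter
if (requireNamespace("vroom", quietly = TRUE)) {
  cars_vroom <- vroom::vroom(csv_path, show_col_types = FALSE)
}

# {arrow}: reads csv files and can write and read the parquet format
if (requireNamespace("arrow", quietly = TRUE)) {
  cars_arrow <- arrow::read_csv_arrow(csv_path)
  parquet_path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(cars_arrow, parquet_path)
  cars_parquet <- arrow::read_parquet(parquet_path)
}
```

On a file this small you will not notice any speed difference; the point is only to show which functions to reach for.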
## A little aside on pipes
Since R version 4.1, a forward pipe `|>` is built into the language itself.
It allows you to do this:
```{r}
4 |>
  sqrt()
```
Before R version 4.1, there was already a forward pipe, `%>%`, introduced by the `{magrittr}` package
(and made available by many other packages from the *tidyverse*, like `{dplyr}`):
```{r}
library(dplyr)
4 %>%
  sqrt()
```
Both expressions above are equivalent to `sqrt(4)`. You will see why this is useful very soon. For now,
just know this exists and try to get used to it.
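As a small preview of why this helps, compare a nested call with the same computation written as a chain of pipes. This is just an illustrative sketch; the vector `x` and the names `nested` and `piped` are made up for the example, and the native pipe requires R 4.1 or later.

```{r}
x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out
nested <- round(sqrt(sum(x)), 1)

# With the pipe, the same steps read from top to bottom
piped <- x |>
  sum() |>
  sqrt() |>
  round(1)

identical(nested, piped)
```

Each step passes its result as the first argument of the next function, so `x |> round(1)` is the same as `round(x, 1)`.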
## Exploring and cleaning data with R
Take a look at
[chapter 4](https://b-rodrigues.github.io/modern_R/descriptive-statistics-and-data-manipulation.html#a-first-taste-of-data-manipulation-with-dplyr)
of my other book. Ideally you should study the entire chapter, but for our purposes you should
really focus on sections 4.3, 4.4, 4.5.3, 4.5.4, (optionally) 4.7 and 4.8.
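For a first impression of what data manipulation with `{dplyr}` looks like, here is a minimal sketch on the built-in `mtcars` dataset. The choice of verbs is an assumption about what is useful to see first, not a summary of the chapter, and `mpg_by_cyl` is a name made up for this example.

```{r}
library(dplyr)

# Average fuel consumption and number of cars per cylinder count,
# using the built-in mtcars dataset
mpg_by_cyl <- mtcars |>
  filter(!is.na(mpg)) |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg), n_cars = n()) |>
  arrange(desc(mean_mpg))

mpg_by_cyl
```

Each verb takes a data frame and returns a data frame, which is what makes chaining them with the pipe so convenient.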
## Data visualization
We're not going to focus on visualization due to lack of time. If you need to create graphs,
read [chapter 5](https://b-rodrigues.github.io/modern_R/graphs.html).
## Further reading
[R for Data Science](https://r4ds.had.co.nz/)