-
Notifications
You must be signed in to change notification settings - Fork 8
/
nebulosa_seurat.Rmd
222 lines (167 loc) · 6.75 KB
/
nebulosa_seurat.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
---
title: "Visualization of gene expression with Nebulosa (in Seurat)"
author: "Jose Alquicira-Hernandez"
package: Nebulosa
output:
BiocStyle::html_document:
self_contained: yes
toc: true
toc_float: true
toc_depth: 2
code_folding: show
vignette: >
%\VignetteIndexEntry{Visualization of gene expression with Nebulosa (in Seurat)}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Overview
Due to the sparsity observed in single-cell data (e.g. RNA-seq, ATAC-seq), the
visualization of cell features (e.g. gene, peak) is frequently affected and
unclear, especially when it is overlaid with clustering to annotate cell
types. `Nebulosa` is an R package to visualize data from single cells based on
kernel density estimation. It aims to recover the signal from dropped-out
features by incorporating the similarity between cells allowing a "convolution"
of the cell features.
# Import libraries
For this vignette, let's use `Nebulosa` with the `Seurat` package.
First, we'll do a brief/standard data processing.
```{r import_libraries}
library("Nebulosa")
library("Seurat")
library("BiocFileCache")
```
# Data pre-processing
Let's download a dataset of 3k PBMCs (available from 10X Genomics). This same
dataset is commonly used in Seurat vignettes. The code below will download,
store, and uncompress the data in a temporary directory.
```{r download_and_untar_file}
bfc <- BiocFileCache(ask = FALSE)
data_file <- bfcrpath(bfc, file.path(
"https://s3-us-west-2.amazonaws.com/10x.files/samples/cell",
"pbmc3k",
"pbmc3k_filtered_gene_bc_matrices.tar.gz"
))
untar(data_file, exdir = tempdir())
```
Then, we can read the gene expression matrix using the `Read10X` from `Seurat`
```{r read_data}
data <- Read10X(data.dir = file.path(tempdir(),
"filtered_gene_bc_matrices",
"hg19"
))
```
Let's create a Seurat object with features being expressed in at least 3 cells
and cells expressing at least 200 genes.
```{r create_seurat_object}
pbmc <- CreateSeuratObject(
counts = data,
project = "pbmc3k",
min.cells = 3,
min.features = 200
)
```
Remove outlier cells based on the number of genes being expressed in each cell
(below 2500 genes) and expression of mitochondrial genes (below 5%).
```{r qc}
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA < 2500 & percent.mt < 5)
```
## Data normalization
Let's use `SCTransform` to stabilize the variance of the data by regressing out
the effect of the sequencing depth from each cell.
```{r norm, message=FALSE, warning=FALSE}
pbmc <- SCTransform(pbmc, verbose = FALSE)
```
## Dimensionality reduction
Once the data is normalized and scaled, we can run a
_Principal Component Analysis_ (PCA) first to reduce the dimensions of our
data from 26286 features to 50 principal components. To visualize the principal
components, we can run a
_Uniform Manifold Approximation and Projection for Dimension Reduction_ (UMAP)
using the first 30 principal components to obtain a two-dimentional space.
```{r dim_red, message=FALSE, warning=FALSE}
pbmc <- RunPCA(pbmc)
pbmc <- RunUMAP(pbmc, dims = 1:30)
```
## Clustering
To assess cell similarity, let's cluster the data by constructing a
_Shared Nearest Neighbor _(SNN) _Graph_ using the first 30 principal components
and applying the _Louvain algorithm_.
```{r clustering, message=FALSE, warning=FALSE}
pbmc <- FindNeighbors(pbmc, dims = 1:30)
pbmc <- FindClusters(pbmc)
```
# Visualize data with `Nebulosa`
The main function from `Nebulosa` is the `plot_density`. For usability, it
resembles the `FeaturePlot` function from `Seurat`.
Let's plot the kernel density estimate for `CD4` as follows
```{r plot_cd4}
plot_density(pbmc, "CD4")
```
For comparison, let's also plot a standard scatterplot using `Seurat`
```{r cd4_comparison}
FeaturePlot(pbmc, "CD4")
FeaturePlot(pbmc, "CD4", order = TRUE)
```
By smoothing the data, `Nebulosa` allows a better visualization of the global
expression of CD4 in myeloid and CD4+ T cells. Notice that the "random"
expression of CD4 in other areas of the plot is removed as the expression of
this gene is not supported by many cells in those areas. Furthermore, CD4+
cells appear to show considerable dropout rate.
Let's plot the expression of CD4 with `Nebulosa` next to the clustering results
```{r cd4_and_clustering}
DimPlot(pbmc, label = TRUE, repel = TRUE)
```
We can now easily identify that clusters `0` and `2` correspond to CD4+ T cells
if we plot CD3D too.
```{r plot_cd3d}
plot_density(pbmc, "CD3D")
```
# Multi-feature visualization
Characterize cell populations usually relies in more than a single marker.
Nebulosa allows the visualization of the joint density of from multiple
features in a single plot.
## Identifying Naive CD8+ T cells
Users familiarized with PBMC datasets may know that CD8+ CCR7+ cells usually
cluster next to CD4+ CCR7+ and separate from the rest of CD8+ cells. Let's aim
to identify Naive CD8+ T cells. To do so, we can just add another gene to the
vector containing the features to visualize.
```{r fig.height=10}
p3 <- plot_density(pbmc, c("CD8A", "CCR7"))
p3 + plot_layout(ncol = 1)
```
`Nebulosa` can return a *joint density* plot by multiplying the densities
from all query genes by using the `joint = TRUE` parameter:
```{r fig.height=14}
p4 <- plot_density(pbmc, c("CD8A", "CCR7"), joint = TRUE)
p4 + plot_layout(ncol = 1)
```
When compared to the clustering results, we can easily identify that Naive
CD8+ T cells correspond to cluster `8`.
`Nebulosa` returns the density estimates for each gene along with the joint
density across all provided genes. By setting `combine = FALSE`, we can obtain
a list of ggplot objects where the last plot corresponds to the joint density
estimate.
```{r}
p_list <- plot_density(pbmc, c("CD8A", "CCR7"), joint = TRUE, combine = FALSE)
p_list[[length(p_list)]]
```
## Identifying Naive CD4+ T cells
Likewise, the identification of Naive CD4+ T cells becomes straightforward by
combining `CD4` and `CCR7`:
```{r fig.height=14}
p4 <- plot_density(pbmc, c("CD4", "CCR7"), joint = TRUE)
p4 + plot_layout(ncol = 1)
```
Notice that these cells are mainly constrained to cluster `0`
```{r fig.height=10}
p4[[3]] / DimPlot(pbmc, label = TRUE, repel = TRUE)
```
# Conclusions
In summary,`Nebulosa`can be useful to recover the signal from dropped-out genes
and improve their visualization in a two-dimensional space. We recommend using
`Nebulosa` particularly for dropped-out genes. For fairly well-expressed genes,
the direct visualization of the gene expression may be preferable. We encourage
users to use `Nebulosa` along with the core visualization methods from the
`Seurat` and `Bioconductor` environments as well as other visualization methods
to draw more informed conclusions about their data.