Hello
I’m working with four single-cell RNA-seq samples, each from a different genotype (WT, two single knockouts, and one double knockout). To integrate these samples, I used the scVI model, as shown below:
import scanpy as sc
import scvi
# Set up the AnnData object for this model
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key='sample')
# Create the SCVI model
model_scvi = scvi.model.SCVI(adata)
# Train the model
model_scvi.train(max_epochs=500, early_stopping=True)
After training, I obtained the latent representation and used it to compute a UMAP embedding:
# Get the latent representation of the SCVI model
adata.obsm["X_scVI"] = model_scvi.get_latent_representation()
# Calculate the neighbors and UMAP using the X_scVI embedding
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
# Rename cell_type to paper_cell_type
adata.obs.rename(columns={'cell_type' : 'paper_cell_type'}, inplace=True)
# Cluster cells with leiden algorithm
sc.tl.leiden(adata, resolution=0.7, random_state=124)
# Plot UMAPs. Visually identify the leiden clusters corresponding to the
# cell annotations in the paper
sc.pl.umap(adata, color=['leiden'])
Next, I wanted to examine gene expression across clusters. I tried two approaches:
1 - Using the normalized expression from scVI's get_normalized_expression:
# Get the scvi normalized counts
adata.layers['scvi_normalized'] = model_scvi.get_normalized_expression(library_size=1e4)
2 - Using standard log-normalized counts:
# Log-normalize the counts in adata.X
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Save a copy of the log_normalized counts in the layer log_normalized
adata.layers['log_normalized'] = adata.X.copy()
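To make clear what I mean by "standard log-normalized", here is a minimal NumPy sketch of the arithmetic I understand those two scanpy calls to perform: scale each cell to a fixed total, then take log(1 + x). The function name `log_normalize` and the toy matrix are hypothetical and only for illustration; this is not scanpy's actual implementation.

```python
import numpy as np

def log_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: cells with zero total counts are left as zeros
    totals[totals == 0] = 1.0
    scaled = counts / totals * target_sum
    return np.log1p(scaled)

# Hypothetical toy data: two cells, three genes, same proportions, different depth
toy_counts = np.array([[1, 3, 0], [10, 30, 0]])
toy_log_norm = log_normalize(toy_counts)
```

Because both toy cells have the same gene proportions, they end up with identical values after normalization, regardless of sequencing depth.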
Which of these normalized matrices is better suited for downstream analysis? I haven't found benchmarks comparing the results of scVI's get_normalized_expression with standard log normalization. When I plot gene expression on the UMAP using both methods, the results look quite different:
With log normalized matrix:
With get_normalized_expression matrix:
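To go beyond visual inspection, one thing I could do is compute a per-gene correlation across cells between the two matrices. Below is a minimal sketch of that idea on toy data. The function name `per_gene_correlation` and the toy arrays are hypothetical; with the real data, the inputs would be the two layers as dense cells x genes arrays. Since get_normalized_expression returns values that are not log-transformed, I assume it would make sense to apply log1p to the scVI output first so both are on a comparable scale.

```python
import numpy as np

def per_gene_correlation(a, b):
    """Pearson correlation across cells for each gene (column) of two cells x genes matrices."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    denom = np.sqrt((a_c ** 2).sum(axis=0) * (b_c ** 2).sum(axis=0))
    # Genes that are constant in either matrix give 0/0 and come out as NaN
    with np.errstate(divide="ignore", invalid="ignore"):
        return (a_c * b_c).sum(axis=0) / denom

# Hypothetical toy data standing in for the two normalized layers
rng = np.random.default_rng(0)
toy_a = rng.poisson(5, size=(50, 4)).astype(float)
toy_b = toy_a + rng.normal(0, 0.1, size=toy_a.shape)
toy_corr = per_gene_correlation(toy_a, toy_b)
```

Genes with a low correlation between the two layers would be the ones where the choice of matrix matters most for plotting.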
Thank you!