The scranpy package provides Python bindings to the single-cell analysis methods in libscran and related C++ libraries. It performs the standard steps in a typical single-cell analysis, including quality control, normalization, feature selection, dimensionality reduction, clustering and marker detection. scranpy makes heavy use of the BiocPy data structures in its user interface, and uses the mattress package to provide a C++ representation of the underlying matrix data. This package is effectively a mirror of its counterparts in JavaScript (scran.js) and R (scran.chan), which are based on the same underlying C++ libraries and concepts.
Let's load in the famous PBMC 4k dataset from 10X Genomics (available here):
import singlecellexperiment
sce = singlecellexperiment.read_tenx_h5("pbmc4k-tenx.h5")
Then we just need to call one of scranpy's analyze() functions. (We do have to tell it what the mitochondrial genes are, though.)
import scranpy
options = scranpy.AnalyzeOptions()
options.per_cell_rna_qc_metrics_options.subsets = {
    "mito": scranpy.guess_mito_from_symbols(sce.row_data["name"], "mt-")
}
results = scranpy.analyze_sce(sce, options=options)
This will perform all of the usual steps for a routine single-cell analysis, as described in Bioconductor's Orchestrating Single-Cell Analysis book. It returns an object containing clusters, t-SNEs, UMAPs, marker genes, and so on:
results.clusters
results.tsne
results.umap
results.rna_markers
We won't go over the theory here as it's explained more thoroughly in the book. Check out the reference documentation for more details.
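The "mito" subset passed to the QC options above is just a selection of features whose symbols look mitochondrial. As a rough illustration of that idea, and not scranpy's actual implementation, the sketch below assumes a case-insensitive prefix match. The helper `guess_by_prefix` and the toy symbols are hypothetical; `scranpy.guess_mito_from_symbols` may differ in its matching rules and return type.

```python
from typing import List, Optional, Sequence


def guess_by_prefix(symbols: Sequence[Optional[str]], prefix: str) -> List[bool]:
    """Hypothetical helper: flag symbols starting with `prefix`, ignoring case.

    Missing symbols (None) are never flagged.
    """
    prefix = prefix.lower()
    return [s is not None and s.lower().startswith(prefix) for s in symbols]


# Toy gene symbols, not taken from the PBMC dataset.
symbols = ["MT-CO1", "ACTB", "mt-Nd1", None, "MTOR"]

# Boolean mask with one entry per feature.
is_mito = guess_by_prefix(symbols, "mt-")

# The same selection expressed as row indices.
mito_indices = [i for i, flag in enumerate(is_mito) if flag]

print(is_mito)
print(mito_indices)
```

Note that "MTOR" is not flagged because the prefix includes the hyphen, which is why a prefix like "mt-" is a safer guess than "mt" alone.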
To demonstrate a multi-batch analysis, let's grab two batches of PBMC datasets from 10X Genomics (again, available here):
import singlecellexperiment
sce3k = singlecellexperiment.read_tenx_h5("pbmc3k-tenx.h5")
sce4k = singlecellexperiment.read_tenx_h5("pbmc4k-tenx.h5")
They don't have the same features, so we'll just take the intersection of their Ensembl IDs before combining them:
import biocutils
common = biocutils.intersect(sce3k.row_data["id"], sce4k.row_data["id"])
sce3k_common = sce3k[biocutils.match(common, sce3k.row_data["id"]), :]
sce4k_common = sce4k[biocutils.match(common, sce4k.row_data["id"]), :]
import scipy.sparse
combined = scipy.sparse.hstack((sce3k_common.assay(0), sce4k_common.assay(0)))
batch = ["3k"] * sce3k_common.shape[1] + ["4k"] * sce4k_common.shape[1]
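To make the intersect-then-match step above more concrete, here is a minimal pure-Python sketch of the same idea on toy identifiers. The helpers `intersect_ordered` and `match_first` are hypothetical stand-ins written for illustration; the biocutils functions may differ in details such as return types and how unmatched values are reported.

```python
from typing import Dict, List, Optional, Sequence


def intersect_ordered(x: Sequence[str], y: Sequence[str]) -> List[str]:
    """Hypothetical helper: unique elements of x that also occur in y, in x's order."""
    in_y = set(y)
    seen = set()
    out = []
    for item in x:
        if item in in_y and item not in seen:
            seen.add(item)
            out.append(item)
    return out


def match_first(x: Sequence[str], table: Sequence[str]) -> List[Optional[int]]:
    """Hypothetical helper: index of the first occurrence of each x in table.

    Values of x that are absent from table are reported as None.
    """
    first: Dict[str, int] = {}
    for i, item in enumerate(table):
        first.setdefault(item, i)
    return [first.get(item) for item in x]


# Toy Ensembl-like IDs for two batches with different feature sets.
ids_a = ["ENSG01", "ENSG02", "ENSG03", "ENSG04"]
ids_b = ["ENSG04", "ENSG02", "ENSG05"]

common = intersect_ordered(ids_a, ids_b)

# Row indices that reorder each batch to the same shared feature order.
rows_a = match_first(common, ids_a)
rows_b = match_first(common, ids_b)

# One batch label per cell, in the same order as the combined columns
# (here pretending batch "a" has 3 cells and batch "b" has 2).
batch_labels = ["a"] * 3 + ["b"] * 2

print(common, rows_a, rows_b, batch_labels)
```

The key point is that both batches end up with rows in the same order, so that stacking their columns side by side gives a matrix where each row is one gene.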
We can now perform a batch-aware analysis:
import scranpy
options = scranpy.AnalyzeOptions()
options.per_cell_rna_qc_metrics_options.subsets = {
    "mito": scranpy.guess_mito_from_symbols(sce3k_common.row_data["name"], "mt-")
}
options.miscellaneous_options.block = batch
results = scranpy.analyze(combined, options=options)
This yields mostly the same set of results as before, but with an extra MNN-corrected embedding for clustering, visualization, etc.
results.mnn
For data with multiple modalities, let's grab a 10X Genomics immune profiling dataset (see here):
import singlecellexperiment
sce = singlecellexperiment.read_tenx_h5("immune_3.0.0-tenx.h5")
We need to split it into genes and ADTs:
is_gene = [x == "Gene Expression" for x in sce.row_data["feature_type"]]
gene_data = sce[is_gene,:]
is_adt = [x == "Antibody Capture" for x in sce.row_data["feature_type"]]
adt_data = sce[is_adt,:]
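When a dataset contains more than two feature types, it can be convenient to group the rows in a single pass instead of building one mask per type. The sketch below shows this on a toy list of feature types; `indices_by_type` is a hypothetical helper, and the resulting integer indices are assumed to be usable for row subsetting in the same way as the boolean masks above.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def indices_by_type(feature_types: Sequence[str]) -> Dict[str, List[int]]:
    """Hypothetical helper: group row indices by their feature type."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for i, ftype in enumerate(feature_types):
        groups[ftype].append(i)
    return dict(groups)


# Toy feature types, mimicking the "feature_type" column of a 10X matrix.
feature_types = [
    "Gene Expression",
    "Gene Expression",
    "Antibody Capture",
    "Gene Expression",
    "Antibody Capture",
]

groups = indices_by_type(feature_types)

# Fall back to an empty selection if a modality is absent.
gene_rows = groups.get("Gene Expression", [])
adt_rows = groups.get("Antibody Capture", [])

print(gene_rows, adt_rows)
```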
And now we can run the analysis:
import scranpy
options = scranpy.AnalyzeOptions()
options.per_cell_rna_qc_metrics_options.subsets = {
    "mito": scranpy.guess_mito_from_symbols(gene_data.row_data["name"], "mt-")
}
options.per_cell_adt_qc_metrics_options.subsets = {
    "igg": [n.lower().startswith("igg") for n in adt_data.row_data["name"]]
}
results = scranpy.analyze_se(gene_data, adt_se = adt_data, options=options)
This returns ADT-specific results in the relevant fields, as well as a set of combined PCs for use in clustering, visualization, etc.
results.adt_size_factors
results.adt_markers
results.combined_pcs
Most parameters can be changed by setting the relevant fields in the AnalyzeOptions object. For example, we can modify the number of neighbors and resolution used for graph-based clustering:
options.build_snn_graph_options.num_neighbors = 10
options.miscellaneous_options.snn_graph_multilevel_resolution = 2
Or we can fiddle with the various dimensionality reduction parameters:
options.run_pca_options.rank = 50
options.run_tsne_options.perplexity = 20
options.run_umap_options.min_dist = 0.5
The AnalyzeOptions object has a few convenience methods that set the same parameter across multiple *_options attributes. For example, to enable parallel processing in every step:
options.set_threads(5)
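To illustrate the design behind such a convenience method, here is a toy sketch built with dataclasses. All class names, fields and default values (`ToyAnalyzeOptions`, `ToyPcaOptions`, `num_threads` and so on) are hypothetical and are not scranpy's real classes; the sketch only shows one plausible way to propagate a setting to every nested options object that supports it.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ToyPcaOptions:
    rank: int = 25  # made-up default
    num_threads: int = 1


@dataclass
class ToyTsneOptions:
    perplexity: float = 30.0  # made-up default
    num_threads: int = 1


@dataclass
class ToyMiscOptions:
    # No num_threads field, so set_threads() leaves this object alone.
    block: object = None


@dataclass
class ToyAnalyzeOptions:
    run_pca_options: ToyPcaOptions = field(default_factory=ToyPcaOptions)
    run_tsne_options: ToyTsneOptions = field(default_factory=ToyTsneOptions)
    miscellaneous_options: ToyMiscOptions = field(default_factory=ToyMiscOptions)

    def set_threads(self, num_threads: int) -> None:
        """Copy num_threads into every *_options attribute that has that field."""
        for f in fields(self):
            if not f.name.endswith("_options"):
                continue
            step_options = getattr(self, f.name)
            if hasattr(step_options, "num_threads"):
                step_options.num_threads = num_threads


toy = ToyAnalyzeOptions()
toy.set_threads(5)
print(toy.run_pca_options.num_threads, toy.run_tsne_options.num_threads)
```

Individual fields can still be overridden afterwards, for example by assigning `toy.run_tsne_options.num_threads = 1` for a single step.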
Advanced users can even obtain the sequence of steps used internally by analyze() by calling it with dry_run = True:
commands = scranpy.analyze(sce, dry_run = True)
print(commands)
## import scranpy
## import numpy
##
## results = AnalyzeResults()
## ...
Users can then add, remove or replace steps as desired.
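As a sketch of what editing these steps might look like, the example below assumes the dry-run output is a plain string of Python source, as the printed output suggests. The `toy_commands` string is a made-up stand-in, not real scranpy output; the point is only that steps can be replaced or inserted with ordinary string and list operations before the script is run.

```python
# Toy stand-in for dry-run output; the real commands from scranpy will differ.
toy_commands = "\n".join(
    [
        "values = [3, 1, 2]",
        "scaled = [v * 2 for v in values]",
        "total = sum(scaled)",
    ]
)

lines = toy_commands.split("\n")

# Replace a step: use a different scaling factor.
lines[1] = "scaled = [v * 10 for v in values]"

# Add a step: filter the scaled values before the final summary.
lines.insert(2, "scaled = [v for v in scaled if v > 10]")

edited = "\n".join(lines)

# Run the edited script in its own namespace and inspect what it defines.
# Only execute code that you have reviewed and trust.
namespace = {}
exec(edited, namespace)

print(edited)
print(namespace["total"])
```

Saving the edited commands to a file and running it as a normal script is an alternative to `exec` that leaves a record of the customized analysis.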
Steps to set up dependencies:
- Initialize the git submodules in extern/libscran.
- Run cmake . inside the extern/knncolle directory to download the annoy library.
A future version will use CMake to set up the extern directory.
First, build the extern library; this generates a shared object file at src/scranpy/core-[*].so:
python setup.py build_ext --inplace
For typical development workflows, run this to rebuild the extension and run the tests:
python setup.py build_ext --inplace && tox
To rebuild the ctypes bindings with cpptypes:
cpptypes src/scranpy/lib --py src/scranpy/_cpphelpers.py --cpp src/scranpy/lib/bindings.cpp --dll _core
To rebuild the dry run analysis source code:
./scripts/dryrun.py src/scranpy/analyze/live_analyze.py > src/scranpy/analyze/dry_analyze.py