Stem Cell Reports. 2023 Jan 10;18(1):47-63.
doi: 10.1016/j.stemcr.2022.09.007. Epub 2022 Oct 13.

Entropy sorting of single-cell RNA sequencing data reveals the inner cell mass in the human pre-implantation embryo

Arthur Radley et al. Stem Cell Reports.

Abstract

A major challenge in single-cell gene expression analysis is to discern meaningful cellular heterogeneity from technical or biological noise. To address this challenge, we present entropy sorting (ES), a mathematical framework that distinguishes genes indicative of cell identity. ES achieves this in an unsupervised manner by quantifying whether observed correlations between features are more likely to have arisen by random chance or from a dependent relationship, without the need for any user-defined significance threshold. On synthetic data, we demonstrate the removal of noisy signals to reveal a higher resolution of gene expression patterns than commonly used feature selection methods. We then apply ES to human pre-implantation embryo single-cell RNA sequencing (scRNA-seq) data. Previous studies failed to unambiguously identify early inner cell mass (ICM), suggesting that the human embryo may diverge from the mouse paradigm. In contrast, ES resolves the ICM and reveals sequential lineage bifurcations as in the classical model. ES thus provides a powerful approach for maximizing information extraction from high-dimensional datasets such as scRNA-seq data.

Keywords: feature selection; human embryo inner cell mass; single-cell RNA sequencing.

Conflict of interest statement

Sara-Jane Dunn was an employee at Microsoft Research during this study and is currently employed at DeepMind. Microsoft Research provided co-funding for Arthur Radley’s research council studentship and access to computational resources. Neither Microsoft Research nor DeepMind has directed any aspect of the study or exerted any commercial rights over the results.

Figures

Graphical abstract
Figure 1. Quantifying the dependent relationship between two features. (A) A toy example. The states of each RF sample are sorted into two groups. The QF is then inspected while maintaining the RF sample ordering. (B) The ESE for calculating CE. G1 (group 1) and G2 (group 2) are the numbers of minority and majority states of the RF, respectively. QFm is the total number of QF minority states. For brevity, we use x to denote QFm,G1, the number of QF minority states that overlap with the RF minority states, which is the only independent variable. Each constant is highlighted with its corresponding color in (A). (C) Given any observed pair of features, we may form an ESE parabola by fixing the constants of the ESE and calculating the CE for different values of x. Points (1) and (4) correspond to the local and global minima. (2) is the maximum CE, where the RF and QF are independent. (3) is the CE corresponding to the observed arrangement in (A). Each of (1)–(4) is illustrated by an example arrangement.
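The curve described in Figure 1C can be illustrated with a minimal sketch. It assumes that CE is the standard Shannon conditional entropy H(QF | RF) of two binary features, written in terms of the caption's constants G1, G2, and QFm and the variable x. The paper's exact ESE is not reproduced on this page, so the formula, the function names, and the toy values below are illustrative assumptions rather than the authors' implementation.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a two-state variable with minority fraction p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def conditional_entropy(x, g1, g2, qf_m):
    """H(QF | RF) in bits when x of the qf_m QF minority states fall in group 1.

    g1 and g2 are the sizes of the RF minority and majority groups (both > 0).
    Assumption: this standard formula stands in for the paper's ESE.
    """
    n = g1 + g2
    h_g1 = binary_entropy(x / g1)            # QF entropy inside group 1
    h_g2 = binary_entropy((qf_m - x) / g2)   # QF entropy inside group 2
    return (g1 / n) * h_g1 + (g2 / n) * h_g2


def ce_curve(g1, g2, qf_m):
    """CE for every feasible overlap x, with G1, G2, and QFm held constant."""
    x_min = max(0, qf_m - g2)
    x_max = min(g1, qf_m)
    return [(x, conditional_entropy(x, g1, g2, qf_m))
            for x in range(x_min, x_max + 1)]


# Hypothetical toy values: 12 samples, 4 RF minority states, 3 QF minority states.
curve = ce_curve(g1=4, g2=8, qf_m=3)
x_peak = max(curve, key=lambda point: point[1])[0]
for x, ce in curve:
    print(f"x={x}  CE={ce:.4f}")
```

Under these assumptions the curve peaks at x = G1 × QFm / (G1 + G2), the overlap expected when the two features are independent, and falls toward a minimum at each end of the feasible range of x, which matches the qualitative shape of points (1)–(4) in the caption.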
Figure 2. ES divergence and error potential. (A) ESE parabolas highlighting three toy examples to demonstrate divergence: (i) ground truth for a partially dependent system; (ii) an FN dropout added that does not produce observable divergence on the ESE parabola; (iii) the addition of an FN dropout that generates observable divergence. (B) DPC (Equation 6) introduced to the RF/QF pair due to erroneous data points (example iv) under the assumption of either RF/QF dependence (green line) or independence (orange line).
Figure 3. FFAVES and ESFW workflow. Yellow, blue, and green boxes provide the proposed workflow to apply FFAVES and ESFW to high-dimensional data for unsupervised feature selection. The purple and red boxes outline each algorithm.
Figure 4. FFAVES accurately identifies false negatives and false positives. (A) The synthetic scRNA-seq dataset. (B) Convergence of FN/FP data points identified after each cycle of FFAVES. (C and D) Precision and recall scores of FNs and FPs identified by FFAVES, respectively. (E) Heatmaps of pairwise feature ESSs. Top: before identification of FNs and FPs by FFAVES. Middle: after application of FFAVES. Bottom: ground truth, i.e., synthetic data prior to introduction of FN dropouts. (F) Silhouette scores of the seven main gene groups calculated from the respective ESSs in (E). Dashed lines outline the ground truth silhouette scores. (G) Reduction in FN errors that were intentionally introduced by sub-optimal feature discretization.
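The precision and recall scores in Figure 4C and 4D can be read as follows. This is a minimal sketch, assuming the flagged data points and the ground-truth errors are both available as sets of (cell, gene) index pairs. The function name and the toy index pairs are hypothetical and are not taken from the paper or from the FFAVES software.

```python
def precision_recall(flagged, truth):
    """Precision and recall of flagged entries against ground-truth errors.

    flagged, truth: sets of (cell_index, gene_index) pairs.
    Returns 0.0 for a score whose denominator is empty.
    """
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data: four true FN dropouts, five entries flagged as FNs.
truth_fn = {(0, 1), (0, 3), (2, 2), (3, 0)}
flagged_fn = {(0, 1), (2, 2), (3, 0), (3, 3), (1, 1)}

precision, recall = precision_recall(flagged_fn, truth_fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the five flagged entries are real dropouts, and three of the four real dropouts are found. High precision means few correct values are wrongly flagged, while high recall means few true errors are missed.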
Figure 5. Performance of FFAVES and ESFW against comparable software. (A) UMAPs of the synthetic dataset before and after imputation. The top two plots show the synthetic data before and after FN dropouts were introduced, with no imputation. (B) Silhouette scores for each of the six main clusters of cells in the synthetic dataset. Black dashed lines mark the silhouette scores of the ground truth data prior to the introduction of FNs. (C) Feature importance weights for all genes in the synthetic data according to ESFW. Top: feature weights estimated from the synthetic data with FNs introduced to the ground truth. Bottom: feature weight estimates after FFAVES has identified statistically significant divergent data points. (D) Precision/recall curves for distinguishing structured and randomly expressed genes. Each line is generated from the ranked gene lists of the respective feature selection software.
Figure 6. Independent validation of the FFAVES + ESFW human pre-implantation embryo embedding. (A) The FFAVES + ESFW UMAP embedding overlaid with different label information: (left to right) the datasets that samples originate from, the time point labels from the Petropoulos dataset, and the cell type labels for the Petropoulos dataset that were independently assigned by Stirparo et al. (2018). (B) Example epiblast, hypoblast, and trophectoderm marker expression. See Figure S8 for more examples. (C) Predicted cell type probabilities of individual cells from a classifier trained on human pre-implantation embryo scRNA-seq data from the independent Yanagida et al. (2021) dataset. Gray samples are those that were not processed by the classifier to avoid confounding variables such as batch effects. See Figure S6B for the same analysis with Macaca classifiers. (D) Nearest neighbor embedding where each cell is connected by lines to its 10 most similar samples according to gene expression. See Figure S6C for individual cell type nearest neighbor embeddings.
Figure 7. Identification of potential ICM markers. (A) ICM markers were selected based on localized expression in the ICM population of the FFAVES + ESFW UMAP embedding (top row) corroborated in the tSNE embedding generated by Yanagida et al. (bottom row). See Figure S9 and online methods for additional proposed ICM markers. (B) Confocal images of human embryos immunostained for LAMA4 cell surface protein together with SOX17 and OCT4 nuclear transcription factors. Nuclei are visualized with Hoechst staining. The zona pellucida has been removed at E5 and E6 but not at E7 due to the embryo beginning to hatch. Staining patterns were consistent for all embryos examined: E5, n = 8; E6, n = 5; E7, n = 4. Scale bar represents 50 μm.
