Entropy sorting of single-cell RNA sequencing data reveals the inner cell mass in the human pre-implantation embryo

Stem Cell Reports. 2023 Jan 10;18(1):47-63.

doi: 10.1016/j.stemcr.2022.09.007. Epub 2022 Oct 13.

Arthur Radley¹, Elena Corujo-Simon², Jennifer Nichols², Austin Smith³, Sara-Jane Dunn⁴

Affiliations

¹ Wellcome-MRC Cambridge Stem Cell Institute, University of Cambridge, Jeffrey Cheah Biomedical Centre, Puddicombe Way, Cambridge Biomedical Campus, Cambridge CB2 0AW, UK.
² MRC Human Genetics Unit, MRC Institute of Genetics and Cancer, The University of Edinburgh, Western General Hospital, Edinburgh EH4 2XU, UK.
³ Living Systems Institute, University of Exeter, Stocker Road, Exeter EX4 4QD, UK. Electronic address: austin.smith@exeter.ac.uk.
⁴ Microsoft Research, 21 Station Road, Cambridge CB1 2FB, UK. Electronic address: sjdunn@deepmind.com.

PMID: 36240776
PMCID: PMC9859930
DOI: 10.1016/j.stemcr.2022.09.007

Arthur Radley et al. Stem Cell Reports. 2023.

Abstract

A major challenge in single-cell gene expression analysis is to discern meaningful cellular heterogeneity from technical or biological noise. To address this challenge, we present entropy sorting (ES), a mathematical framework that distinguishes genes indicative of cell identity. ES achieves this in an unsupervised manner by quantifying whether observed correlations between features are more likely to have arisen by random chance or from a dependent relationship, without the need for any user-defined significance threshold. On synthetic data, we demonstrate that removing noisy signals reveals gene expression patterns at a higher resolution than commonly used feature selection methods. We then apply ES to human pre-implantation embryo single-cell RNA sequencing (scRNA-seq) data. Previous studies failed to unambiguously identify early inner cell mass (ICM), suggesting that the human embryo may diverge from the mouse paradigm. In contrast, ES resolves the ICM and reveals sequential lineage bifurcations as in the classical model. ES thus provides a powerful approach for maximizing information extraction from high-dimensional datasets such as scRNA-seq data.

Keywords: feature selection; human embryo inner cell mass; single-cell RNA sequencing.

Conflict of interest statement

Sara-Jane Dunn was an employee at Microsoft Research during this study and is currently employed at DeepMind. Microsoft Research provided co-funding for Arthur Radley’s research council studentship and access to computational resources. Neither Microsoft Research nor DeepMind has directed any aspect of the study or exerted any commercial rights over the results.

Figures

**Figure 1**
Quantifying the dependent relationship between two features. (A) A toy example. The states of each RF sample are sorted into two groups. The QF is then inspected while maintaining the RF sample ordering. (B) The ESE for calculating CE. G₁ (group 1) and G₂ (group 2) are the number of minority and majority states of the RF, respectively. QF_m is the total number of QF minority states. For brevity, we use x to denote QF_m,G1, the number of QF minority states that overlap with the RF minority states, which is the only independent variable. Each constant is highlighted in its corresponding color in (A). (C) Given any observed pair of features, we may form an ESE parabola by fixing the constants of the ESE and calculating the CE for different values of x. Points (1) and (4) correspond to the local and global minimum, respectively. (2) is the maximum CE, where the RF and QF are independent. (3) is the CE corresponding to the observed arrangement in (A). Each of (1)–(4) is illustrated by an example arrangement.
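
The curve described in panel (C) can be illustrated with a minimal sketch. The ESE itself is not reproduced in this record, so the code below assumes that CE is the textbook Shannon conditional entropy of the QF given the RF group; the paper's exact formulation may differ. The function names (`binary_entropy`, `conditional_entropy`, `ce_curve`) and the toy values of G₁, G₂, and QF_m are hypothetical and chosen only for illustration.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a two-state variable with state probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def conditional_entropy(x, g1, g2, qf_m):
    """Assumed CE: entropy of the QF within each RF group, weighted by group size.

    x is the number of QF minority states that fall inside RF group 1.
    """
    n = g1 + g2
    h1 = binary_entropy(x / g1)           # QF entropy inside RF group 1
    h2 = binary_entropy((qf_m - x) / g2)  # QF entropy inside RF group 2
    return (g1 / n) * h1 + (g2 / n) * h2


def ce_curve(g1, g2, qf_m):
    """CE for every feasible overlap x while g1, g2, and qf_m stay fixed."""
    x_min = max(0, qf_m - g2)
    x_max = min(g1, qf_m)
    return {x: conditional_entropy(x, g1, g2, qf_m) for x in range(x_min, x_max + 1)}


# Hypothetical toy values, not taken from the paper.
curve = ce_curve(g1=8, g2=12, qf_m=5)
x_peak = max(curve, key=curve.get)
for x, ce in curve.items():
    print(x, round(ce, 4))
```

Under these assumptions the curve peaks at the overlap expected under independence, x = G₁ · QF_m / (G₁ + G₂) = 2, and falls to a minimum at each end of the feasible range, which matches the shape the caption describes for points (1), (2), and (4).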

**Figure 2**
ES divergence and error potential. (A) ESE parabolas highlighting three toy examples to demonstrate divergence: (i) ground truth for a partially dependent system; (ii) an FN dropout added that does not produce observable divergence on the ESE parabola; (iii) the addition of an FN dropout that generates observable divergence. (B) DPC (Equation 6) introduced to the RF/QF pair due to erroneous data points (example iv) under the assumption of either RF/QF dependence (green line) or independence (orange line).

**Figure 3**
FFAVES and ESFW workflow. Yellow, blue, and green boxes show the proposed workflow for applying FFAVES and ESFW to high-dimensional data for unsupervised feature selection. The purple and red boxes outline each algorithm.

**Figure 4**
FFAVES accurately identifies false negatives and false positives. (A) The synthetic scRNA-seq dataset. (B) Convergence of FN/FP data points identified after each cycle of FFAVES. (C and D) Precision and recall scores of FNs and FPs identified by FFAVES, respectively. (E) Heatmaps of pairwise feature ESSs. Top: before identification of FNs and FPs by FFAVES. Middle: after application of FFAVES. Bottom: ground truth, i.e., synthetic data prior to introduction of FN dropouts. (F) Silhouette scores of the seven main gene groups calculated from the respective ESSs in (E). Dashed lines outline the ground truth silhouette scores. (G) Reduction in FN errors that were intentionally introduced by sub-optimal feature discretization.

**Figure 5**
Performance of FFAVES and ESFW against comparable software. (A) UMAPs of the synthetic dataset before and after imputation. The top two plots show the synthetic data before and after FN dropouts were introduced, with no imputation. (B) Silhouette scores for each of the six main clusters of cells in the synthetic dataset. Black dashed lines mark the silhouette scores of the ground truth data prior to the introduction of FNs. (C) Feature importance weights for all genes in the synthetic data according to ESFW. Top: feature weights estimated from the synthetic data with FNs introduced to the ground truth. Bottom: feature weight estimates after FFAVES has identified statistically significant divergent data points. (D) Precision/recall curves for distinguishing structured and randomly expressed genes. Each line is generated from the ranked gene lists of the respective feature selection software.

**Figure 6**
Independent validation of the FFAVES + ESFW human pre-implantation embryo embedding. (A) The FFAVES + ESFW UMAP embedding overlaid with different label information: (left to right) the datasets that samples originate from, the time point labels from the Petropoulos dataset, and the cell type labels for the Petropoulos dataset that were independently assigned by Stirparo et al. (2018). (B) Example epiblast, hypoblast, and trophectoderm marker expression. See Figure S8 for more examples. (C) Predicted cell type probabilities of individual cells from a classifier trained on human pre-implantation embryo scRNA-seq data from the independent Yanagida et al. (2021) dataset. Gray samples are those that were not processed by the classifier to avoid confounding variables such as batch effects. See Figure S6B for the same analysis with Macaca classifiers. (D) Nearest neighbor embedding where each cell is connected by lines to its 10 most similar samples according to gene expression. See Figure S6C for individual cell type nearest neighbor embeddings.

**Figure 7**
Identification of potential ICM markers. (A) ICM markers were selected based on localized expression in the ICM population of the FFAVES + ESFW UMAP embedding (top row) corroborated in the tSNE embedding generated by Yanagida et al. (bottom row). See Figure S9 and online methods for additional proposed ICM markers. (B) Confocal images of human embryos immunostained for LAMA4 cell surface protein together with SOX17 and OCT4 nuclear transcription factors. Nuclei are visualized with Hoechst staining. The zona pellucida has been removed at E5 and E6 but not at E7 due to the embryo beginning to hatch. Staining patterns were consistent for all embryos examined: E5, n = 8; E6, n = 5; E7, n = 4. Scale bar represents 50 μm.

Cited by

Evidence implicating sequential commitment of the founder lineages in the human blastocyst by order of hypoblast gene activation.
Corujo-Simon E, Radley AH, Nichols J. Development. 2023 May 15;150(10):dev201522. doi: 10.1242/dev.201522. Epub 2023 May 24. PMID: 37102672. Free PMC article.
The human embryo selection arena is associated with transposable element activity.
Osnato A, Pasque V, David L. PLoS Biol. 2023 Jun 20;21(6):e3002153. doi: 10.1371/journal.pbio.3002153. eCollection 2023 Jun. PMID: 37348048. Free PMC article.
Accessing the human trophoblast stem cell state from pluripotent and somatic cells.
Karvas RM, David L, Theunissen TW. Cell Mol Life Sci. 2022 Nov 25;79(12):604. doi: 10.1007/s00018-022-04549-y. PMID: 36434136. Free PMC article. Review.
Self-renewing human naïve pluripotent stem cells dedifferentiate in 3D culture and form blastoids spontaneously.
Guo M, Wu J, Chen C, Wang X, Gong A, Guan W, Karvas RM, Wang K, Min M, Wang Y, Theunissen TW, Gao S, Silva JCR. Nat Commun. 2024 Jan 22;15(1):668. doi: 10.1038/s41467-024-44969-x. PMID: 38253551. Free PMC article.
Branching topology of the human embryo transcriptome revealed by Entropy Sort Feature Weighting.
Radley A, Boeing S, Smith A. Development. 2024 Jun 1;151(11):dev202832. doi: 10.1242/dev.202832. Epub 2024 Jun 13. PMID: 38691188. Free PMC article.
