Sci Rep. 2021 Jan 11;11(1):332.
doi: 10.1038/s41598-020-79704-1.

Data-driven detection of subtype-specific differentially expressed genes

Lulu Chen et al. Sci Rep. 2021.

Abstract

Among multiple subtypes of tissue or cell, subtype-specific differentially expressed genes (SDEGs) are defined as genes that are most upregulated in one subtype and not in any other. Detecting SDEGs plays a critical role in the molecular characterization and deconvolution of multicellular complex tissues. Classic differential analysis assumes a null hypothesis whose test statistic is not subtype-specific, and can therefore produce a high false positive rate and/or low detection power. Here we first introduce a One-Versus-Everyone Fold Change (OVE-FC) test for detecting SDEGs. We then propose a scaled test statistic (OVE-sFC) for assessing the statistical significance of SDEGs that applies a mixture null distribution model and a tailored permutation test. The OVE-FC/sFC test was validated on both type I error rate and detection power using extensive simulation data sets generated from real gene expression profiles of purified subtype samples. The OVE-FC/sFC test was then applied to two benchmark gene expression data sets of purified subtype samples and detected many known or previously unknown SDEGs. Subsequent supervised deconvolution of synthesized bulk expression data, using the SDEGs detected from the independent purified expression data by the OVE-FC/sFC test, showed superior deconvolution accuracy when compared with popular peer methods.
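As a rough illustration of the one-versus-everyone idea, the sketch below computes, for each gene, the gap between the highest and second-highest subtype mean on a log2 scale, which equals the smallest log fold change of the top subtype against every other subtype. It is a minimal sketch and not the authors' implementation: the function name `ove_log_fc`, the genes-by-samples input layout, and the assumption that the data are already log2-transformed are all hypothetical choices made here for illustration.

```python
import numpy as np

def ove_log_fc(log_expr, labels):
    """Hypothetical helper: one-versus-everyone log fold change per gene.

    log_expr : array of shape (genes, samples), assumed to be on a log2 scale.
    labels   : subtype label for each sample (length = samples).
    Returns (gap, top_subtype), where gap is the top subtype mean minus the
    second-highest subtype mean for each gene.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    labels = np.asarray(labels)
    subtypes = np.unique(labels)
    # Mean expression of each gene within each subtype: shape (genes, subtypes).
    means = np.stack(
        [log_expr[:, labels == s].mean(axis=1) for s in subtypes], axis=1
    )
    ordered = np.sort(means, axis=1)
    gap = ordered[:, -1] - ordered[:, -2]          # top mean minus runner-up
    top_subtype = subtypes[np.argmax(means, axis=1)]
    return gap, top_subtype

# Toy input (made-up numbers): gene 0 is high only in subtype B, gene 1 is flat.
toy = np.array([[1, 1, 5, 5, 2, 2],
                [3, 3, 3, 3, 3, 3]])
gap, top = ove_log_fc(toy, ["A", "A", "B", "B", "C", "C"])
```

A gene with a large gap is a candidate SDEG for its top subtype; a gap near zero means at least two subtypes share the highest expression, which is the non-SDEG case.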

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Illustrative simplex of three subtypes. Given the definition of SDEGs, and for simplicity, consider a scenario where three subtypes have ordered mean values $\mu_{(1)} \le \mu_{(2)} \le \mu_{(3)}$, and define $d_j = \mu_{(3)} - \mu_{(2)}$. The SDEGs and non-SDEGs can be illustrated in a simplex plot, where yellow/red points are SDEGs under the alternative hypothesis $d_j > 0$ and blue/green points are non-SDEGs under the null hypothesis $d_j = 0$. The OVE-sFC test assesses whether $d_j$ is significantly larger than zero, which matches the definition of an SDEG.
Figure 2
Assessment of type I error rates and p value distributions using simulated data sets under the null hypothesis, involving three subtypes with unbalanced sample sizes. 10,000 non-SDEGs are simulated, with a portion of housekeeping genes taking the baseline expression levels across all three subtypes. The remaining non-SDEGs are adjusted to exhibit similar upregulation in two subtypes. The sample sizes of the three subtypes are 3, 6, and 9. (a) Bar chart of the mean and 95% confidence interval of type I error rates with a p value cutoff of 0.05 over 150 simulation-based experiments, showing both overall and subtype-specific false-positive rates corresponding to different permutation schemes. The 30 parameter settings, with 5 replicates each, vary the housekeeping gene percentage (95%, 80%, 60%, 40%, or 25%), the prior degrees of freedom ν0 (5 or 40), and σ0 (0.2, 0.5, or 0.8). (b) Histograms of p value distributions for the five SDEG detection methods, where the simulation data consisted of 60% housekeeping genes, σ0 = 0.5 and ν0 = 40. Note that subtype-specific p values can exceed 1.0 after multiple testing correction and are therefore truncated (indicated by the blue circle; see Supplementary Information for details).
Figure 3
Assessment of type I error rates using simulation data sets involving five subtypes. The results are obtained using a p value cutoff of 0.05 over 150 simulation experiments. 10,000 non-SDEGs are simulated with 30 parameter settings and 5 replicates each. (a) Bar chart of the mean and 95% confidence interval of type I error rates with unbalanced sample sizes. A portion of housekeeping genes take the baseline expression levels across all five subtypes. The remaining non-SDEGs are adjusted to exhibit similar upregulation in at least two subtypes. The sample sizes for subtypes S1–S5 are n1 = 3, n2 = 6, n3 = 9, n4 = 12 and n5 = 15, respectively. (b) Bar chart of the mean and 95% confidence interval of type I error rates with unbalanced compositions of the mixture null distribution. Five subtype-specific profiles are derived from the real gene expression data of two cell lines, where two subtypes are associated with one cell line and the other three subtypes with the other cell line, so that the two subtypes are close to each other and the other three are close to each other. The data are generated under the null hypothesis, so no SDEGs exist in any of the five subtypes. The sample size is 3 for each subtype.
Figure 4
Comparative assessment of detection power (partial ROC curves, FPR < 0.01) using simulations produced from real gene expression data, involving seven unbalanced subtypes with various parameter settings. The non-SDEG pattern distribution is consistent with the baseline real dataset under the null hypothesis; variances are sampled from real microarray data (GSE28490) or RNAseq data (GSE60424) while preserving the mean–variance trend. SDEGs are adjusted to exhibit upregulation with fold changes sampled from [5, 20] or [10, 20]. (a) and (b) Partial ROC curves across different FPR points on microarray-derived data. (c) and (d) Partial ROC curves across different FPR points on RNAseq-derived data. OVR-FC and OVR t-test are not shown here due to low pAUC; the performance of the subtype-specific OVE-sFC test is quite similar to that of the OVE-sFC test; more complete ROC curves can be found in Figure S7, and more fold change settings in Figure S8. Both OVE-FC and OVE-sFC perform better than the other methods. OVE-FC achieves the best performance for ideal SDEGs with extremely large fold changes, while OVE-sFC is the preferred choice for SDEGs with small fold changes or low SNR (RNAseq data are noisier than microarray data).
Figure 5
Percentile overlap of cell-type specific SDEGs between Roche and HUG datasets, quantified by Jaccard index (intersection over union). SDEGs are detected by subtype-specific OVE-sFC test with q-value < 0.05.
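The Jaccard index mentioned in this caption is the size of the intersection of two gene sets divided by the size of their union. A minimal sketch follows; the function name `jaccard_index` and the placeholder gene identifiers are assumptions made for illustration and are not taken from the Roche or HUG data.

```python
def jaccard_index(genes_a, genes_b):
    """Hypothetical helper: intersection over union of two gene collections."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets have zero overlap.
        return 0.0
    return len(a & b) / len(union)

# Placeholder identifiers standing in for SDEGs of one cell type in two datasets.
dataset_1 = ["gene1", "gene2", "gene3"]
dataset_2 = ["gene2", "gene3", "gene4"]
overlap = jaccard_index(dataset_1, dataset_2)  # 2 shared genes out of 4 distinct
```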
Figure 6
Correlation coefficients between CAM scores and ground truth proportions in simulated heterogeneous samples of mixed subtype mRNA expression profiles or RNAseq counts (panels a–c are based on three different real gene expression datasets). CAM scores are estimated using the SDEGs detected from an independent dataset and reflect the proportions of subtypes (Supplementary Information). The mean and 95% confidence interval are computed over 20 repeated experiments (OVR t-test results are not shown in (c) due to very poor performance).
Figure 7
Mixture null distribution of OVE-sFC test statistic for detecting SDEGs. The mixture distribution consists of (K-1) null components, each estimated from the resamples after randomly permuting samples in the top K-m subtypes of high mean expressions and weighted by the posterior probabilities of component null hypotheses.
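To make the resampling idea in the Figure 7 caption more concrete, the sketch below estimates a permutation p value for a single gene using one null component only: sample labels are shuffled across all subtypes and the gap between the top two subtype means is recomputed each time. This is a deliberately simplified stand-in, not the OVE-sFC procedure itself. It omits the scaling of the statistic, the restriction of each permutation to a chosen set of top subtypes, and the posterior weighting of the K-1 components. The names `top_gap` and `permutation_p_value`, and the toy numbers, are hypothetical.

```python
import numpy as np

def top_gap(values, labels):
    """Gap between the highest and second-highest subtype means for one gene."""
    means = sorted(values[labels == s].mean() for s in np.unique(labels))
    return means[-1] - means[-2]

def permutation_p_value(values, labels, n_perm=999, seed=0):
    """Simplified single-component permutation test (an assumption, see lead-in).

    Shuffles subtype labels across all samples, so subtype sizes are preserved,
    and counts how often the permuted gap reaches the observed gap.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    observed = top_gap(values, labels)
    exceed = sum(
        top_gap(values, rng.permutation(labels)) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimated p value strictly positive.
    return (1 + exceed) / (1 + n_perm)

# Made-up log-scale values: the gene is clearly high in subtype A only.
values = [10.0, 10.1, 9.9, 10.2, 1.0, 1.1, 0.9, 1.2, 1.0, 0.9, 1.1, 1.0]
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
p = permutation_p_value(values, labels)
```

Shuffling across all subtypes corresponds only to the component in which every subtype shares the same mean; the mixture described in the caption exists precisely because non-SDEGs can also be upregulated in several, but not all, subtypes.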

