BMC Bioinformatics. 2018 Nov 14;19(1):423.
doi: 10.1186/s12859-018-2445-2.

Empirical assessment of the impact of sample number and read depth on RNA-Seq analysis workflow performance

Alyssa Baccarella et al. BMC Bioinformatics.

Abstract

Background: RNA-Sequencing analysis methods are rapidly evolving. One common workflow, differential expression analysis, includes read alignment, expression modeling, and differentially expressed gene identification, and the tool chosen for each step has a dramatic impact on performance characteristics. Although a number of workflows are emerging as high performers that are robust to diverse input types, the relative performance characteristics of these workflows when either read depth or sample number is limited, a common occurrence in real-world practice, remain unexplored.

Results: Here, we evaluate the impact of varying read depth and sample number on the performance of differential gene expression identification workflows, as measured by precision, or the fraction of genes identified as differentially expressed that are correctly identified, and by recall, or the fraction of differentially expressed genes identified. We focus our analysis on 30 high-performing workflows, systematically varying the read depth and number of biological replicates of patient monocyte samples provided as input. We find that, for most workflows, read depth has little effect on workflow performance when held above two million reads per sample, with reduced workflow performance below this threshold. The greatest impact of decreased sample number is seen below seven samples per group, when more heterogeneity in workflow performance is observed. The choice of differential expression identification tool, in particular, has a large impact on the response to limited inputs.
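The two metrics defined above can be illustrated with a minimal sketch. It assumes that a reference set of truly differentially expressed genes is available; the abstract does not say how that reference was constructed, and the function name and gene identifiers below are hypothetical.

```python
def precision_recall(called, truth):
    """Return (precision, recall) for a set of called genes against a reference set.

    called: genes a workflow reported as differentially expressed
    truth:  genes treated as truly differentially expressed (assumed reference)
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Precision: share of the called genes that are in the reference set.
    precision = true_positives / len(called) if called else float("nan")
    # Recall: share of the reference set that the workflow recovered.
    recall = true_positives / len(truth) if truth else float("nan")
    return precision, recall


# Hypothetical gene identifiers, for illustration only.
called_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
reference_genes = {"GENE_A", "GENE_B", "GENE_E"}
precision, recall = precision_recall(called_genes, reference_genes)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the four called genes are in the reference set, and two of the three reference genes are recovered, so a workflow can score differently on the two metrics for the same call set.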

Conclusions: Among the tested workflows, the recall/precision balance remains relatively stable across a range of read depths and sample numbers, although some workflows are more sensitive to input restriction. At ranges typically recommended for biological studies, performance is affected more by the number of biological replicates than by read depth. Caution should be used when selecting analysis workflows and interpreting results from low sample number experiments, as all workflows exhibit poorer performance at lower sample numbers near typically reported values, with variable impact on recall versus precision. These analyses highlight the performance characteristics of common differential gene expression workflows at varying read depths and sample numbers, and provide empirical guidance in experimental and analytical design.

Keywords: Gene expression analysis; Monocytes; RNA-sequencing; Read depth; Sample number.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Analysis workflow steps' impact on performance. Precision and recall for each iteration, separated by read aligner and expression modeler (rows) and differential gene tool (columns). Colors represent sample number and shapes represent read depths
Fig. 2
Read depth's impact on performance. Precision and recall, averaged over the 10 iterations at a given sample number and read depth, split by sample number (columns) and read depth (rows). Values for each workflow (read aligner, expression modeler, and differential expression tool) are averaged and displayed separately. Points represent mean; bars represent standard deviation; colors represent differential expression tool. Red solid line represents linear regression line for plotted data. R² value corresponds to plotted data. Gray dashed line represents linear regression fit of the first row of data for each column, superimposed over subsequent rows for comparison
Fig. 3
Sample number's impact on performance. Precision and recall, averaged over the 10 iterations at a given sample number and read depth, split by read depth (columns) and sample number (rows). Values for each workflow (read aligner, expression modeler, and differential expression tool) are averaged and displayed separately. Points represent mean; bars represent standard deviation; colors represent differential expression tool. Red solid line represents linear regression line for plotted data. R² value corresponds to plotted data. Gray dashed line represents linear regression fit of the first row of data for each column, superimposed over subsequent rows for comparison
Fig. 4
Literature survey of RNA-Seq experiment sample numbers. Violin plots of sample numbers used in 200 studies containing RNA-Seq differential gene expression analysis, either from all species (a) or limited to primary human samples (b). Individual dots represent average sample number used in each study. Grey dashed line represents six samples
Fig. 5
Significant gene number's impact on performance. Average recall (a) or average precision (b) versus the average number of genes identified as significant. Panels are split by read depths, with 2 × 10⁷, 1 × 10⁷, 5 × 10⁶, and 2 × 10⁶ reads plotted as high read depths, 1 × 10⁶, 5 × 10⁵ and 3 × 10⁵ plotted as medium read depths, and 1 × 10⁵, 5 × 10⁵, and 3 × 10⁴ plotted as low read depths. Dots represent values for individual workflows (read aligner, expression modeler, and differential expression tool) at a given sample number and read depth, averaged over the ten sample combination iterations run at each given sample number and read depth. Bars represent standard deviation. Colors represent sample number. Red line represents linear regression for plotted data. R² value corresponds to plotted data
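The captions for Figs. 2, 3 and 5 describe per-workflow means and standard deviations over iterations, plus a linear regression line with its R² value. The sketch below shows one way such summaries can be computed with ordinary least squares. It is not the authors' code: the function names are hypothetical, the numbers are made up for illustration, and regressing precision on recall is an assumption about which quantity sits on which axis.

```python
from statistics import mean, stdev


def summarize(values):
    """Mean and sample standard deviation of one metric across iterations."""
    return mean(values), stdev(values)


def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x, with R^2."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up per-iteration precision values for one hypothetical workflow.
iteration_precision = [0.81, 0.79, 0.84, 0.80, 0.82]
avg, sd = summarize(iteration_precision)

# Made-up per-workflow averages (recall on x, precision on y; axis choice assumed).
mean_recall = [0.40, 0.50, 0.55, 0.65, 0.70]
mean_precision = [0.90, 0.86, 0.85, 0.80, 0.78]
slope, intercept, r_squared = linear_fit(mean_recall, mean_precision)
print(f"mean={avg:.3f} sd={sd:.3f} slope={slope:.3f} R^2={r_squared:.3f}")
```

An R² close to 1 would indicate that the plotted workflow averages fall near a single line; lower values reflect the heterogeneity between workflows that the abstract reports at low sample numbers.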


