Choice of library size normalization and statistical methods for differential gene expression analysis in balanced two-group comparisons for RNA-seq studies

doi:10.1186/s12864-020-6502-7

BMC Genomics. 2020 Jan 28;21(1):75.

Xiaohong Li¹, Nigel G F Cooper², Timothy E O'Toole³, Eric C Rouchka⁴

Affiliations

¹ Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, USA. x0li0013@louisville.edu.
² Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, USA.
³ Envirome Institute, University of Louisville, Louisville, KY, USA.
⁴ Department of Computer Science and Engineering, University of Louisville, Louisville, KY, USA.

PMID: 31992223
PMCID: PMC6986029
DOI: 10.1186/s12864-020-6502-7

Xiaohong Li et al. BMC Genomics. 2020.

Abstract

Background: High-throughput RNA sequencing (RNA-seq) has become an important analytical tool in molecular biology. Although the utility and importance of this technique have grown, uncertainties regarding the proper analysis of RNA-seq data remain. Of primary concern, there is no consensus on which normalization and statistical methods are most appropriate for analyzing these data. The lack of standardized analytical methods leads to uncertainties in data interpretation and study reproducibility, especially for studies reporting high false discovery rates. In this study, we compared a recently developed normalization method, UQ-pgQ2, with three of the most frequently used alternatives, RLE (relative log expression), TMM (trimmed mean of M-values) and UQ (upper quartile normalization), in the analysis of RNA-seq data. We evaluated the performance of these methods for gene-level differential expression analysis with respect to three factors: 1) normalization combined with the choice of a Wald test from DESeq2 or an exact test/QL (quasi-likelihood) F-test from edgeR; 2) sample sizes in two balanced two-group comparisons; and 3) sequencing read depths.
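To make the idea of library size normalization concrete, the following is a minimal sketch of two of the scaling approaches named above: upper quartile scaling and a median-of-ratios factor in the spirit of RLE. It is not the edgeR or DESeq2 implementation, and it does not cover TMM or UQ-pgQ2. The function names and the genes-by-samples NumPy layout are assumptions made for illustration.

```python
import numpy as np

def upper_quartile_factors(counts):
    """Per-sample factors from the 75th percentile of counts (UQ-style sketch).

    `counts` is a genes x samples array (an assumed layout). Genes with zero
    counts in every sample are dropped before taking the percentile. Assumes
    each sample's upper quartile is nonzero.
    """
    counts = np.asarray(counts, dtype=float)
    expressed = counts.sum(axis=1) > 0
    uq = np.percentile(counts[expressed], 75, axis=0)
    # Rescale so the factors have geometric mean 1 across samples.
    return uq / np.exp(np.mean(np.log(uq)))

def median_of_ratios_factors(counts):
    """Per-sample factors as the median ratio to a per-gene geometric mean
    (RLE-style sketch), using only genes with nonzero counts in all samples."""
    counts = np.asarray(counts, dtype=float)
    positive = (counts > 0).all(axis=1)
    log_counts = np.log(counts[positive])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy example: the second sample was sequenced twice as deeply as the first.
toy = np.array([[10, 20], [20, 40], [30, 60], [40, 80]])
uq_f = upper_quartile_factors(toy)
rle_f = median_of_ratios_factors(toy)
normalized = toy / rle_f  # dividing by the factors puts samples on one scale
```

In this toy case both approaches assign the second sample a factor twice that of the first, so the normalized columns coincide.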

Results: Using the MAQC RNA-seq datasets with small sample replicates, we found that UQ-pgQ2 normalization combined with an exact test can achieve better performance in terms of power and specificity in differential gene expression analysis. However, using an intra-group analysis of false positives from real and simulated data, we found that a Wald test performs better than an exact test when the number of sample replicates is large and that a QL F-test performs the best given sample sizes of 5, 10 and 15 for any normalization. The RLE, TMM and UQ methods performed similarly given a desired sample size.

Conclusion: We found that the UQ-pgQ2 method combined with an exact test/QL F-test is the best choice for controlling false positives when the sample size is small. When the sample size is large, UQ-pgQ2 with a QL F-test is a better choice for type I error control in an intra-group analysis. We observed that read depth has minimal impact on differential gene expression analysis based on the simulated data.
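The intra-group analysis referred to above can be illustrated with a small sketch: samples from a single condition are split at random into two pseudo-groups, each gene is tested, and every significant call is counted as a false positive because no true difference exists. The sketch below makes several assumptions that are not taken from the study: a Welch t-test on log2 counts-per-million stands in for the exact, QL F and Wald tests, an unadjusted p-value threshold `alpha` defines a call, and the function name and simulated counts are hypothetical.

```python
import numpy as np
from scipy import stats

def intra_group_fp_fraction(counts, n_per_group, alpha=0.05, seed=0):
    """Fraction of genes called significant when both groups are drawn
    from ONE condition, so every call is a false positive.

    `counts` is a genes x samples array from a single condition.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[1]
    if 2 * n_per_group > n_samples:
        raise ValueError("need at least 2 * n_per_group samples")
    # Draw 2n distinct samples; the first n form pseudo-group A, the rest B.
    picked = rng.choice(n_samples, size=2 * n_per_group, replace=False)
    sub = counts[:, picked]
    # log2 counts-per-million with a small pseudocount (total-count scaling).
    lib = sub.sum(axis=0)
    logcpm = np.log2((sub + 0.5) / (lib + 1.0) * 1e6)
    a, b = logcpm[:, :n_per_group], logcpm[:, n_per_group:]
    # Stand-in test: Welch t-test per gene (NOT the tests used in the study).
    _, p = stats.ttest_ind(a, b, axis=1, equal_var=False)
    p = p[~np.isnan(p)]  # drop genes with undefined statistics
    return float(np.mean(p < alpha))

# Simulated null data: 2000 genes x 40 samples of negative binomial counts.
sim_rng = np.random.default_rng(1)
null_counts = sim_rng.negative_binomial(5, 0.05, size=(2000, 40))
fp_n5 = intra_group_fp_fraction(null_counts, n_per_group=5)
fp_n15 = intra_group_fp_fraction(null_counts, n_per_group=15)
```

Repeating the call over several seeds and sample sizes gives a curve of false positive fraction against sample size, which is the kind of quantity plotted in the figures below.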

Keywords: Differentially expressed genes; Normalization; RNA-seq; Sample sizes; Statistical test.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
False positive rates estimated via intra-group analysis of AdLC, OC and TNBC data. Illustrated are the fractions of FPs estimated from the RLE (pink), TMM (green), UQ (blue) and UQ-pgQ2 (purple) normalization using the exact test, QL F-test or Wald test for sample sizes of 5, 10, 15, 20, 25, 30, 35 and 40. The plots are based on AdLC data (a-c), OC data (d-f) and TNBC data (g-i)

**Fig. 2**
False positive rates estimated via intra-group analysis of 379 OC samples with different read depths. Illustrated are the fractions of FPs estimated from the RLE, TMM, UQ and UQ-pgQ2 normalization with the exact test, QL F-test or Wald test for sample sizes of 5, 10, 15, 20, 25, 30, 35 and 40. The plots are based on a read depth from 19 to 157 million (a-c), from 12.8 to 104 million (d-f), and from 9.6 to 78.6 million (g-i)

**Fig. 3**
False positive rates estimated from three test statistics via intra-group analysis of OC samples. Illustrated are the fractions of FPs estimated from the exact test/QL F-test and Wald test combined with the RLE, TMM and UQ-pgQ2 normalization methods, based on the read depths from 19 to 157 million (a-c), from 12.8 to 104 million (d-f) and from 9.6 to 78.6 million (g-i)

**Fig. 4**
False positive rates estimated from simulated data with variable read depths given fixed sample sizes. Illustrated are the fractions of FPs estimated from the RLE, TMM, UQ and UQ-pgQ2 normalization methods using the exact test, QL F-test or Wald test for sample sizes of 5, 10, 15, 20, 25, 30, 35 and 40. Plots are based on 122 simulated data with the mean read depth of 30 million and SD of 3 (a-c) or 5 million (d-f); a mean read depth of 40 million and SD of 3 (g-i) or 5 million (j-l); a mean read depth of 50 million reads and SD of 3 (m-o) or 5 million (p-r)

**Fig. 5**
DEGs identified from the four normalization methods for a balanced two-group comparison. Illustrated is the number of DEGs identified in 117 TNBC and 112 normal control samples (a, c, e), and 535 AdLC and 59 normal control samples (b, d, f). Analysis was performed using the exact test (a, b), QL F-test (c, d) and Wald test (e, f)

Cited by

Integrated analysis reveals the regulatory mechanism of the neddylation inhibitor MLN4924 on the metabolic dysregulation in rabbit granulosa cells.
Chen M, Liu Y, Zuo M, Zhang M, Wang Z, Li X, Yuan D, Xu H, Yu G, Li M. BMC Genomics. 2024 Mar 6;25(1):254. doi: 10.1186/s12864-024-10118-3. PMID: 38448814.
Genome scale CRISPR screens identify actin capping proteins as key modulators of therapeutic responses to radiation and immunotherapy.
Verma N, Renauer PA, Dong C, Xin S, Lin Q, Zhang F, Glazer PM, Chen S. bioRxiv [Preprint]. 2024 Jan 15:2024.01.14.575614. doi: 10.1101/2024.01.14.575614. PMID: 38293095.
WASF3 overexpression affects the expression of circular RNA hsa-circ-0100153, which promotes breast cancer progression by sponging hsa-miR-31, hsa-miR-767-3p, and hsa-miR-935.
Mokhtari M, Khoshbakht S, Esmaeil Akbari M, Sayyed Sajjad M. Heliyon. 2023 Nov 29;9(12):e22874. doi: 10.1016/j.heliyon.2023.e22874. PMID: 38125536.
Validation of potential RNA biomarkers for prostate cancer diagnosis and monitoring in plasma and urinary extracellular vesicles.
Brokāne A, Bajo-Santos C, Zayakin P, Belovs A, Jansons J, Lietuvietis V, Martens-Uzunova ES, Jenster GW, Linē A. Front Mol Biosci. 2023 Nov 30;10:1279854. doi: 10.3389/fmolb.2023.1279854. PMID: 38099195.
Unveiling metabolic integration in psyllids and their nutritional endosymbionts through comparative transcriptomics analysis.
Kwak Y, Hansen AK. iScience. 2023 Sep 15;26(10):107930. doi: 10.1016/j.isci.2023.107930. PMID: 37810228.

Grants and funding

P20GM103436/Foundation for the National Institutes of Health
