Comparative Study
PLoS One. 2017 May 1;12(5):e0176185.
doi: 10.1371/journal.pone.0176185. eCollection 2017.

A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data

Xiaohong Li et al. PLoS One. 2017.

Abstract

Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, choosing an optimal method remains a challenge because multiple factors contribute to read count variability, which affects overall sensitivity and specificity. To determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines across different dataset characteristics. We therefore evaluated the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med, UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic Curve (AUC), a specificity rate > 85%, a detection power > 92% and an actual false discovery rate (FDR) under 0.06 given the nominal FDR (≤0.05). Although the most commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they trade this off against a reduced specificity (<70%) and a slightly higher actual FDR than our proposed methods. In addition, an analysis of the qualitative characteristics of the sample distributions for the MAQC2 and human breast cancer datasets shows that only our gene-wise normalization methods corrected data skewed towards lower read counts. However, when we evaluated MAQC3, which has five replicates and less variation, all methods performed similarly. Thus, our proposed Med-pgQ2 and UQ-pgQ2 methods perform slightly better for differential gene analysis of RNA-seq data skewed towards lowly expressed read counts with high variation, improving specificity while maintaining good detection power and controlling the nominal FDR level.
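
The abstract describes the proposed methods only as per-gene normalization applied after per-sample median or upper-quartile global scaling, so the following is a rough sketch of that two-step idea rather than the authors' implementation. The function names, the exclusion of zero counts from the per-sample quantile, the rescaling of factors to a mean of 1, and the pseudocount of 0.1 are all assumptions made for illustration.

```python
import numpy as np


def global_scale(counts, stat="uq"):
    """Per-sample global scaling of a genes x samples count matrix.

    stat="uq" uses the upper quartile, stat="med" the median, of each
    sample's nonzero counts (excluding zeros is an assumption here).
    """
    counts = np.asarray(counts, dtype=float)
    q = 75 if stat == "uq" else 50
    factors = np.array([
        np.percentile(col[col > 0], q) for col in counts.T
    ])
    # Assumption: rescale factors to mean 1 so scaled counts stay on
    # roughly the same scale as the raw counts.
    factors = factors / factors.mean()
    return counts / factors


def per_gene_q2(scaled, pseudo=0.1):
    """Per-gene step: divide each gene by its median (Q2) across samples.

    The pseudocount is hypothetical; it only avoids division by zero.
    """
    shifted = np.asarray(scaled, dtype=float) + pseudo
    gene_median = np.median(shifted, axis=1, keepdims=True)
    return shifted / gene_median


# Toy matrix: sample 2 is sample 1 sequenced at twice the depth.
raw = np.array([[10, 20], [100, 200], [5, 10], [50, 100]])
scaled = global_scale(raw, stat="uq")   # depth difference removed
normalized = per_gene_q2(scaled)        # every gene centred on 1
```

In this toy case the global scaling makes the two samples identical, and the per-gene step then puts genes with very different expression levels on a common scale, which is the sense in which comparisons are made "based on similar count levels".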

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. Comparison of nine normalization methods.
(A) Boxplots of log2(counts+1) for MAQC data with two replicates in two conditions (uhr and hbr). Samples in the hbr and uhr conditions are shown in green and red, respectively. Med-pgQ2 and UQ-pgQ2 are our proposed methods. (B) Boxplots of the intra-condition coefficient of variation for the uhr and hbr conditions.
Fig 2. RMSD (root-mean-square deviation) between the log2 expression fold changes of MAQC2 and qRT-PCR.
Illustrated is the RMSD between the log2 fold changes computed from DEGs under each normalization method and the values computed from qRT-PCR. FPKM (yellow) shows the least similarity to qRT-PCR, while DESeq normalization (brown) shows the greatest.
Fig 3. ROC curve and AUC values from MAQC2 data.
The ROC curves and AUC values (inset) for evaluating the performance of the nine normalization methods were computed using MAQC2 with two conditions (uhr and hbr). Our proposed methods, Med-pgQ2 and UQ-pgQ2 (blue and grey, respectively), performed slightly better.
Fig 4. ROC curve and AUC values from the simulated data at a fold-change of 1.5 and 2.
Illustrated are the ROC curves for detecting 1,500 DEGs (750 up-regulated and 750 down-regulated) using a fold change = 1.5 (A) and a fold change = 2 (B) with an unequal library size. Calculated AUC values are in the inset. The simulated data, containing a total of 15,000 genes in two conditions with 10 replicates per condition, were used for evaluating the performance of eight normalization methods. Our methods (UQ-pgQ2 and Med-pgQ2) are in cyan and blue, respectively.
Fig 5. ROC curve and AUC values from the simulated data with 4 and 6 replicates in each condition.
Illustrated are the ROC curves and AUC values (inset) used to analyze the impact of biological replicates on the performance of normalization methods. We used the simulated data with four biological replicates (A) and six biological replicates (B), which contain 1,500 DEGs with a 2-fold change between the two conditions. Our methods (UQ-pgQ2 and Med-pgQ2) are in cyan and blue, respectively.
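
The abstract and figure captions report several summary statistics (RMSD, AUC, specificity, detection power, actual FDR) without defining them. The sketch below shows one conventional way to compute each; it is not taken from the paper. The function names are hypothetical, the scores passed to `auc` are assumed to be larger for genes with stronger evidence of differential expression (for example 1 minus a p-value), and both truly DE and non-DE genes are assumed to be present.

```python
import numpy as np


def rmsd(log2fc_a, log2fc_b):
    """Root-mean-square deviation between two vectors of log2 fold changes."""
    a = np.asarray(log2fc_a, dtype=float)
    b = np.asarray(log2fc_b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


def auc(scores, is_de):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a truly DE gene scores higher than a
    non-DE gene, with ties counted as one half.
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(is_de, dtype=bool)
    diff = scores[truth][:, None] - scores[~truth][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def detection_rates(called, is_de):
    """Power, specificity and actual FDR from DE calls and the truth."""
    called = np.asarray(called, dtype=bool)
    truth = np.asarray(is_de, dtype=bool)
    tp = int(np.sum(called & truth))
    fp = int(np.sum(called & ~truth))
    tn = int(np.sum(~called & ~truth))
    fn = int(np.sum(~called & truth))
    return {
        "power": tp / (tp + fn),              # sensitivity
        "specificity": tn / (tn + fp),
        "actual_fdr": fp / max(tp + fp, 1),   # 0 when nothing is called
    }
```

The "actual FDR" here is the fraction of called genes that are not truly differentially expressed, which is the quantity compared against the nominal FDR threshold when truth is known, as with simulated data or a validated benchmark.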
