BMC Bioinformatics. 2013;14 Suppl 11(Suppl 11):S8.
doi: 10.1186/1471-2105-14-S11-S8. Epub 2013 Nov 4.

Assessing the impact of human genome annotation choice on RNA-seq expression estimates

Po-Yen Wu et al. BMC Bioinformatics. 2013.

Abstract

Background: Genome annotation is a crucial component of RNA-seq data analysis. Much effort has been devoted to producing an accurate and rational annotation of the human genome. An annotated genome provides a comprehensive catalogue of genomic functional elements. Currently, at least six human genome annotations are publicly available, including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes. Characteristics of these annotations differ because of variations in annotation strategies and information sources. When performing RNA-seq data analysis, researchers need to choose a genome annotation. However, the effect of genome annotation choice on downstream RNA-seq expression estimates is still unclear. This study (1) investigates the effect of different genome annotations on RNA-seq quantification and (2) provides guidelines for choosing a genome annotation based on research focus.

Results: We define the complexity of human genome annotations in terms of the number of genes, isoforms, and exons. This definition facilitates an investigation of potential relationships between complexity and variations in RNA-seq quantification. We apply several evaluation metrics to demonstrate the impact of genome annotation choice on RNA-seq expression estimates. In the mapping stage, the least complex genome annotation, RefSeq Genes, appears to have the highest percentage of uniquely mapped short sequence reads. In the quantification stage, RefSeq Genes results in the most stable expression estimates in terms of the average coefficient of variation over all genes. Stable expression estimates in the quantification stage translate to accurate statistics for detecting differentially expressed genes. We observe that RefSeq Genes produces the most accurate fold-change measures with respect to a ground truth of RT-qPCR gene expression estimates.

Conclusions: Based on the observed variations in the mapping, quantification, and differential expression calling stages, we demonstrate that the selection of human genome annotation results in different gene expression estimates. When conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation may be preferred. However, simpler genome annotations may limit opportunities for identifying or characterizing novel transcriptional or regulatory mechanisms. When conducting research that aims to be more exploratory, a more complex genome annotation may be preferred.
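
The stability metric named in the Results, the average coefficient of variation over all genes, can be illustrated with a minimal sketch. It assumes that the coefficient of variation is the sample standard deviation divided by the mean of each gene's expression estimates across repeated measurements; the abstract does not spell out the exact definition. The function names and the example values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return sample standard deviation divided by mean, or None if undefined."""
    if len(values) < 2:
        return None  # a spread cannot be estimated from a single value
    m = mean(values)
    if m == 0:
        return None  # CV is undefined for a zero mean
    return stdev(values) / m


def average_cv(expression):
    """Average the per-gene CV over all genes for which it is defined."""
    cvs = []
    for values in expression.values():
        cv = coefficient_of_variation(values)
        if cv is not None:
            cvs.append(cv)
    return mean(cvs) if cvs else float("nan")


# Hypothetical expression estimates (for example TPM) for two genes.
example = {
    "GENE_A": [10.0, 10.0, 10.0],
    "GENE_B": [1.0, 3.0],
}
print(average_cv(example))
```

Under this reading, a lower average value indicates more stable expression estimates for a given annotation.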

Figures

Figure 1
Annotated percentage per chromosome of each genome annotation. For each genome annotation, the annotated percentage of each chromosome is shown at (a) the gene level, (b) the exon level, and (c) the coding sequence level. The AceView annotation usually has the highest annotated percentage for all chromosomes and all levels of comparison.
Figure 2
Distribution of read mapping categories. (a) MAQC samples (SRA: SRP000727) contain single-end reads, so there are three read mapping categories: uniquely mapped reads, non-uniquely mapped reads, and unmapped reads. (b) Thrombin study samples (SRA: SRP008482) contain paired-end reads, so five read mapping categories are possible. Uniquely paired reads and non-uniquely paired reads occur when both ends of a read pair are mappable to the genome. Uniquely mapped singletons and non-uniquely mapped singletons occur when only one end of a read pair is mappable to the genome. The RefSeq annotation has the highest percentage of uniquely mapped reads and the lowest percentage of non-uniquely mapped reads for both samples.
Figure 3
The percentage of reads or read pairs that map to the annotated and un-annotated genomic sequences. Sub-figures (a)-(d) represent different combinations of samples (top: MAQC samples with SRA accession number SRP000727; bottom: thrombin study samples with SRA accession number SRP008482) and spliced mappers (left: OSA; right: TopHat). The UCSC annotation usually has the lowest percentage of reads that mapped to the annotated genomic sequences, while the AceView annotation usually has the highest percentage. The same observation is applicable to all four combinations of samples and spliced mappers.
Figure 4
The average coefficient of variation varies with different annotations and gene or isoform subgroups. Sub-figures (a)-(d) represent different combinations of samples (top: MAQC samples with SRA accession number SRP000727; bottom: thrombin study samples with SRA accession number SRP008482) and expression estimates (left: TPM estimates from OSA package; right: FPKM estimates from TopHat alignment with Cufflinks quantification). The RefSeq annotation always has the smallest average coefficient of variation, while the AceView annotation has the highest average coefficient of variation for most of the cases. The variation is small when focusing on only common genes.
Figure 5
The percentage of present genomic elements varies with different annotations and gene or isoform subgroups. Sub-figures (a)-(d) represent different combinations of samples (top: MAQC samples with SRA accession number SRP000727; bottom: thrombin study samples with SRA accession number SRP008482) and expression estimates (left: TPM estimates from OSA package; right: FPKM estimates from TopHat alignment with Cufflinks quantification). The RefSeq annotation usually has the highest percentage of present genomic elements, while the Ensembl or Vega annotation generally has the lowest percentage of present genomic elements. The variation is small when focusing on only common genes.
Figure 6
Statistics for comparing RNA-seq-based fold-changes and RT-qPCR-based fold-changes vary with different annotations. The comparison of fold-change estimates between RNA-seq and RT-qPCR using two RNA-seq expression estimates and three statistics. (a) TPM estimates are produced by the OSA package. (b) FPKM estimates are generated by Cufflinks with TopHat alignment. The RefSeq annotation always has the lowest (1) average absolute deviation and (2) root mean square error, and the highest correlation coefficient when treating RT-qPCR estimates as the ground truth.
Figure 7
Workflow of RNA-seq data analysis pipeline. The five dashed boxes correspond to five steps in the RNA-seq data analysis pipeline. We applied two alignment tools and two quantification tools to estimate gene/isoform expression with normalization methods of count, TPM, or FPKM. The fold-change method and the edgeR tool are used to infer differentially expressed genes. At each analysis step, we assess the variations resulting from genome annotation choice.
Figure 8
Workflow of a typical RNA-seq spliced alignment pipeline. A spliced mapper aligns RNA-seq data to a genome with the help of a genome annotation. Different genome annotations define different sets of exon junction information, which affect the output of spliced mappers. The mapped reads can then be used to quantify gene/isoform expression.
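
The three statistics named in the Figure 6 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes both inputs are log2 fold-changes keyed by gene, that "average absolute deviation" means the mean absolute difference between RNA-seq and RT-qPCR fold-changes, and that the correlation coefficient is Pearson's. The function name and the example values are hypothetical.

```python
import math


def compare_fold_changes(rnaseq_fc, qpcr_fc):
    """Compare RNA-seq fold-changes against RT-qPCR fold-changes (treated as truth).

    Both arguments map gene name -> log2 fold-change. Only genes present
    in both inputs are compared.
    """
    genes = sorted(set(rnaseq_fc) & set(qpcr_fc))
    n = len(genes)
    if n < 2:
        raise ValueError("need at least two shared genes")
    x = [rnaseq_fc[g] for g in genes]
    y = [qpcr_fc[g] for g in genes]

    # Per-gene difference between the RNA-seq estimate and the reference.
    diffs = [a - b for a, b in zip(x, y)]
    aad = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)

    # Pearson correlation coefficient (assumed; the caption does not name the type).
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    r = cov / (sx * sy) if sx > 0 and sy > 0 else float("nan")

    return {"aad": aad, "rmse": rmse, "pearson_r": r}


# Hypothetical log2 fold-changes for three genes.
rnaseq = {"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": 3.0}
qpcr = {"GENE_A": 1.5, "GENE_B": 2.0, "GENE_C": 3.5}
print(compare_fold_changes(rnaseq, qpcr))
```

With RT-qPCR as the reference, lower deviation and error values and a higher correlation indicate closer agreement, which is the sense in which the caption ranks the annotations.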
