Am J Hum Genet. 2024 Jul 11;111(7):1282-1300.
doi: 10.1016/j.ajhg.2024.05.005. Epub 2024 Jun 3.

Impact of genome build on RNA-seq interpretation and diagnostics

Rachel A Ungar et al. Am J Hum Genet.

Abstract

Transcriptomics is a powerful tool for unraveling the molecular effects of genetic variants and for disease diagnosis. Prior studies have demonstrated that the choice of genome build impacts variant interpretation and diagnostic yield for genomic analyses. To identify the extent to which genome build also impacts transcriptomics analyses, we studied the effect of the hg19, hg38, and CHM13 genome builds on expression quantification and outlier detection in 386 rare disease and familial control samples from both the Undiagnosed Diseases Network and the Genomics Research to Elucidate the Genetics of Rare Disease Consortium. Across six routinely collected biospecimens, 61% of quantified genes were not influenced by genome build. However, we identified 1,492 genes with build-dependent quantification, 3,377 genes with build-exclusive expression, and 9,077 genes with annotation-specific expression, including 566 clinically relevant and 512 known OMIM genes. Further, we demonstrate that, for a given gene, a larger difference in quantification between builds is well correlated with a larger change in expression outlier calling. Combined, we provide a database of genes impacted by build choice and recommend that transcriptomics-guided analyses and diagnoses be cross-referenced with these data for robustness.

Keywords: RNA-seq; genome build; rare disease.

Conflict of interest statement

Declaration of interests: During this project, R.A.U. was employed for an internship by Vertex Pharmaceuticals. P.C.G. is a consultant for BioMarin. S.B.M. is an advisor to BioMarin, MyOme, and Tenaya Therapeutics.

Figures

Figure 1
Study overview. (A) Description of the cohort, including the primary diagnosis types for all probands (below) and the number of samples assayed per tissue type (right). (B) Overview of the methodology. (C) Bar charts displaying the total number of build-dependent events identified in the hg19 vs. hg38 and hg38 vs. chm13 comparisons. The total number of genes in each group is shown above each bar chart. The proportion of genes that are linked to disease in public databases is highlighted in a darker color, and that number is indicated in parentheses with an asterisk.
Figure 2
Annotation comparison identifies annotation-specific genes with detected expression. (A) Sankey diagram summarizing the number of genes that differ between the hg38 GENCODEv35 annotation and GENCODEv35lift37 for hg19 (left) and the UCSC GENCODEv35 CAT/Liftoff v2 annotation for chm13 (right). (B) Definition of annotation-specific expression events. (C) Stacked bar plots indicating how many annotation-specific genes with detected expression in at least one tissue overlap known issues in the corresponding reference genomes. The left-hand facet shows hg19-specific and hg38-specific genes from the hg19:hg38 comparison; the right-hand facet shows the hg38-specific and chm13-specific genes from the hg38:chm13 comparison. Sets containing at least 25 genes are labeled with the fraction of total expressed genes they represent within the given build and comparison. Colors represent the presence (grays) or absence (light blue) of documented exclusion regions or issues. (D) Sankey diagram illustrating how the same RNA-seq reads are aligned to the SIK1/SIK1B locus in hg38 (left) and chm13 (right) across all samples. Percentages are based on the total number of reads aligned to SIK1 or SIK1B in either build.
Figure 3
Hundreds of genes are significantly and substantially differentially quantified between builds. (A and D) The number of genes that were significantly differentially quantified by build (adjusted p value < 0.05 and abs(logFC) > 1) between (A) hg19 and hg38 and (D) hg38 and chm13 across tissue types. Gray bars on the far right display the union of differentially quantified genes across all tissues. The inset in (A) provides a visual definition of differentially quantified events, in which a gene is annotated and sufficiently quantified in both alignments but the expression estimates differ. (B and E) Distribution of logFC values for significant genes (adjusted p value < 0.05) across tissues for (B) hg19 compared to hg38 and (E) hg38 compared to chm13. (C and F) Upset plots displaying the putative reasons underlying differences in gene expression estimates between (C) hg19 and hg38 and (F) hg38 and chm13.
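The selection rule stated in the Figure 3 caption (adjusted p value < 0.05 and abs(logFC) > 1) can be illustrated with a minimal Python sketch. The `GeneResult` record, its field names, and the example gene labels and values are hypothetical and used only for illustration; this is not the authors' pipeline, only the thresholding step as the caption describes it.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    # Hypothetical record for one gene in one build-vs-build comparison.
    gene: str
    log_fc: float  # log fold change between the two builds
    adj_p: float   # multiple-testing-adjusted p value


def differentially_quantified(results, p_cutoff=0.05, lfc_cutoff=1.0):
    """Return genes passing both caption thresholds: adj_p < 0.05 and |logFC| > 1."""
    return [
        r.gene
        for r in results
        if r.adj_p < p_cutoff and abs(r.log_fc) > lfc_cutoff
    ]


# Made-up example values, not data from the study.
results = [
    GeneResult("GENE_A", 2.3, 0.001),   # passes both thresholds
    GeneResult("GENE_B", 0.4, 0.0001),  # significant, but small fold change
    GeneResult("GENE_C", -1.8, 0.2),    # large fold change, not significant
    GeneResult("GENE_D", -1.5, 0.01),   # passes both thresholds (negative logFC)
]
selected = differentially_quantified(results)
print(selected)
```

Taking the absolute value of the log fold change keeps genes that are quantified either higher or lower in one build relative to the other.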
Figure 4
Hundreds of mutually annotated genes show substantial build-exclusive expression. (A) Depiction of build-exclusive expression. (B) Distribution of median TPM levels of build-exclusive genes on a log scale for the hg19:hg38 comparison (left) and the hg38:chm13 comparison (right). The number of build-exclusive genes detected is labeled underneath.
Figure 5
Impact of build selection on expression and splicing outlier detection. (A) Boxplots displaying the number of overexpression, underexpression, and splicing outlier genes per sample detected from data aligned to hg19 (red), hg38 (yellow), and CHM13 (blue). (B) Expression outlier consistency between hg19:hg38 (left) and hg38:chm13 (right). Outliers that are consistent between hg19 and hg38 are shown in orange, and outliers that are consistent between hg38 and chm13 are shown in dark green. Lighter shades show the number of outliers with a Z score greater than 3 in chm13 but less than 3 in hg38 (or greater than 3 in hg38 but less than 3 in hg19), and so forth. The lightest shades show genes that are outliers in the reference build (e.g., chm13) but not in the comparison build (e.g., hg38) because they were not quantified in that build. The plot is faceted by tissue type and by expression vs. splicing outliers. (C) Comparison between differential quantification fold change and average absolute Z score change. Each data point is a gene; the x axis represents the absolute log fold change in the differential quantification results, and the y axis represents the average change in Z score between builds for that gene. This is plotted for hg19:hg38 (left) and hg38:chm13 (right), and the color of each point indicates the tissue. The two quantities are significantly correlated for all groups (p < 2.2e-16): the hg19:hg38 R² is 0.63 and 0.55 for blood and fibroblast, respectively, and the hg38:chm13 R² is 0.58 and 0.64 for blood and fibroblast, respectively.
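The outlier consistency categories described for Figure 5B can be sketched as a small classifier. The caption gives a Z score threshold of 3; applying that threshold to the absolute Z score, the category labels, and the use of `None` for a gene that was not quantified in a build are assumptions made for this illustration and are not taken from the authors' code.

```python
from typing import Optional

Z_THRESHOLD = 3.0  # threshold named in the Figure 5 caption


def classify_outlier(z_ref: Optional[float],
                     z_cmp: Optional[float],
                     threshold: float = Z_THRESHOLD) -> str:
    """Classify one gene-sample pair given Z scores from two builds.

    z_ref is the Z score in the reference build, z_cmp the Z score in the
    comparison build; None means the gene was not quantified in that build.
    Using abs() so that under- and overexpression are treated alike is an
    assumption of this sketch.
    """
    if z_ref is None or abs(z_ref) <= threshold:
        return "not_outlier_in_reference"
    if z_cmp is None:
        return "not_quantified_in_comparison"
    if abs(z_cmp) > threshold:
        return "consistent"
    return "below_threshold_in_comparison"


# Made-up Z scores for illustration only.
print(classify_outlier(4.2, 3.5))    # outlier in both builds
print(classify_outlier(-3.6, -2.1))  # outlier only in the reference build
print(classify_outlier(5.0, None))   # gene missing from the comparison build
```

This sketch does not check that the direction of the outlier (over- vs. underexpression) agrees between builds; a stricter definition of consistency could add that condition.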
Figure 6
Build selection impacts transcriptome-guided gene prioritization. (A) Comparison of the ranked Z scores for genes in the top-20 expression outlier lists from both the hg19 and hg38 alignments across all affected individuals (Pearson correlation R² = 0.97). (B) The distribution of Z score ranks for genes that were in an affected individual’s top-20 list in only one build for the hg19:hg38 comparison. (C) Ranked Z scores for genes in the top-20 expression outlier lists from both the hg38 and chm13 alignments across all affected individuals (Pearson correlation for Z score ranks in both top-20 lists R² = 0.78). (D) The distribution of Z score ranks for genes that were in an affected individual’s top-20 list in only one build for the hg38:chm13 comparison. (E) Diagnostic gene outlier ranks among the top 250 phenotype-prioritized genes across 44 samples from 36 individuals with rare disease with underexpression, based on hg19 alignment (x axis) and hg38 alignment (y axis). (F) Diagnostic gene outlier ranks among the top 250 phenotype-prioritized genes across 44 samples from 36 individuals with rare disease with underexpression, based on hg38 alignment (x axis) and chm13 alignment (y axis). The top-five gene-sample pairs with the most extreme residuals are highlighted.
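The top-20 list comparison in Figure 6 (genes shared between two builds' lists versus genes present in only one) can be sketched as follows. Ranking by absolute Z score is an assumption of this sketch, the function names are hypothetical, and the example uses a list size of 2 with made-up values so the result is easy to follow by hand.

```python
def top_n_outliers(z_by_gene, n=20):
    """Return the n genes with the largest absolute Z score, most extreme first.

    Ranking by absolute Z score is an assumption made for illustration.
    """
    ranked = sorted(z_by_gene, key=lambda g: abs(z_by_gene[g]), reverse=True)
    return ranked[:n]


def compare_top_lists(z_build_a, z_build_b, n=20):
    """Split two builds' top-n lists into shared and build-specific genes."""
    top_a = top_n_outliers(z_build_a, n)
    top_b = top_n_outliers(z_build_b, n)
    in_a, in_b = set(top_a), set(top_b)
    shared = [g for g in top_a if g in in_b]
    only_a = [g for g in top_a if g not in in_b]
    only_b = [g for g in top_b if g not in in_a]
    return shared, only_a, only_b


# Made-up Z scores for one individual under two alignments.
z_hg19 = {"G1": 5.1, "G2": -4.0, "G3": 3.2, "G4": 1.0}
z_hg38 = {"G1": 4.9, "G2": -1.2, "G3": 3.5, "G4": 3.9}

shared, only_hg19, only_hg38 = compare_top_lists(z_hg19, z_hg38, n=2)
print(shared, only_hg19, only_hg38)
```

In this toy case, G2 drops out of the list and G4 enters it when the build changes, which is the kind of build-dependent change in prioritization that panels (B) and (D) summarize.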
