Sci Rep. 2015 May 27:5:10576.
doi: 10.1038/srep10576.

A statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data

Qiongshi Lu et al. Sci Rep. 2015.

Abstract

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome, either through computational predictions, such as genomic conservation, or through high-throughput experiments, such as the ENCODE project. These efforts have produced a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning on 22 computational and experimental annotations, thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu.

Figures

Figure 1
Modeling of the causal relationships among variables. (a) We adopt the biologically meaningful definition of function, and treat conservation measures and biochemical signals as consequences. (b) The latent functional indicator Z is modeled as the parental variable and all 22 annotations are treated as consequences. We also assume that there is no direct causal relationship between any two annotations; the annotations are therefore conditionally independent given Z. (c) Workflow of GenoCanyon functional prediction.
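The conditional-independence assumption in panel (b) means the posterior probability of Z factorizes over the annotations. The following is a minimal sketch of that calculation, not the GenoCanyon implementation: the function name, the per-annotation likelihood values, and the prior are all hypothetical inputs chosen for illustration, and the source does not specify how the likelihoods are parameterized or estimated.

```python
import math


def posterior_functional(lik_functional, lik_nonfunctional, prior_functional):
    """Return P(Z = 1 | annotations) assuming the annotations are
    conditionally independent given the latent indicator Z.

    lik_functional[i]    : P(annotation_i | Z = 1)  (hypothetical values)
    lik_nonfunctional[i] : P(annotation_i | Z = 0)  (hypothetical values)
    prior_functional     : P(Z = 1)                 (hypothetical value)
    """
    # Work in log space so that a product over many annotations
    # (22 in the paper) does not underflow.
    log_f = math.log(prior_functional) + sum(math.log(p) for p in lik_functional)
    log_n = math.log(1.0 - prior_functional) + sum(
        math.log(p) for p in lik_nonfunctional
    )
    # Subtract the larger log term before exponentiating for stability.
    m = max(log_f, log_n)
    num = math.exp(log_f - m)
    return num / (num + math.exp(log_n - m))


# Two made-up annotations that are each more likely under Z = 1.
score = posterior_functional([0.8, 0.6], [0.2, 0.4], prior_functional=0.5)
```

With no annotations supplied, the function simply returns the prior, which is a quick sanity check on the factorization.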
Figure 2
Functional prediction for the HBB gene complex. (a) Histogram of the prediction scores in chromosome 11, the HBB gene complex, and the 23 CRMs; 32.2%, 62.2% and 97.0% of positions are predicted as functional, respectively. (b) Prediction results for the HBB complex. Dark blue bars show the prediction score at each location. All 23 CRMs are marked in red. There appear to be fewer than 23 red bars because some of the CRMs are very close to each other. Red dots indicate the locations of known pathogenic SNPs downloaded from the NCBI Variation Viewer. (c) The posterior probabilities given a single group of annotations can be used to measure the relative contribution of different sources of information (see Methods). Four CRMs are plotted to illustrate that prediction scores are driven by different annotations in different CRMs. (d) Prediction results for the HBB gene and its promoter. The promoter, UTRs, introns and exons are marked with different colors. Red dots show the prediction scores of the pathogenic variants. (e) Prediction results for the HBB promoter. Known protein binding sites in the HBB promoter are marked in blue. Red dots show the prediction scores of the pathogenic variants. (f) Boxplot of the prediction scores of the HBB promoter, known protein binding sites, and pathogenic variants.
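One way to read panel (c) is that the posterior is recomputed using only the annotations in one group at a time. The sketch below illustrates that idea under the conditional-independence assumption; it is not taken from the paper's Methods. The group names, index assignments, likelihood values, and prior are hypothetical.

```python
import math


def posterior(lik_functional, lik_nonfunctional, prior_functional):
    """P(Z = 1 | given annotations) under conditional independence."""
    num = prior_functional * math.prod(lik_functional)
    den = num + (1.0 - prior_functional) * math.prod(lik_nonfunctional)
    return num / den


def group_posteriors(lik_functional, lik_nonfunctional, groups, prior_functional):
    """Posterior of Z = 1 computed separately from each annotation group.

    groups maps a (hypothetical) group name to the indices of the
    annotations that belong to it.
    """
    return {
        name: posterior(
            [lik_functional[i] for i in idx],
            [lik_nonfunctional[i] for i in idx],
            prior_functional,
        )
        for name, idx in groups.items()
    }


# Made-up likelihoods for three annotations at one genomic position.
lik_f = [0.9, 0.5, 0.5]
lik_n = [0.1, 0.5, 0.5]
groups = {"conservation": [0], "histone": [1, 2]}  # hypothetical grouping
by_group = group_posteriors(lik_f, lik_n, groups, prior_functional=0.5)
```

In this toy input the first group moves the posterior well above the prior while the second leaves it unchanged, which is the kind of contrast between information sources that the panel describes.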
Figure 3
Prediction results for the SHH enhancer in LMBR1. (a) Prediction scores in the LMBR1 gene. The fifth intron and ZRS are highlighted in light blue and red, respectively. (b) Boxplot of the prediction scores in LMBR1, its 16 introns, its 17 exons, the fifth intron, and ZRS. The results highlight the function in the fifth intron of LMBR1 and confirm the importance of ZRS. (c) Prediction results for the region surrounding ZRS, which is highlighted in pink. A clear high-scoring plateau can be observed at ZRS. (d) Prediction results within ZRS. Thirteen pathogenic variants have been discovered in ZRS. The predicted scores at their locations are marked with red dots. There appear to be only 11 dots because three variants all reside at location 156584166 (hg19).
Figure 4
Prediction results for regions involved in human X-inactivation. Each dark blue line shows the prediction score at a single base. (a) Functional prediction for the human XIC. All the RefSeq transcripts in this region are plotted. The master lncRNA XIST is highlighted in red. Red dots show the locations of known pathogenic variants downloaded from the NCBI Variation Viewer. (b) Functional prediction for the intergenic region between AMOT and HTR2C on chromosome Xq23. The red and blue arrows represent the recently discovered transcripts XACT and T113.3, respectively.
