Key Points
- Gene prioritization aims to integrate complex, heterogeneous data to identify the most promising genes for biological validation among a set of candidates. Its goal is to help biological researchers who face mountains of public and private omics data to maximize the yield of downstream biological validation.
- Prioritization methods leverage prior knowledge of the phenotype or biological process of interest, in the form either of keywords describing the phenotype or of sets of genes previously associated with the phenotype or process. They then either profile data from candidates against this prior knowledge or diffuse this knowledge across a biological network to identify the most closely associated candidates; methods also exist for the case in which little or no prior knowledge is available.
- Gene prioritization has contributed to the discovery of many disease-causing genes. A high rank for a candidate gene in a prioritization for a phenotype is now accepted as contributing evidence that mutations in this gene cause the phenotype.
- Numerous prioritization tools are publicly available, often via the Web, and they can easily be used by biologists without specific bioinformatics expertise. Although no tool performs best in all situations, together the tools cover most experimental situations in which gene prioritization is useful.
- Computational validation of prioritization results (using procedures such as cross-validation, appropriate negative controls and functional enrichment) is essential to ensure the effectiveness of the prioritization. More complex prioritization strategies are available to increase the effectiveness of prioritization methods further.
- Although prioritization methods are now firmly established, many refinements that improve their performance and their usability by biologists can be expected. Moreover, prioritization of variants identified by next-generation sequencing is emerging as a major need for the biological community; data integration can have an important role here, and new prioritization strategies are needed.
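The cross-validation mentioned in the key points can be illustrated with a small leave-one-out sketch. It is not the procedure of any specific tool: the toy network, the gene names and the neighbour-fraction score are all hypothetical, chosen only to show the idea of hiding one known gene at a time and checking how highly it is ranked among the candidates.

```python
# Hypothetical toy network: gene -> set of interaction partners (undirected).
NETWORK = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D", "F"},
    "F": {"E", "G"},
    "G": {"F"},
}


def score(gene, seeds, network):
    """Illustrative guilt-by-association score: fraction of neighbours that are seeds."""
    neighbours = network.get(gene, set())
    if not neighbours:
        return 0.0
    return len(neighbours & seeds) / len(neighbours)


def rank_of(target, candidates, seeds, network):
    """1-based rank of `target` among `candidates`; ties count against the target."""
    target_score = score(target, seeds, network)
    at_least_as_good = sum(
        1
        for gene in candidates
        if gene != target and score(gene, seeds, network) >= target_score
    )
    return at_least_as_good + 1


def leave_one_out(known_genes, network):
    """Hide each known gene in turn and rank it against the unlabelled genes."""
    known = set(known_genes)
    background = set(network) - known
    ranks = {}
    for held_out in sorted(known):
        seeds = known - {held_out}           # prior knowledge without the hidden gene
        candidates = background | {held_out}  # hidden gene mixed with the background
        ranks[held_out] = rank_of(held_out, candidates, seeds, network)
    return ranks


# A tightly connected gene set: every held-out gene is recovered at rank 1.
ranks = leave_one_out({"A", "B", "C"}, NETWORK)
print(ranks)
```

Running the same function on a gene set with no shared neighbourhood, for example `leave_one_out({"A", "E", "G"}, NETWORK)`, acts as a simple negative control: the held-out genes should no longer be ranked near the top.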
Abstract
At different stages of any research project, molecular biologists need to choose — often somewhat arbitrarily, even after careful statistical data analysis — which genes or proteins to investigate further experimentally and which to leave out because of limited resources. Computational methods that integrate complex, heterogeneous data sets — such as expression data, sequence information, functional annotation and the biomedical literature — allow prioritizing genes for future study in a more informed way. Such methods can substantially increase the yield of downstream studies and are becoming invaluable to researchers.
Acknowledgements
This work was supported in part by the following grants: KUL PFV/10/016 SymBioSys, KUL GOA MaNet, Hercules III PacBio RS and FP7-HEALTH CHeartED.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary information S1 (table)
This document represents a tutorial about gene prioritization methods. (XLS 523 kb)
Glossary
- Homozygosity mapping: A form of recombination mapping that allows the localization of rare recessive traits by identifying unusually long stretches of homozygosity at consecutive markers.
- Guilt by association: A statistical rule of thumb that asserts that reliable predictions about the function or disease involvement ('guilt') of a gene or protein can generally be made if several of its partners (for example, genes with correlated expression profiles or protein–protein interaction partners) share a corresponding 'guilty' status ('association').
- Machine learning methods: The design and development of algorithms that allow computers to learn automatically to recognize complex patterns in data and to make intelligent decisions on the basis of such data.
- Principal components analysis: A statistical method that is used to simplify a complex data set by transforming a series of correlated variables into a smaller number of uncorrelated variables called principal components.
- Interologue: A protein–protein interaction that is conserved between orthologous proteins in different species.
- Random walk: A mathematical formalization of the path resulting from taking successive random steps. Classical examples of random walks are Brownian motion, the fortune of a gambler flipping a coin or fluctuations of the stock market. In the context of graphs, a random walk typically describes a process in which a 'walker' moves from one node of the graph to another with a probability proportional to the weight of the edge connecting them.
- Diffusion kernel: A type of kernel similarity matrix that is derived from the notion of a random walk on a graph. Diffusion kernels measure similarity between nodes of a graph (in this case, between genes), for example, by estimating the average length of a random walk from one node to the other.
- Locus heterogeneity: The appearance of phenotypically similar characteristics that result from mutations at different genetic loci. Differences in effect size or in replication between studies and samples are often ascribed to different loci leading to the same disease.
- Multiple testing: A statistical problem that arises from carrying out multiple hypothesis tests together. P values obtained from hypothesis tests under the assumption of a single test must be appropriately corrected to reflect multiple testing.
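The random walk entry above can be made concrete with a short sketch of a random walk with restart, a related network-diffusion technique (it is not the diffusion kernel itself). The walker moves to a neighbour with probability proportional to the edge weight and, at each step, jumps back to a seed gene with a fixed probability. The path-shaped toy network, the gene names and the restart probability of 0.3 are assumptions made purely for illustration.

```python
def random_walk_with_restart(network, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of every node.

    network: dict mapping node -> dict of neighbour -> positive edge weight.
    seeds:   non-empty set of nodes already associated with the phenotype.
    restart: probability of jumping back to a seed at each step (assumed value).
    """
    if not seeds or not set(seeds) <= set(network):
        raise ValueError("seeds must be a non-empty subset of the network nodes")
    nodes = sorted(network)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # Restart step: a share of the probability mass returns to the seeds.
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            total = sum(network[n].values())
            if total == 0:
                # Isolated node: the walker stays in place.
                nxt[n] += (1 - restart) * prob[n]
                continue
            # Walk step: move to each neighbour in proportion to the edge weight.
            for m, w in network[n].items():
                nxt[m] += (1 - restart) * prob[n] * w / total
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


# Hypothetical path-shaped network A - B - C - D with unit edge weights.
PATH = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}

scores = random_walk_with_restart(PATH, {"A"})
# Rank the non-seed genes by their steady-state probability, highest first.
ranking = sorted((g for g in scores if g != "A"), key=scores.get, reverse=True)
print(ranking)  # ['B', 'C', 'D']
```

In this toy case the candidates are ranked by their distance from the seed, which is the guilt-by-association intuition expressed as a diffusion process; on a real network, edge weights and node degrees also shape the ranking.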
Cite this article
Moreau, Y., Tranchevent, LC. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet 13, 523–536 (2012). https://doi.org/10.1038/nrg3253
DOI: https://doi.org/10.1038/nrg3253