Sci Rep. 2021 Jul 27;11(1):15269.
doi: 10.1038/s41598-021-94742-z.

Text mining of gene-phenotype associations reveals new phenotypic profiles of autism-associated genes

Sijie Li et al. Sci Rep.

Abstract

Autism is a spectrum disorder with wide variation in the type and severity of symptoms. Understanding gene-phenotype associations is vital to unravel the disease mechanisms and advance its diagnosis and treatment. To date, several databases have stored a large portion of gene-phenotype associations, which are mainly obtained from genetic experiments. However, a large proportion of gene-phenotype associations remain buried in the autism-related literature, and there are limited resources for investigating them. Given the abundance of this literature, we were motivated to develop Autism_genepheno, a text mining pipeline that uses natural language processing methods to identify sentence-level mentions of autism-associated genes and phenotypes in the literature. We have generated a comprehensive database of gene-phenotype associations from the last five years of autism-related literature, which can be easily updated as new literature becomes available. We have evaluated our pipeline through several different approaches, and we are able to rank and select top autism-associated genes through their unique and wide spectrum of phenotypic profiles, which could provide a unique resource for the diagnosis and treatment of autism. The data resources and the Autism_genepheno pipeline are available at https://github.com/maiziezhoulab/Autism_genepheno.
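The idea of a sentence-level co-mention can be illustrated with a minimal Python sketch. This is not the Autism_genepheno implementation: the gene set, the phenotype lexicon, the function names, and the example text are all hypothetical, and simple dictionary matching stands in for the natural language processing and phenotype standardization that the actual pipeline performs.

```python
import re
from collections import Counter

# Hypothetical toy lexicons (assumption): the real pipeline standardizes
# phenotypes to UMLS/HPO concepts rather than matching raw strings.
GENES = {"SHANK3", "CHD8", "SCN2A"}
PHENOTYPES = {"seizures", "intellectual disability", "macrocephaly"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_co_mentions(text, genes=GENES, phenotypes=PHENOTYPES):
    """Count (gene, phenotype) pairs that occur in the same sentence."""
    pairs = Counter()
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # Gene symbols are matched case-sensitively as whole words.
        found_genes = {
            g for g in genes if re.search(rf"\b{re.escape(g)}\b", sentence)
        }
        # Phenotype terms are matched case-insensitively as whole phrases.
        found_phenos = {
            p for p in phenotypes if re.search(rf"\b{re.escape(p)}\b", lowered)
        }
        for gene in found_genes:
            for pheno in found_phenos:
                pairs[(gene, pheno)] += 1
    return pairs


# Invented illustrative text, not findings from the paper.
example = (
    "Variants in SHANK3 were linked to intellectual disability and seizures. "
    "CHD8 carriers often show macrocephaly. "
    "SCN2A was sequenced in all probands."
)
co_mentions = find_co_mentions(example)
for (gene, pheno), count in sorted(co_mentions.items()):
    print(gene, pheno, count)
```

In this sketch a gene mentioned without any phenotype in the same sentence (SCN2A above) contributes no pair, which is the sense in which the associations are sentence-level rather than document-level.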

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Overview of the Autism_genepheno pipeline. Each standardized phenotype includes its unique UMLS concept ID, its preferred name in UMLS, its vocabulary source, and its corresponding HPO ID if one exists.
Figure 2
Top mentioned autism-associated genes correspond to the classes of SFARI genes. (A) The percentage distribution of different classes of SFARI genes and “NA” genes in our data resource (left panel). The percentage distribution of standardized phenotypes associated with different classes of SFARI genes and with “NA” genes in our data resource (right panel). (B) Top 30 mentioned autism-associated genes. (C) Top 30 mentioned autism-associated standardized phenotypes. The standardized phenotype and top-level phenotypic category are separated by “|” for each term. A top-level phenotypic category of “NA” means the standardized phenotype is not included in HPO.
Figure 3
Gene–phenotype evaluation rate through HPO for all classes of SFARI genes and for “NA” genes, which are not included in the SFARI database but are in the VariCarta database.
Figure 4
Autism-associated gene classification through top-level phenotypic categories and Kmeans clustering. (A) The t-SNE plot of all autism-associated genes labeled by top-level phenotypic categories. (B) The t-SNE plot of all autism-associated genes labeled by Kmeans clustering results. (C) Distribution of top-level phenotypic categories for the top, central and bottom gene clusters marked with black circles in A. (D) The GO analysis for cluster 1 (red cluster) and cluster 13 (blue cluster) from Kmeans clustering.
Figure 5
The unique phenotypic profiles of four classes of SFARI genes. (A) The spatial distribution of SFARI genes on the t-SNE plot. (B) The proportion of each class of SFARI genes that belongs to each top-level phenotypic category. Red: SFARI Class 1, Blue: SFARI Class 2, Green: SFARI Class 3, and Yellow: SFARI Class S. (C) The genetic interaction network graph of the top 10% of SFARI genes with the highest betweenness centrality scores.
