Nucleic Acids Res. 2003 Aug 1;31(15):4639-45.
doi: 10.1093/nar/gkg483.

Computational identification of protein coding potential of conserved sequence tags through cross-species evolutionary analysis

Flavio Mignone et al. Nucleic Acids Res. 2003.

Abstract

The identification of conserved sequence tags (CSTs) through comparative genome analysis may reveal important regulatory elements involved in shaping the spatio-temporal expression of genetic information. It is well known that the largest fraction of CSTs observed in human-mouse comparisons corresponds to protein coding exons, owing to their strong evolutionary constraints. Because the complete gene inventory of the human and mouse genomes is still unknown, it is of the utmost importance to establish whether detected conserved sequences are genes. We propose here a simple algorithm that, based on the observation of the specific evolutionary dynamics of coding sequences, efficiently discriminates between coding and non-coding CSTs. The application of this method may help the validation of predicted genes, the prediction of alternative splicing patterns in known and unknown genes, and the definition of a dictionary of non-coding regulatory elements.
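The abstract does not spell out how the algorithm scores a CST, so the following is only a toy illustration of the general idea that coding sequences evolve in a characteristic way: in the true reading frame, differences between two aligned sequences tend to be synonymous (they leave the amino acid unchanged). It is not the CSTfinder method. The function names, the gap-free alignment, and the restriction to the three forward frames are all assumptions made for this sketch.

```python
# Toy sketch only: NOT the CSTfinder scoring scheme, which the abstract does not define.
# Assumes two gap-free, equal-length, upper-case DNA strings that are already aligned.

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def frame_substitution_counts(seq_a, seq_b, frame):
    """Count (synonymous, nonsynonymous) codon differences in one forward frame."""
    syn = nonsyn = 0
    for i in range(frame, len(seq_a) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue  # identical codons carry no signal here
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip codons with ambiguous bases
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


def synonymous_fraction(seq_a, seq_b, frame):
    """Fraction of differing codons that are synonymous (0.0 if none differ)."""
    syn, nonsyn = frame_substitution_counts(seq_a, seq_b, frame)
    total = syn + nonsyn
    return syn / total if total else 0.0


def best_frame(seq_a, seq_b):
    """Forward frame (0, 1 or 2) with the highest synonymous fraction."""
    return max(range(3), key=lambda f: synonymous_fraction(seq_a, seq_b, f))


# Hypothetical example sequences, both encoding M-A-K-G-L in frame 0.
human = "ATGGCTAAAGGTCTG"
mouse = "ATGGCCAAGGGACTT"
print(best_frame(human, mouse))
```

In this made-up pair every difference is silent in frame 0 and amino-acid-changing in the other two frames, which is the kind of frame-specific pattern a coding region would be expected to show and a non-coding region would not.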

Figures

Figure 1
Distribution of CPSs from the CSTfinder analysis of the RANDOM set.
Figure 2
Results obtained from the CSTfinder analysis of the GENOME set showing the percentage of CSTs falling in different gene annotation categories.
Figure 3
Detailed results of the CSTfinder analysis of five human–mouse homologous gene loci in the GENOME dataset belonging to different EnsEMBL gene classes. (A) Known gene; (B) novel gene; (C) EST gene; (D and E) GenScan predicted gene. Upper boxes represent identified CSTs (green, CPS > 500; red, 30 < CPS ≤ 500; black, CPS ≤ 30), with lower boxes corresponding to known or predicted exons. For each gene the EnsEMBL ID, the chromosome position and the coordinates (NCBI 30 release) are reported. The arrow highlights a CSTfinder predicted coding sequence missed by GenScan but coincident with a TwinScan predicted exon.
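The colour bands in the Figure 3 caption can be written as a small lookup. This is a minimal sketch that assumes the middle band is meant as 30 < CPS ≤ 500, the only reading under which the three bands do not overlap; the function name `cps_color` is hypothetical and is not part of CSTfinder.

```python
def cps_color(cps):
    """Map a CPS value to the colour band used in the Figure 3 caption.

    Assumed bands: green for CPS > 500, red for 30 < CPS <= 500,
    black for CPS <= 30.
    """
    if cps > 500:
        return "green"
    if cps > 30:
        return "red"
    return "black"


# Hypothetical CPS values chosen to sit on and around the band edges.
for value in (650, 500, 31, 30):
    print(value, cps_color(value))
```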
