Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2006 Feb 16:7:75.
doi: 10.1186/1471-2105-7-75.

Detecting overlapping coding sequences in virus genomes

Affiliations

Detecting overlapping coding sequences in virus genomes

Andrew E Firth et al. BMC Bioinformatics. .

Abstract

Background: Detecting new coding sequences (CDSs) in viral genomes can be difficult for several reasons. The typically compact genomes often contain a number of overlapping coding and non-coding functional elements, which can result in unusual patterns of codon usage; conservation between related sequences can be difficult to interpret--especially within overlapping genes; and viruses often employ non-canonical translational mechanisms--e.g. frameshifting, stop codon read-through, leaky-scanning and internal ribosome entry sites--which can conceal potentially coding open reading frames (ORFs).

Results: In a previous paper we introduced a new statistic--MLOGD (Maximum Likelihood Overlapping Gene Detector)--for detecting and analysing overlapping CDSs. Here we present (a) an improved MLOGD statistic, (b) a greatly extended suite of software using MLOGD, (c) a database of results for 640 virus sequence alignments, and (d) a web-interface to the software and database. Tests show that, from an alignment with just 20 mutations, MLOGD can discriminate non-overlapping CDSs from non-coding ORFs with a typical accuracy of up to 98%, and can detect CDSs overlapping known CDSs with a typical accuracy of 90%. In addition, the software produces a variety of statistics and graphics, useful for analysing an input multiple sequence alignment.

Conclusion: MLOGD is an easy-to-use tool for virus genome annotation, detecting new CDSs--in particular overlapping or short CDSs--and for analysing overlapping CDSs following frameshift sites. The software, web-server, database and supplementary material are available at http://guinevere.otago.ac.nz/mlogd.html.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Nucleotide-by-nucleotide plot. Example output nucleotide-by-nucleotide plot for the 'Test input query CDSs' option. Luteovirus, six sequences [GenBank:NC_002160, GenBank:NC_003056, GenBank:NC_003369, GenBank:NC_003680, GenBank:NC_004666, GenBank:NC_004750], with NC_002160 as the reference sequence. NC_002160 has six annotated CDSs. CDS3 was used as the query CDS and the remaining five CDSs were taken as the known CDSs. The first panel displays the raw log(LR) statistics at each alignment position. There is a separate track for each reference – non-reference sequence pair (labelled at the right, together with the pairwise divergences). Gaps, and stop codons in each of the null and alternate model CDSs, for each sequence, are marked on the appropriate tracks. The second panel displays the ∑tree log (LR) statistic at each alignment position. The third and fourth panels display sliding window means of the statistics in the first and second panels, respectively. The fifth panel shows the locations of the null and alternate model CDSs. The sixth panel shows the summed mean sequence divergence (mutations per nt) for the sequence pairs that contribute to the ∑tree log (LR) statistic at each alignment position. This is a measure of the information available at each alignment position (e.g. partially gapped regions have lower summed mean sequence divergence). (See website for more details.) The predominantly positive values in the fourth panel show that CDS3 is functionally constrained over the majority of its length.
Figure 2
Figure 2
Six-frame sliding window plot. Example output plot for the 'Six-frame sliding window plots' option (same sequences as in Figure 1). This is a plot of the ∑tree log (LR) statistic calculated in a sliding window along the alignment in each of the six possible read-frames. In each window, the null model is that 'only the known CDS(s) are coding' while the alternate model is that 'both the window and the known CDS(s) are coding'. Panel 1 shows the positions of alignment gaps in each of the input sequences (labelled at right), while panel 2 shows the positions of stop codons in each of the six possible read-frames in each of the input sequences. Panel 3 shows the ∑tree log (LR) statistic in each window in the +0 frame (relative to reference sequence nt 1). The width of each window is indicated by horizontal grey lines (if the reference sequence contains alignment gaps within the window, then the window will appear enlarged in alignment coordinates). The horizontal dashed line is at zero. Panel 4 shows the positions of stop codons in the +0 frame in all the input sequences (same order as in panel 1). Panels 5, 7, 9, 11 and 13 show the same information as panel 3, but for the +1, +2, -0, -1 and -2 frames, respectively. Similarly, panels 6, 8, 10, 12 and 14 show the same information as panel 4, but for the +1, +2, -0, -1 and -2 frames, respectively. Panel 15 shows the known CDSs (here none were entered). Panel 16 shows the summed mean sequence divergence (mutations per nt) at each alignment position (see caption to Figure 1). (See website for more details.) Extended regions of positive signal in panels 3, 5, 7, 9, 11 and 13 indicate potential CDSs (i.e. other than those identified in the null model). In this particular plot, no known CDS(s) were entered, i.e. the null model is that the whole genome is non-coding. Hence the actual Luteovirus CDSs have clear positive signals. Note that several of the reverse read-frames show a false positive signal when they are in the -2 frame relative to a forward read-frame CDS (see website for details).

Similar articles

Cited by

References

    1. Stormo GD. Gene-finding approaches for eukaryotes. Genome Res. 2000;10:394–397. doi: 10.1101/gr.10.4.394. - DOI - PubMed
    1. Badger JH, Olsen GJ. CRITICA: Coding Region Identification Tool Invoking Comparative Analysis. Mol Biol Evol. 1999;16:512–524. - PubMed
    1. Majoros WH, Pertea M, Salzberg SL. Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics. 2005;21:1782–1788. doi: 10.1093/bioinformatics/bti297. - DOI - PubMed
    1. Firth AE, Brown CM. Detecting overlapping coding sequences with pairwise alignments. Bioinformatics. 2005;21:282–292. doi: 10.1093/bioinformatics/bti007. - DOI - PubMed
    1. Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6. 2004. http://evolution.genetics.washington.edu/phylip.html

Publication types

LinkOut - more resources