Detecting overlapping coding sequences in virus genomes

BMC Bioinformatics. 2006 Feb 16;7:75. doi: 10.1186/1471-2105-7-75.

Authors

Andrew E Firth¹, Chris M Brown

Affiliation

¹ Department of Biochemistry, University of Otago, PO Box 56, Dunedin, New Zealand. aef@sanger.otago.ac.nz

PMID: 16483358
PMCID: PMC1395342
DOI: 10.1186/1471-2105-7-75

Abstract

Background: Detecting new coding sequences (CDSs) in viral genomes can be difficult for several reasons. The typically compact genomes often contain a number of overlapping coding and non-coding functional elements, which can result in unusual patterns of codon usage. Conservation between related sequences can be difficult to interpret, especially within overlapping genes. Finally, viruses often employ non-canonical translational mechanisms (e.g. frameshifting, stop codon read-through, leaky scanning and internal ribosome entry sites), which can conceal potentially coding open reading frames (ORFs).

Results: In a previous paper we introduced a new statistic, MLOGD (Maximum Likelihood Overlapping Gene Detector), for detecting and analysing overlapping CDSs. Here we present (a) an improved MLOGD statistic, (b) a greatly extended suite of software using MLOGD, (c) a database of results for 640 virus sequence alignments, and (d) a web interface to the software and database. Tests show that, from an alignment with just 20 mutations, MLOGD can discriminate non-overlapping CDSs from non-coding ORFs with a typical accuracy of up to 98%, and can detect CDSs overlapping known CDSs with a typical accuracy of 90%. In addition, the software produces a variety of statistics and graphics that are useful for analysing an input multiple sequence alignment.

Conclusion: MLOGD is an easy-to-use tool for virus genome annotation, for detecting new CDSs (in particular overlapping or short CDSs), and for analysing overlapping CDSs following frameshift sites. The software, web server, database and supplementary material are available at http://guinevere.otago.ac.nz/mlogd.html.

Figures

**Figure 1**
**Nucleotide-by-nucleotide plot**. Example output nucleotide-by-nucleotide plot for the 'Test input query CDSs' option. Luteovirus, six sequences [GenBank:NC_002160, GenBank:NC_003056, GenBank:NC_003369, GenBank:NC_003680, GenBank:NC_004666, GenBank:NC_004750], with NC_002160 as the reference sequence. NC_002160 has six annotated CDSs. CDS3 was used as the query CDS and the remaining five CDSs were taken as the known CDSs. The first panel displays the raw log(LR) statistics at each alignment position. There is a separate track for each reference – non-reference sequence pair (labelled at the right, together with the pairwise divergences). Gaps, and stop codons in each of the null and alternate model CDSs, for each sequence, are marked on the appropriate tracks. The second panel displays the ∑_tree log(LR) statistic at each alignment position. The third and fourth panels display sliding window means of the statistics in the first and second panels, respectively. The fifth panel shows the locations of the null and alternate model CDSs. The sixth panel shows the summed mean sequence divergence (mutations per nt) for the sequence pairs that contribute to the ∑_tree log(LR) statistic at each alignment position. This is a measure of the information available at each alignment position (e.g. partially gapped regions have lower summed mean sequence divergence). (See website for more details.) The predominantly positive values in the fourth panel show that CDS3 is functionally constrained over the majority of its length.
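
The sliding window means shown in the third and fourth panels of Figure 1 can be illustrated with a short sketch. This is not the MLOGD code: the function name `sliding_window_mean`, the centred odd-width window, and the choice to leave edge positions undefined are all assumptions made for illustration, and the input is simply a list of per-position values such as log(LR) scores.

```python
from typing import List, Optional


def sliding_window_mean(values: List[float], window: int) -> List[Optional[float]]:
    """Centred sliding-window mean of per-position statistics.

    Hypothetical helper, not part of MLOGD. Positions that do not have a
    full window around them are returned as None.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(values)
    out: List[Optional[float]] = [None] * n
    if n < window:
        return out
    # Sum of the first full window, centred on position `half`.
    running = sum(values[:window])
    out[half] = running / window
    # Slide one position at a time: add the entering value, drop the leaving one.
    for centre in range(half + 1, n - half):
        running += values[centre + half] - values[centre - half - 1]
        out[centre] = running / window
    return out


if __name__ == "__main__":
    # Toy per-position scores (made up for illustration only).
    toy_scores = [1, 2, 3, 4, 5]
    print(sliding_window_mean(toy_scores, 3))
```

Smoothing in this way makes an extended run of mostly positive values easier to see than in the raw per-position track, which is how the caption reads the fourth panel.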

**Figure 2**
**Six-frame sliding window plot**. Example output plot for the 'Six-frame sliding window plots' option (same sequences as in Figure 1). This is a plot of the ∑_tree log(LR) statistic calculated in a sliding window along the alignment in each of the six possible read-frames. In each window, the null model is that 'only the known CDS(s) are coding' while the alternate model is that 'both the window and the known CDS(s) are coding'. Panel 1 shows the positions of alignment gaps in each of the input sequences (labelled at right), while panel 2 shows the positions of stop codons in each of the six possible read-frames in each of the input sequences. Panel 3 shows the ∑_tree log(LR) statistic in each window in the +0 frame (relative to reference sequence nt 1). The width of each window is indicated by horizontal grey lines (if the reference sequence contains alignment gaps within the window, then the window will appear enlarged in alignment coordinates). The horizontal dashed line is at zero. Panel 4 shows the positions of stop codons in the +0 frame in all the input sequences (same order as in panel 1). Panels 5, 7, 9, 11 and 13 show the same information as panel 3, but for the +1, +2, -0, -1 and -2 frames, respectively. Similarly, panels 6, 8, 10, 12 and 14 show the same information as panel 4, but for the +1, +2, -0, -1 and -2 frames, respectively. Panel 15 shows the known CDSs (here none were entered). Panel 16 shows the summed mean sequence divergence (mutations per nt) at each alignment position (see caption to Figure 1). (See website for more details.) Extended regions of positive signal in panels 3, 5, 7, 9, 11 and 13 indicate potential CDSs (i.e. other than those identified in the null model). In this particular plot, no known CDS(s) were entered, i.e. the null model is that the whole genome is non-coding. Hence the actual Luteovirus CDSs have clear positive signals. Note that several of the reverse read-frames show a false positive signal when they are in the -2 frame relative to a forward read-frame CDS (see website for details).
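
The stop codon tracks described for Figure 2 (one per read-frame) can be sketched as follows. This is an illustrative sketch, not the MLOGD software. It assumes an ungapped DNA sequence over A, C, G, T and the standard genetic code stop codons. The frame labels follow the caption (+0, +1, +2, -0, -1, -2), but the convention that frame -k starts at offset k of the reverse complement is an assumption, and `stop_codon_positions` is a hypothetical name.

```python
from typing import Dict, List

# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_codon_positions(seq: str) -> Dict[str, List[int]]:
    """Map each of the six read-frames to the stop codons found in it.

    Positions are 0-based start coordinates on the forward strand.
    Assumed convention: frame +k starts at offset k of the sequence and
    frame -k starts at offset k of its reverse complement.
    """
    seq = seq.upper()
    n = len(seq)
    rc = reverse_complement(seq)
    frames: Dict[str, List[int]] = {}
    for offset in range(3):
        starts = range(offset, n - 2, 3)
        frames[f"+{offset}"] = [i for i in starts if seq[i:i + 3] in STOP_CODONS]
        # A codon at rc[i:i+3] covers forward positions n-3-i .. n-1-i.
        frames[f"-{offset}"] = sorted(
            n - 3 - i for i in starts if rc[i:i + 3] in STOP_CODONS
        )
    return frames


if __name__ == "__main__":
    # Toy sequence for illustration only.
    for frame, positions in stop_codon_positions("ATGTAACCC").items():
        print(frame, positions)
```

A long stretch without stop codons in one frame across all aligned sequences is what makes an ORF a candidate CDS in the first place, which is why the plot pairs each statistic panel with a stop codon panel for the same frame.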

Cited by

Evolution of viral proteins originated de novo by overprinting.
Sabath N, Wagner A, Karlin D. Mol Biol Evol. 2012 Dec;29(12):3767-80. doi: 10.1093/molbev/mss179. Epub 2012 Jul 19. PMID: 22821011. Free PMC article.
Mapping overlapping functional elements embedded within the protein-coding regions of RNA viruses.
Firth AE. Nucleic Acids Res. 2014 Nov 10;42(20):12425-39. doi: 10.1093/nar/gku981. Epub 2014 Oct 17. PMID: 25326325. Free PMC article.
On programmed ribosomal frameshifting: the alternative proteomes.
Ketteler R. Front Genet. 2012 Nov 19;3:242. doi: 10.3389/fgene.2012.00242. eCollection 2012. PMID: 23181069. Free PMC article.
Orientation-dependent toxic effect of human papillomavirus type 33 long control region DNA in Escherichia coli cells.
Gyöngyösi E, Szalmás A, Kónya J, Veress G. Virus Genes. 2020 Jun;56(3):298-305. doi: 10.1007/s11262-020-01754-4. Epub 2020 Apr 3. PMID: 32246353. Free PMC article.
Stimulation of stop codon readthrough: frequent presence of an extended 3' RNA structural element.
Firth AE, Wills NM, Gesteland RF, Atkins JF. Nucleic Acids Res. 2011 Aug;39(15):6679-91. doi: 10.1093/nar/gkr224. Epub 2011 Apr 27. PMID: 21525127. Free PMC article.

References

1. Stormo GD. Gene-finding approaches for eukaryotes. Genome Res. 2000;10:394–397. doi: 10.1101/gr.10.4.394.
2. Badger JH, Olsen GJ. CRITICA: Coding Region Identification Tool Invoking Comparative Analysis. Mol Biol Evol. 1999;16:512–524.
3. Majoros WH, Pertea M, Salzberg SL. Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics. 2005;21:1782–1788. doi: 10.1093/bioinformatics/bti297.
4. Firth AE, Brown CM. Detecting overlapping coding sequences with pairwise alignments. Bioinformatics. 2005;21:282–292. doi: 10.1093/bioinformatics/bti007.
5. Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6. 2004. http://evolution.genetics.washington.edu/phylip.html
