Genome Res. 2004 Apr;14(4):708-15.
doi: 10.1101/gr.1933104.

Aligning multiple genomic sequences with the threaded blockset aligner

Mathieu Blanchette et al. Genome Res. 2004 Apr.

Abstract

We define a "threaded blockset," which is a novel generalization of the classic notion of a multiple alignment. A new computer program called TBA (for "threaded blockset aligner") builds a threaded blockset under the assumption that all matching segments occur in the same order and orientation in the given sequences; inversions and duplications are not addressed. TBA is designed to be appropriate for aligning many, but by no means all, megabase-sized regions of multiple mammalian genomes. The output of TBA can be projected onto any genome chosen as a reference, thus guaranteeing that different projections present consistent predictions of which genomic positions are orthologous. This capability is illustrated using a new visualization tool to view TBA-generated alignments of vertebrate Hox clusters from both the mammalian and fish perspectives. Experimental evaluation of alignment quality, using a program that simulates evolutionary change in genomic sequences, indicates that TBA is more accurate than earlier programs. To perform the dynamic-programming alignment step, TBA runs a stand-alone program called MULTIZ, which can be used to align highly rearranged or incompletely sequenced genomes. We describe our use of MULTIZ to produce the whole-genome multiple alignments at the Santa Cruz Genome Browser.
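
The projection idea in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical representation in which each block maps a sequence name to a half-open range of positions, much as Figure 1 gives only position ranges. The names `Block`, `project`, and `covers_once`, and the toy data, are illustrative and are not taken from TBA. The coverage check encodes our assumption that every position of every sequence falls in exactly one block.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]      # half-open range [start, end), 0-based (assumed convention)
Block = Dict[str, Interval]     # sequence name -> range covered by this block


def project(blocks: List[Block], ref: str) -> List[Block]:
    """Keep the blocks that contain `ref`, ordered by their position in `ref`."""
    return sorted((b for b in blocks if ref in b), key=lambda b: b[ref][0])


def covers_once(blocks: List[Block], name: str, length: int) -> bool:
    """True if the ranges of `name` tile [0, length) with no gap and no overlap."""
    pos = 0
    for block in project(blocks, name):
        start, end = block[name]
        if start != pos or end <= start:
            return False
        pos = end
    return pos == length


# Hypothetical toy blockset (not the blocks drawn in Figure 1).
blocks = [
    {"h": (0, 100), "m": (50, 150)},
    {"m": (0, 50)},
    {"h": (100, 400), "m": (150, 400), "r": (0, 350)},
]

m_view = project(blocks, "m")   # blocks seen from the m reference
r_view = project(blocks, "r")   # blocks seen from the r reference
```

Because both views are drawn from the same set of blocks, a block that pairs positions of m and r appears identically in either projection, which is the consistency property the abstract describes.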

Figures

Figure 1
(A) Blocks (alignments) of a hypothetical threaded blockset for sequences h (400 bp), m (400 bp) and r (350 bp). Only the range of positions in each alignment is given. (B) Projection of the threaded blockset onto m.
Figure 2
(A) Alignments between the chloroplast genomes of Arabidopsis thaliana and Oenothera elata (evening primrose). Lines running from lower left to upper right indicate positions of matches on the forward strand (relative to the GenBank entries, NC_000932 and OEL271079, respectively), and lines running from upper left to lower right indicate matches in reverse complement. The alignments were computed and displayed by programs used by the PipMaker Web server (Schwartz et al. 2000). (B) Blocks of a threaded blockset for the chloroplast genomes of Arabidopsis and evening primrose.
Figure 3
A threaded blockset for vertebrate HoxA regions, displayed in our interactive blockset viewer Gmaj. (A) The red circle marks a position of interest where the tilapia reference sequence aligns with human. The block containing this position is highlighted in red in all of the alignment panels. Color underlays are blue for exons in the reference sequence and yellow for introns, and the exons are also represented as icons above the alignments. At the top of the Gmaj window, two status lines describe the positions of the mouse pointer and the red circle, respectively. Individual nucleotides for the selected block are displayed in the bottom pane, with the marked position highlighted. (B) The same region projected onto the human sequence. The underlays for human are green for EST evidence, dark blue for antisense RNA, and red for coding sequences. The conserved element from A is part of an alternative 5′-end identified by homology to a human EST from TIGR.
Figure 4
(A) Accuracy of the multiple alignments produced by different aligners on a set of nine simulated mammalian sequences of length ∼50 kb, as measured on the basis of the pairwise alignments induced by different pairs of species. The scores reported are the average of 50 simulation experiments. See the Methods section (Supplemental material) for an explanation of the R parameter. (B) Accuracy of the multiple alignments produced by different aligners on simulated human, mouse, and rat sequences of length ∼50 kb, as measured on the basis of the pairwise alignments induced by different pairs of species. The scores reported are the average of 50 simulation experiments.
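
Measuring a multiple alignment through the pairwise alignments it induces can be sketched as follows. This is only an illustration under stated assumptions: alignments are given as equal-length gapped rows, and the score is taken to be the fraction of truly aligned position pairs that the predicted alignment also aligns. The measure actually used for Figure 4 is defined in the paper's Methods and may differ. The function names and sequences are hypothetical.

```python
from typing import Dict, Set, Tuple


def induced_pairs(aln: Dict[str, str], a: str, b: str) -> Set[Tuple[int, int]]:
    """Position pairs (i in a, j in b) placed in the same column of `aln`."""
    pairs = set()
    i = j = 0
    for x, y in zip(aln[a], aln[b]):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1              # advance in sequence a past a real base
        if y != "-":
            j += 1              # advance in sequence b past a real base
    return pairs


def pair_recall(true_aln: Dict[str, str], pred_aln: Dict[str, str],
                a: str, b: str) -> float:
    """Fraction of truly aligned (a, b) pairs recovered by the prediction."""
    truth = induced_pairs(true_aln, a, b)
    if not truth:
        return 1.0              # nothing to recover for this pair of species
    return len(truth & induced_pairs(pred_aln, a, b)) / len(truth)


# Hypothetical toy alignments: the "true" one would come from a simulator.
true_aln = {"human": "ACG-T", "mouse": "ACGAT", "rat": "A-GAT"}
pred_aln = {"human": "ACGT-", "mouse": "ACGAT", "rat": "AG-AT"}

score = pair_recall(true_aln, pred_aln, "human", "mouse")
```

Averaging such a score over repeated simulations, and reporting it separately for each pair of species, would give a table in the spirit of the one the caption describes.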
Figure 5
Pictorial representation of an application of MULTIZ. M is a human-ref blockset of human, mouse, and rat, whereas N is a cow-ref blockset of cow and dog. MULTIZ uses a pairwise human-ref blockset, G, of human and cow to guide the aligning process. The output is a human-ref blockset of human, mouse, rat, cow, and dog. The reference sequence for each blockset is indicated by capital letters.
Figure 6
UCSC Genome Browser display of HUMOR alignments. (A) Ribosomal protein RPL31. The human/mouse/rat track shows the MULTIZ score normalized as described in the text. The high conservation of exons relative to introns is typical of many genes. (B) Transcription Factor FOS. In highly regulated genes such as this one, it is not unusual to find extensive conservation outside of protein-coding exons. (C) Closeup of a poorly conserved part of an RPL31 intron. When the display is zoomed in close enough, the base-by-base alignment is displayed as well as the score graph. Because the alignment is projected onto the reference sequence, a “Hidden Gaps” row indicates areas where the full alignment would have dashes in the reference sequence row. Clicking on the human/mouse/rat track opens a details page that displays the full alignment. (D) Closeup of an exon/intron boundary in FOS. The canonical “GT” 5′ consensus sequence is usually conserved, but then conservation falls off for the rest of the intron.
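
The "Hidden Gaps" idea in panel C can be illustrated with a small sketch: projecting a block onto the reference drops every column in which the reference row has a dash, and a per-position count records how many columns were dropped. This is an assumption-laden illustration, not the browser's code. The function name, the return layout, and the sequences are hypothetical.

```python
from typing import Dict, List, Tuple


def project_columns(aln: Dict[str, str], ref: str) -> Tuple[Dict[str, str], List[int]]:
    """Drop columns where `ref` has a gap.

    Returns the projected rows and a list `hidden` in which hidden[k] is the
    number of columns dropped immediately before reference base k; the final
    entry counts columns dropped after the last reference base.
    """
    kept: Dict[str, List[str]] = {name: [] for name in aln}
    hidden: List[int] = []
    pending = 0
    for col, ref_char in enumerate(aln[ref]):
        if ref_char == "-":
            pending += 1                    # column is invisible from the reference
            continue
        hidden.append(pending)
        pending = 0
        for name, row in aln.items():
            kept[name].append(row[col])
    hidden.append(pending)                  # trailing hidden columns, if any
    return {name: "".join(chars) for name, chars in kept.items()}, hidden


# Hypothetical toy block with a two-column insertion relative to human.
block = {"human": "AC--GT", "mouse": "ACTTGT", "rat": "A-TTG-"}
projected, hidden = project_columns(block, "human")
```

In this toy block, the two columns carrying the mouse and rat insertion vanish from the human-referenced view, and the nonzero entry in `hidden` marks where a display would flag them.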
Figure 7
The TBA implementation.

WEB SITE REFERENCES

    1. http://bio.cse.psu.edu/; TBA, simulated test data, and the Gmaj visualization tool.
    2. http://genome.ucsc.edu; MULTIZ and HUMOR alignments.
