PLoS Comput Biol. 2013;9(3):e1002958.
doi: 10.1371/journal.pcbi.1002958. Epub 2013 Mar 21.

A discriminative approach for unsupervised clustering of DNA sequence motifs

Philip Stegmaier et al. PLoS Comput Biol. 2013.

Abstract

Algorithmic comparison of DNA sequence motifs is a problem in bioinformatics that has received increasing attention in recent years. Its main applications are the characterization of potentially novel motifs and the clustering of a motif collection to remove redundancy. Despite growing interest in motif clustering, the question of which motif clusters to aim for has so far not been systematically addressed. Here we analyzed motif similarities in a comprehensive set of vertebrate transcription factor classes. For this we developed enhanced similarity scores by including the information coverage (IC) criterion, which evaluates the fraction of information that an alignment covers in the aligned motifs. A network-based method enabled us to identify motif clusters with high correspondence to DNA-binding domain phylogenies and prior experimental findings. Based on this analysis we derived a set of motif families representing distinct binding specificities. These motif families were used to train a classifier, which was further integrated into a novel algorithm for unsupervised motif clustering. Application of the new algorithm demonstrated its superiority to previously published methods and its ability to reproduce the motif families it was trained on. As a result, our work proposes a probabilistic approach to decide whether two motifs represent common or distinct binding specificities.
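
To make the information coverage idea concrete, the following minimal Python sketch computes the fraction of a motif's total information content that falls within the aligned columns. It is an illustration only: the function names, the use of the standard per-column information content (2 bits minus the column entropy), and the plain ratio are assumptions, and this is not the paper's ave or sqr formula, neither of which is given in this record.

```python
import math

def column_information(counts):
    """Information content (bits) of one count column over A, C, G, T.

    Assumes the standard definition 2 - H(column); the paper's exact
    formula is not stated in this record.
    """
    total = sum(counts)
    info = 2.0
    for c in counts:
        p = c / total
        if p > 0:
            info += p * math.log2(p)
    return info

def information_coverage(motif, start, length):
    """Fraction of the motif's total information inside the aligned window.

    motif  -- list of count columns, each [A, C, G, T]
    start  -- index of the first aligned column (hypothetical parameter)
    length -- number of aligned columns (hypothetical parameter)
    """
    infos = [column_information(col) for col in motif]
    total = sum(infos)
    if total == 0:
        return 0.0
    return sum(infos[start:start + length]) / total

if __name__ == "__main__":
    # Toy motif: one fully conserved, one two-base, one uniform column.
    toy = [[10, 0, 0, 0], [5, 5, 0, 0], [25, 25, 25, 25]]
    print(information_coverage(toy, 0, 2))
    print(information_coverage(toy, 1, 2))
```

An alignment that overlaps only the low-information flank of a motif receives a low coverage value in this sketch, which is the behavior the criterion is described as rewarding against.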

Conflict of interest statement

I have read the journal's policy and have the following conflicts: PS is an employee of Biobase GmbH, Germany. AK and EW are employees of geneXplain GmbH, Germany. JB is an employee of Medizinische Hochschule Hannover. There are no patents, products in development, or marketed products to declare. This does not alter the authors' adherence to all the PLOS Computational Biology policies on sharing data and materials, as detailed online in the guide for authors.

Figures

Figure 1. Intra-class alignments cover a higher fraction of motif information than inter-class alignments.
(A) Example alignments illustrate the information coverage (IC) criterion. Depicted are m2match outputs of an intra-class alignment for two TFs of the BHLH class E47 and MyoD (top) and an inter-class alignment for the E47 motif and the PFM of MADS transcription factor RSRF (bottom). (B) Histograms of IC values observed in intra-class and inter-class alignments. Alignments were selected using the Euclidean distance (ED) score and information coverage was calculated using the sqr formula (Material and Methods). In total there were 436080 inter-class and 64420 intra-class alignments. Intra-class alignments showed a tendency for higher IC than inter-class alignments and specifically exhibited a pronounced peak at high IC values which is absent in the inter-class distribution.
Figure 2. Best hit and class-depth statistics achieved by different methods.
The plots cover the 12 largest classes of the Transfac set with at least 20 motifs. Each bar group represents one motif class. (A) Best hit (B) Class-depth.
Figure 3. Motif network and DNA-binding domain phylogeny for the ZFC4-NR class.
(A) Motif network of nuclear receptor motifs with colors indicating clusters extracted by MCL. (B) Phylogeny of nuclear receptor DNA-binding domains represented by matrices in the motif network. Branch colors correspond to MCL clusters in A. (C) Motif logos were generated using WebLogo for binding sites of NR3C proteins (top), estrogen receptor (middle), and nuclear receptors from other families (bottom). The half-site logos illustrate that estrogen receptor motifs were correctly clustered separately from NR3C matrices and with the other nuclear receptors.
Figure 4. ED.sqr scores for inter-class, intra-class, and intra-family alignments.
(A) Scatter plot of ED.sqr scores and alignment space values observed in inter-class alignments. The alignment space was the product of aligned motif lengths, which is proportional to the number of possible alignments. Curves show conditional mean and variance estimates (2σ above and below the mean) obtained with non-parametric regression. (B) Histograms of adjusted ED.sqr scores for inter-class (light) and intra-class alignments (dark). (C) Histograms of adjusted ED.sqr scores for inter-class (light) and intra-family alignments (dark).
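
The relationship in panel A can be sketched in Python as follows. The caption does not say how the adjusted scores were computed or which non-parametric regression was used, so everything here is an assumption: a Gaussian kernel (Nadaraya-Watson) smoother estimates the conditional mean and variance of the score given the alignment space, and the adjustment is taken to be standardization against those estimates. All function names and the bandwidth parameter are hypothetical.

```python
import math

def alignment_space(len_a, len_b):
    """Product of the two motif lengths, as described in the caption."""
    return len_a * len_b

def kernel_mean_var(xs, ys, x0, bandwidth):
    """Kernel-weighted mean and variance of ys near x0 (assumed smoother)."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    wsum = sum(weights)
    if wsum == 0:
        raise ValueError("no reference data near x0; increase bandwidth")
    mean = sum(w * y for w, y in zip(weights, ys)) / wsum
    var = sum(w * (y - mean) ** 2 for w, y in zip(weights, ys)) / wsum
    return mean, var

def adjusted_score(score, len_a, len_b, ref_spaces, ref_scores, bandwidth=10.0):
    """Standardize a score against reference scores of similar alignment space.

    ref_spaces / ref_scores would hold alignment spaces and raw scores of
    reference (e.g. inter-class) alignments; this is an assumed setup.
    """
    x0 = alignment_space(len_a, len_b)
    mean, var = kernel_mean_var(ref_spaces, ref_scores, x0, bandwidth)
    sd = math.sqrt(var)
    return (score - mean) / sd if sd > 0 else 0.0

if __name__ == "__main__":
    spaces = [100, 100, 100]
    scores = [1.0, 2.0, 3.0]
    print(adjusted_score(3.0, 10, 10, spaces, scores))
```

Standardizing against a length-dependent baseline is one way to make scores comparable across motif pairs of different sizes, since longer motifs admit more candidate alignments.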
Figure 5. Clustering of 71 non-zinc finger motifs from Jaspar.
Gray boxes between dendrogram and matrix names indicate motif clusters. The dotted line points out the 50% motif family threshold. Some clusters were merged below that threshold, because FBPs formed in the course of the clustering process provided for a better presentation of the motif family than the basic motifs.
Figure 6. Optimization of α-parameters applied in ED and SSD scores.
Optimization selected the α-parameters that performed best according to best hit (red) and class-depth statistics (blue) in the range from 0.05 to 0.95. Different subsets of TF classes were also considered: the 5 largest classes (dashed lines), classes with at least 20 matrices (solid lines), and classes with at least 10 matrices (dotted lines). Optimal α values were 0.5 for the ED.ave and ED.sqr scores, 0.55 for the ED score, 0.25 for the SSD.ave and SSD.sqr scores, and 0.3 for the SSD score; they are indicated by gray dotted lines.
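
A parameter scan of this kind can be sketched in Python as a simple grid search. The step size of 0.05 is an assumption inferred from the reported optima (0.25, 0.3, 0.5, 0.55), and `evaluate` is a hypothetical stand-in for whichever performance statistic (best hit or class-depth) is being maximized; neither is specified in this record.

```python
def alpha_grid(start=0.05, stop=0.95, step=0.05):
    """Candidate alpha values from start to stop inclusive (assumed step)."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]

def best_alpha(evaluate, alphas):
    """Return the alpha with the highest value of the evaluate callback.

    evaluate is a hypothetical function mapping alpha to a performance
    statistic where larger is better.
    """
    return max(alphas, key=evaluate)

if __name__ == "__main__":
    grid = alpha_grid()
    # Toy objective peaking at 0.5, standing in for a real statistic.
    print(best_alpha(lambda a: -(a - 0.5) ** 2, grid))
```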

Grants and funding

PS was partially funded by the ERA-Net EuroTransBio-5 project “ANEUDIA.” The work of AK was funded by the Russian federal program “Living systems,” State Contract #11.519.11.2031 and by FP7 project “SysCol” and BMBF project “GerontoShield.” The author JB gratefully acknowledges support from The Virtual Liver Network (grant 031 6154) of the German Federal Ministry of Education and Research (BMBF). PS, AK, and EW were further supported by the EU 7th Framework project “LipidomicNet” (grant no. 202272). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.