An algorithm for finding signals of unknown length in DNA sequences
- PMID: 11473011
- DOI: 10.1093/bioinformatics/17.suppl_1.s207
An algorithm for finding signals of unknown length in DNA sequences
Abstract
Pattern discovery in unaligned DNA sequences is a challenging problem in both computer science and molecular biology. Several different methods and techniques have been proposed so far, but in most of the cases signals in DNA sequences are very complicated and avoid detection. Exact exhaustive methods can solve the problem only for short signals with a limited number of mutations. In this work, we extend exhaustive enumeration also to longer patterns. More in detail, the basic version of algorithm presented in this paper, given as input a set of sequences and an error ratio epsilon < 1, finds all patterns that occur in at least q sequences of the set with at most epsilonm mutations, where m is the length of the pattern. The only restriction is imposed on the location of mutations along the signal. That is, a valid occurrence of a pattern can present at most [epsiloni] mismatches in the first i nucleotides, and so on. However, we show how the algorithm can be used also when no assumption can be made on the position of mutations. In this case, it is also possible to have an estimate of the probability of finding a signal according to the signal length, the error ratio, and the input parameters. Finally, we discuss some significance measures that can be used to sort the patterns output by the algorithm.
Similar articles
-
SIMD parallelization of the WORDUP algorithm for detecting statistically significant patterns in DNA sequences.Comput Appl Biosci. 1993 Dec;9(6):701-7. doi: 10.1093/bioinformatics/9.6.701. Comput Appl Biosci. 1993. PMID: 8143157
-
An improved heuristic algorithm for finding motif signals in DNA sequences.IEEE/ACM Trans Comput Biol Bioinform. 2011 Jul-Aug;8(4):959-75. doi: 10.1109/TCBB.2010.92. IEEE/ACM Trans Comput Biol Bioinform. 2011. PMID: 20855921
-
Finding composite regulatory patterns in DNA sequences.Bioinformatics. 2002;18 Suppl 1:S354-63. doi: 10.1093/bioinformatics/18.suppl_1.s354. Bioinformatics. 2002. PMID: 12169566
-
An O(N2) algorithm for discovering optimal Boolean pattern pairs.IEEE/ACM Trans Comput Biol Bioinform. 2004 Oct-Dec;1(4):159-70. doi: 10.1109/TCBB.2004.36. IEEE/ACM Trans Comput Biol Bioinform. 2004. PMID: 17051698
-
The emergence of pattern discovery techniques in computational biology.Metab Eng. 2000 Jul;2(3):159-77. doi: 10.1006/mben.2000.0151. Metab Eng. 2000. PMID: 11056059 Review.
Cited by
-
Towards a theoretical understanding of false positives in DNA motif finding.BMC Bioinformatics. 2012 Jun 27;13:151. doi: 10.1186/1471-2105-13-151. BMC Bioinformatics. 2012. PMID: 22738169 Free PMC article.
-
A survey on deep learning in DNA/RNA motif mining.Brief Bioinform. 2021 Jul 20;22(4):bbaa229. doi: 10.1093/bib/bbaa229. Brief Bioinform. 2021. PMID: 33005921 Free PMC article.
-
Comparative genomic reconstruction of transcriptional regulatory networks in bacteria.Chem Rev. 2007 Aug;107(8):3467-97. doi: 10.1021/cr068309+. Epub 2007 Jul 18. Chem Rev. 2007. PMID: 17636889 Free PMC article. Review. No abstract available.
-
SOMEA: self-organizing map based extraction algorithm for DNA motif identification with heterogeneous model.BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S16. doi: 10.1186/1471-2105-12-S1-S16. BMC Bioinformatics. 2011. PMID: 21342545 Free PMC article.
-
Closest string with outliers.BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S55. doi: 10.1186/1471-2105-12-S1-S55. BMC Bioinformatics. 2011. PMID: 21342588 Free PMC article.
Publication types
MeSH terms
Substances
LinkOut - more resources
Full Text Sources
Other Literature Sources