TY - JOUR
T1 - Designing seeds for similarity search in genomic DNA
AU - Buhler, Jeremy
AU - Keich, Uri
AU - Sun, Yanni
N1 - Funding Information:
The authors thank Tom Joseph for invaluable help with the infrastructure to run the experiments; Ming Li, Tomas Vinar, and Brona Brejova for encouragement to work on finding asymptotically optimal seeds; Steven Altschul and Alejandro Schäffer for helpful suggestions on Mandala’s design; and anonymous referee for many helpful suggestions. Authors J.B. and Y.S. were supported by NSF CAREER award DBI-0237902.
PY - 2005/5
Y1 - 2005/5
N2 - Large-scale comparison of genomic DNA is of fundamental importance in annotating functional elements of genomes. To perform large comparisons efficiently, BLAST (Methods: Companion Methods Enzymol 266 (1996) 460, J. Mol. Biol. 215 (1990) 403, Nucleic Acids Res. 25(17) (1997) 3389) and other widely used tools use seeded alignment, which compares only sequences that can be shown to share a common pattern or "seed'' of matching bases. The literature suggests that the choice of seed substantially affects the sensitivity of seeded alignment, but designing and evaluating seeds is computationally challenging. This work addresses the problem of designing a seed to optimize performance of seeded alignment. We give a fast, simple algorithm based on finite automata for evaluating the sensitivity of a seed in a Markov model of ungapped alignments, along with extensions to mixtures and inhomogeneous Markov models. We give intuition and theoretical results on which seeds are good choices. Finally, we describe Mandala, a software tool for seed design, and show that it can be used to improve the sensitivity of alignment in practice.
AB - Large-scale comparison of genomic DNA is of fundamental importance in annotating functional elements of genomes. To perform large comparisons efficiently, BLAST (Methods: Companion Methods Enzymol 266 (1996) 460, J. Mol. Biol. 215 (1990) 403, Nucleic Acids Res. 25(17) (1997) 3389) and other widely used tools use seeded alignment, which compares only sequences that can be shown to share a common pattern or "seed'' of matching bases. The literature suggests that the choice of seed substantially affects the sensitivity of seeded alignment, but designing and evaluating seeds is computationally challenging. This work addresses the problem of designing a seed to optimize performance of seeded alignment. We give a fast, simple algorithm based on finite automata for evaluating the sensitivity of a seed in a Markov model of ungapped alignments, along with extensions to mixtures and inhomogeneous Markov models. We give intuition and theoretical results on which seeds are good choices. Finally, we describe Mandala, a software tool for seed design, and show that it can be used to improve the sensitivity of alignment in practice.
KW - Biosequence comparison
KW - Genomic DNA
KW - Mandala
KW - Seeded alignment
KW - String matching
UR - https://www.scopus.com/pages/publications/15344345700
U2 - 10.1016/j.jcss.2004.12.003
DO - 10.1016/j.jcss.2004.12.003
M3 - Article
AN - SCOPUS:15344345700
SN - 0022-0000
VL - 70
SP - 342
EP - 363
JO - Journal of Computer and System Sciences
JF - Journal of Computer and System Sciences
IS - 3
ER -