TY - JOUR
T1 - Provably Sensitive Indexing Strategies for Biosequence Similarity Search
AU - Buhler, Jeremy
PY - 2003
Y1 - 2003
N2 - Fast algorithms for pairwise biosequence similarity search frequently use filtering and indexing strategies to identify potential matches between a query sequence and a database. For the most part, these strategies are not informed by the substitution score matrices commonly used by comparison algorithms to assign numerical scores to pairs of aligned residues. Consequently, although many filtering strategies offer strong formal guarantees about their ability to detect pairs of sequences differing by few substitutions, these methods can make no guarantee of detecting pairs with high similarity scores. We describe a general technique, score simulation, to help resolve the tension between existing filtering techniques and the use of score matrices. Score simulation, using score matrices, maps ungapped similarity search problems to the simpler problem of finding pairs of strings that differ by few substitutions. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity search with arbitrary score matrices while maintaining strong formal guarantees of sensitivity. We introduce the LSH-ALL-PAIRS-SIM algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.
AB - Fast algorithms for pairwise biosequence similarity search frequently use filtering and indexing strategies to identify potential matches between a query sequence and a database. For the most part, these strategies are not informed by the substitution score matrices commonly used by comparison algorithms to assign numerical scores to pairs of aligned residues. Consequently, although many filtering strategies offer strong formal guarantees about their ability to detect pairs of sequences differing by few substitutions, these methods can make no guarantee of detecting pairs with high similarity scores. We describe a general technique, score simulation, to help resolve the tension between existing filtering techniques and the use of score matrices. Score simulation, using score matrices, maps ungapped similarity search problems to the simpler problem of finding pairs of strings that differ by few substitutions. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity search with arbitrary score matrices while maintaining strong formal guarantees of sensitivity. We introduce the LSH-ALL-PAIRS-SIM algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.
KW - Alignment
KW - Biosequence comparison
KW - Metric embedding
KW - Score matrices
UR - http://www.scopus.com/inward/record.url?scp=0242606474&partnerID=8YFLogxK
U2 - 10.1089/10665270360688093
DO - 10.1089/10665270360688093
M3 - Article
C2 - 13677335
AN - SCOPUS:0242606474
SN - 1066-5277
VL - 10
SP - 399
EP - 417
JO - Journal of Computational Biology
JF - Journal of Computational Biology
IS - 3-4
ER -