Provably Sensitive Indexing Strategies for Biosequence Similarity Search

Research output: Contribution to journalArticlepeer-review

6 Scopus citations

Abstract

Fast algorithms for pairwise biosequence similarity search frequently use filtering and indexing strategies to identify potential matches between a query sequence and a database. For the most part, these strategies are not informed by the substitution score matrices commonly used by comparison algorithms to assign numerical scores to pairs of aligned residues. Consequently, although many filtering strategies offer strong formal guarantees about their ability to detect pairs of sequences differing by few substitutions, these methods can make no guarantee of detecting pairs with high similarity scores. We describe a general technique, score simulation, to help resolve the tension between existing filtering techniques and the use of score matrices. Score simulation, using score matrices, maps ungapped similarity search problems to the simpler problem of finding pairs of strings that differ by few substitutions. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity search with arbitrary score matrices while maintaining strong formal guarantees of sensitivity. We introduce the LSH-ALL-PAIRS-SIM algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.

Original languageEnglish
Pages (from-to)399-417
Number of pages19
JournalJournal of Computational Biology
Volume10
Issue number3-4
DOIs
StatePublished - 2003

Keywords

  • Alignment
  • Biosequence comparison
  • Metric embedding
  • Score matrices

Fingerprint

Dive into the research topics of 'Provably Sensitive Indexing Strategies for Biosequence Similarity Search'. Together they form a unique fingerprint.

Cite this