Provably sensitive indexing strategies for biosequence similarity search

Research output: Contribution to conferencePaperpeer-review

17 Scopus citations

Abstract

The field of algorithms for pairwise biosequence similarity search is dominated by heuristic methods of high efficiency but uncertain sensitivity. One reason that more formal string matching algorithms with sensitivity guarantees have not been applied to biosequences is that they cannot directly find similarities that score highly under substitution score functions such as the DNAPAM-TT [20], PAM [9], or BLOSUM [12] families of matrices. We describe a general technique, score simulation, to map ungapped similarity search problems using these score functions into the problem of finding pairs of strings that are close in Hamming space. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity searches with formal guarantees of sensitivity using arbitrary score functions. In particular, we introduce the LSH-ALL-PAIRS-SIM algorithm for finding local similarities in large biosequence collections and show that is both computationally feasible and sensitive in practice, particularly for genomic sequence comparison.

Original languageEnglish
Pages90-99
Number of pages10
DOIs
StatePublished - 2002
EventRECOMB 2002: Proceedings of the Sixth Annual International Conference on Computational Biology - Washington, DC, United States
Duration: Apr 18 2002Apr 21 2002

Conference

ConferenceRECOMB 2002: Proceedings of the Sixth Annual International Conference on Computational Biology
Country/TerritoryUnited States
CityWashington, DC
Period04/18/0204/21/02

Fingerprint

Dive into the research topics of 'Provably sensitive indexing strategies for biosequence similarity search'. Together they form a unique fingerprint.

Cite this