Fast Kernel Algorithms for Strings

Pavel Kuksa, Vladimir Pavlovic

Abstract

We present a new family of linear time algorithms for string comparison with mismatches under the string kernels framework. Based on sufficient statistics, our algorithms improve theoretical complexity bounds of existing approaches while scaling well in sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets and under loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times.

Supplementary Data

Codes for string kernels

String kernels in C Includes implementations of new fast kernel algorithms for strings (see papers below) such as spectrum, mismatch, gapped (subsequence), wildcard, and spatial sample kernels.

References

Scalable Algorithms for String Kernels with Inexact Matching (NIPS 2008 paper)

Sparse Spatial Sample Kernels (ICPR paper)

For more details also refer to our project pages.