Sparse Spatial Kernels for Efficient Large Scale Sequence Classification

Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic

Abstract

In this work, we propose a new family of string-based kernel classification methods, sparse spatial sample kernels (SSSK), that offer low computational cost and display state-of-the art classification performance on a variety of distinct classification tasks. Unlike traditional string kernels that measure sequence similarity using dense substrings on a single scale, SSSKs consider sparse, spatially distributed representation that may occur across multiple scales. This representation has two distinct benefits. 1) It allows inclusion of the relative spatial positions among multiple short strings as well as the string content in the affinity measure; 2) The sparse representation based on short strings results in extremely efficient computational algorithms that scale well both with the sequence alphabet size as well as the size of the training data. At the same time, the resulting classifiers yield state-of-the-art classification performance. While our kernels are motivated by a challenging task of inferring similarity between biological sequences under complex transformations (e.g. highly diverse mutation/insertion/deletion processes), we show that the proposed methods are robust and can be readily used in many distinct sequence classification settings. On the task of protein remote homology detection our method yields significantly better performance than existing state-of-the-art algorithms. We demonstrate similar improvements in text data categorization and classification of music representations. In all three domains SSSKs also show substantial reduction in computing time over the existing methods.