Fast and Accurate Multi-class Protein Fold Recognition with Spatial Sample Kernels

Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic

Abstract

Establishing structural and/or functional relationships between sequences, for instance to infer the structural class of an unannotated protein, is a key task in biological sequence analysis. Recent computational methods such as the profile and neighborhood mismatch kernels have shown promising results; however, their computational cost can be prohibitive in practice. In this study we address the multi-class sequence classification problem using a class of string-based kernels, the sparse spatial sample kernels (SSSK), that are both biologically motivated and efficient to compute. Applying the proposed methods to multi-class protein prediction problems (fold recognition and remote homology detection) yields significantly better performance than existing state-of-the-art algorithms. Because of their low computational complexity, the proposed methods can work with very large databases of protein sequences and show substantial improvements in computing time over existing methods.

Supplementary Data

Datasets

Fold

triple(1,3) Kernel (non-redundant)
double(1,5) Kernel (non-redundant)
profile Kernel (non-redundant)

Superfamily

triple(1,3) Kernel (non-redundant)
double(1,5) Kernel (non-redundant)
profile Kernel (non-redundant)

Ding

triple(1,3) Kernel (non-redundant)
double(1,5) Kernel (non-redundant)
profile Kernel (non-redundant)

Notes: