Fast and Accurate Multi-class Protein Fold Recognition with Spatial Sample Kernels

Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic

Abstract

Establishing structural and/or functional relationships between sequences, for instance to infer the structural class of an unannotated protein, is a key task in biological sequence analysis. Recent computational methods such as the profile and neighborhood mismatch kernels have shown promising results; however, their computational cost can be prohibitive in practice. In this study we address the multi-class sequence classification problem using a class of string-based kernels (SSSK) that are both biologically motivated and efficient to compute. Applying the proposed methods to multi-class protein prediction problems (fold recognition and remote homology detection) yields significantly better performance than existing state-of-the-art algorithms. Because of their low computational complexity, the proposed methods can work with very large databases of protein sequences and show substantial improvements in computing time over existing methods.

Supplementary Data

Datasets

Fold Dataset
- The dataset is obtained from http://www.ccls.columbia.edu/compbio/adaptive/
Superfamily Dataset
- The dataset is obtained from http://www.ccls.columbia.edu/compbio/adaptive/
Ding and Dubchak Dataset
- The dataset is obtained from http://ranger.uta.edu/~chqding/bioinfo.html

Fold
- triple(1,3) Kernel (non-redundant)
- double(1,5) Kernel (non-redundant)
- profile Kernel (non-redundant)

Superfamily
- triple(1,3) Kernel (non-redundant)
- double(1,5) Kernel (non-redundant)
- profile Kernel (non-redundant)

Ding
- triple(1,3) Kernel (non-redundant)
- double(1,5) Kernel (non-redundant)
- profile Kernel (non-redundant)

Notes:

All experiments were performed using the SPIDER machine learning package.
The files are in bzip2 format.
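
Since the files are distributed in bzip2 format, one way to inspect or unpack them is with Python's standard bz2 module. The sketch below is only illustrative: the function names and the file name in the usage note are hypothetical, and the internal layout of the kernel files is not described on this page, so the text preview assumes line-oriented content and may show little that is readable if a file turns out to be binary.

```python
import bz2


def read_bz2_lines(path, limit=5):
    """Return up to `limit` text lines from a bzip2-compressed file.

    Assumes the decompressed content is line-oriented text; bytes that
    cannot be decoded are replaced rather than raising an error.
    """
    lines = []
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for i, line in enumerate(handle):
            if i >= limit:
                break
            lines.append(line.rstrip("\n"))
    return lines


def decompress_bz2(src_path, dst_path, chunk_size=1 << 20):
    """Stream-decompress src_path into dst_path; return the bytes written."""
    written = 0
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
    return written


# Usage (hypothetical file name, not taken from this page):
# print(read_bz2_lines("fold_triple_kernel.bz2"))
# decompress_bz2("fold_triple_kernel.bz2", "fold_triple_kernel")
```

Decompressing in chunks keeps memory use flat, which may matter if the kernel files are large.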