Fast and Accurate Multi-class Protein Fold Recognition
with Spatial Sample Kernels
Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Abstract
Establishing structural and/or functional relationships between sequences,
for instance to infer the structural class of an unannotated protein, is a
key task in biological sequence analysis.
Recent computational methods such as the profile and neighborhood mismatch
kernels have shown promising results;
however, the incurred computational cost can be prohibitive in practice.
In this study we address the multi-class sequence classification problem
using a class of string-based kernels, the sparse spatial sample kernels
(SSSK), that are both biologically motivated and efficient to compute.
Application of the proposed methods to the multi-class protein prediction
problems (fold recognition and remote homology detection)
yields significantly better performance than existing state-of-the-art
algorithms.
The proposed methods can work with very large databases
of protein sequences because of their low computational complexity, and they
show substantial improvements in computing time over existing methods.
Supplementary Data
Datasets
Fold
Superfamily
Ding
Notes:
- All experiments are performed using the SPIDER machine learning package.
- The files are in bzip2 format.
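
Since the datasets are distributed as bzip2 files, they can be read without decompressing them to disk first. The following is a minimal sketch using Python's standard bz2 module. It assumes the decompressed content is line-oriented text, which the page does not state; the helper name read_bz2_lines and the file name in the usage note are hypothetical.

```python
import bz2


def read_bz2_lines(path, encoding="utf-8"):
    """Return the non-empty lines of a bzip2-compressed text file.

    Assumption: the archive holds plain text, one record per line.
    The actual layout of the dataset files is not described on the page.
    """
    # mode="rt" decompresses on the fly and decodes bytes to str
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]
```

For example, read_bz2_lines("fold.bz2") would return the lines of a downloaded file saved under that (hypothetical) name. If the files turn out to be binary or tar archives instead, bz2.open(path, "rb") or tarfile.open(path, "r:bz2") would be the appropriate starting point.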