Pavel P. Kuksa's Publications

Efficient use of unlabeled data for protein sequence classification: a comparative study

Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Efficient use of unlabeled data for protein sequence classification: a comparative study. BMC Bioinformatics, 10(Suppl 4):S2, 2009. Impact factor: 3.78

Abstract

BACKGROUND: Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags, the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers.

RESULTS: Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time.

CONCLUSION: The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably.

BibTeX

@article{bmc2009unlabeled,
	Abstract = {BACKGROUND: Recent studies in computational primary protein sequence
	analysis have leveraged the power of unlabeled data. For example,
	predictive models based on string kernels trained on sequences known
	to belong to particular folds or superfamilies, the so-called labeled
	data set, can attain significantly improved accuracy if this data
	is supplemented with protein sequences that lack any class tags, the
	unlabeled data. In this study, we present a principled and biologically
	motivated computational framework that more effectively exploits
	the unlabeled data by only using the sequence regions that are more
	likely to be biologically relevant for better prediction accuracy.
	As overly-represented sequences in large uncurated databases may
	bias the estimation of computational models that rely on unlabeled
	data, we also propose a method to remove this bias and improve performance
	of the resulting classifiers. RESULTS: Combined with state-of-the-art
	string kernels, our proposed computational framework achieves very
	accurate semi-supervised protein remote fold and homology detection
	on three large unlabeled databases. It outperforms current state-of-the-art
	methods and exhibits significant reduction in running time. CONCLUSION: The
	unlabeled sequences used under the semi-supervised setting resemble
	the unpolished gemstones; when used as-is, they may carry unnecessary
	features and hence compromise the classification accuracy but once
	cut and polished, they improve the accuracy of the classifiers considerably.},
	Author = {Kuksa, Pavel and Huang, Pai-Hsi and Pavlovic, Vladimir},
	Bib2Html_Pubtype = {Journal},
	Doi = {10.1186/1471-2105-10-S4-S2},
	Issn = {1471-2105},
	Journal = {BMC Bioinformatics},
	Note = {Impact factor: 3.78},
	Number = {Suppl 4},
	Pages = {S2},
	Pubmedid = {19426450},
	Title = {Efficient use of unlabeled data for protein sequence classification: a comparative study},
	Url = {http://www.biomedcentral.com/1471-2105/10/S4/S2},
	Volume = {10},
	Year = {2009},
	Bdsk-Url-1 = {http://www.biomedcentral.com/1471-2105/10/S4/S2},
	Bdsk-Url-2 = {http://dx.doi.org/10.1186/1471-2105-10-S4-S2}}

Generated by bib2html.pl (written by Patrick Riley) on Tue Mar 17, 2020 17:09:56