Open Access Original Research
Prediction of diabetic protein markers based on an ensemble method
1 School of Computer and Software, Nanyang Institute of Technology, 473004 Nanyang, Henan, China
2 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054 Chengdu, Sichuan, China
3 Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 324000 Quzhou, Zhejiang, China
4 School of Opto-electronic and Communication Engineering, Xiamen University of Technology, 361024 Xiamen, Fujian, China
*Correspondence: shihua@xmut.edu.cn (Hua Shi)
Front. Biosci. (Landmark Ed) 2021, 26(7), 207–221; https://doi.org/10.52586/4935
Submitted: 26 May 2021 | Revised: 10 June 2021 | Accepted: 25 June 2021 | Published: 30 July 2021
Copyright: © 2021 The Author(s). Published by BRI.
This is an open access article under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
Abstract

Introduction: Diabetic protein markers are proteins closely associated with diabetes, and they play an important role in its prevention and diagnosis. An effective method for predicting them is therefore needed. In this study, we propose an ensemble method for predicting diabetic protein markers.
Methodological issues: The ensemble method has two parts. First, we combine feature extraction methods to obtain mixed features. Next, we classify the proteins using ensemble classifiers. We use three feature extraction methods: composition and physicochemical features (abbreviated as 188D), adaptive skip gram features (abbreviated as 400D), and g-gap features (abbreviated as 670D). Six traditional classifiers are considered: decision tree, naive Bayes, logistic regression, PART, k-nearest neighbor, and kernel logistic regression. The ensemble classifiers are random forest and Vote. The experiments proceed in four steps. First, we classified protein sequences using each single feature extraction method with the traditional classifiers. Then, we compared the combined feature extraction methods with the single methods. Next, we compared the ensemble classifiers with the traditional classifiers. Finally, we used the ensemble classifiers together with the combined feature extraction methods to predict samples.
Results: Ensemble methods outperformed single methods, whether the ensemble was applied to the classifiers or to the feature extraction methods. The best performance among all methods was obtained with random forest on 588D features (188D and 400D combined). The second-best combination was random forest on 1285D features (all three methods combined). Among the single feature extraction methods, 188D was the best and g-gap the worst.
Conclusion: According to the results, the ensemble approach, whether applied as a combined feature extraction method or as an ensemble classifier, performed better than the corresponding single method. We anticipate that ensemble methods will be a useful and cost-effective tool for identifying diabetic protein markers.
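
The two parts of the ensemble method, feature combination and classifier voting, can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the 20-dimensional amino acid composition and the 400-dimensional g-gap dipeptide composition below are simplified stand-ins for the 188D, 400D and 670D descriptors, the function names and the example sequence are hypothetical, and hard majority voting is assumed as the combination rule because the abstract does not specify one. The sketch only shows that combining feature extraction methods amounts to concatenating their vectors (so dimensions add, as in 188D + 400D = 588D) and that a vote classifier aggregates the labels predicted by its base classifiers.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def composition(seq):
    """20-D amino acid composition: relative frequency of each residue."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def g_gap_dipeptide(seq, g=1):
    """400-D g-gap dipeptide composition: relative frequency of residue
    pairs separated by g intervening positions."""
    pairs = Counter(seq[i] + seq[i + g + 1] for i in range(len(seq) - g - 1))
    total = sum(pairs[d] for d in DIPEPTIDES) or 1
    return [pairs[d] / total for d in DIPEPTIDES]


def combined_features(seq, g=1):
    """Mixed features: concatenate the vectors of the individual methods."""
    return composition(seq) + g_gap_dipeptide(seq, g)


def majority_vote(predictions):
    """Hard voting (assumed rule): the label chosen by most base classifiers.
    Ties go to the label that appears first in the list."""
    return Counter(predictions).most_common(1)[0][0]


seq = "MKTAYIAKQR"  # made-up sequence, for illustration only
features = combined_features(seq)
print(len(features))             # 20 + 400 dimensions
print(majority_vote([1, 0, 1]))  # labels from three hypothetical classifiers
```

In practice, each base classifier would be trained on the combined feature vectors and the voting step applied to their predictions for each protein sequence.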

Keywords
Diabetic protein marker
Machine learning
Feature extraction method
Ensemble classifiers
Dimensionality reduction