Aim and Objective:
The accurate identification of protein-ligand binding sites helps
elucidate protein function and facilitates the design of new drugs. Machine-learning-based methods
have been widely used to predict protein-ligand binding sites. However, these predictors suffer
from severe class imbalance: nonbinding (majority) residues far outnumber binding (minority)
residues, which degrades prediction performance.
Materials and Methods:
In this study, we mitigate the negative impact of class imbalance by
boosting multiple granular support vector machines (BGSVM). In BGSVM, each base SVM is
trained on a granular training subset consisting of all minority samples and a reasonably
selected subset of the majority samples. The efficacy of BGSVM in dealing with class imbalance
was validated by benchmarking it against several typical imbalance learning algorithms. We further
implemented a protein-nucleotide binding site predictor, called BGSVM-NUC, using the BGSVM algorithm.
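
The construction of granular training subsets described above can be illustrated with a minimal sketch. It is not the BGSVM implementation: the abstract does not state how majority samples are selected, so plain random sampling without replacement is used here as a stand-in, and the function name `granular_subsets`, its parameters, and the 1:1 class ratio are hypothetical choices made for illustration. The boosting step and the SVM training itself are omitted.

```python
import random


def granular_subsets(labels, n_subsets, seed=0):
    """Build index lists for granular training subsets.

    Each subset keeps every minority (label 1) index and adds an equally
    sized random draw of majority (label 0) indices.
    ASSUMPTION: random selection stands in for the paper's unspecified
    "reasonable" selection of majority samples.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(minority), len(majority))  # 1:1 ratio, an assumption
    subsets = []
    for _ in range(n_subsets):
        chosen = rng.sample(majority, k)  # sampled without replacement
        subsets.append(sorted(minority + chosen))
    return subsets


# Toy data: 5 binding residues (1) among 95 nonbinding residues (0).
labels = [1] * 5 + [0] * 95
subsets = granular_subsets(labels, n_subsets=3)
```

Each index list would then be used to train one base classifier, so every base learner sees all binding residues but only a balanced share of the nonbinding ones.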
Results:
Rigorous cross-validation and independent validation tests for five types of protein-nucleotide
interactions demonstrated that the proposed BGSVM-NUC achieves promising
prediction performance and outperforms several popular sequence-based protein-nucleotide
binding site predictors. The BGSVM-NUC web server is freely available at
http://csbio.njust.edu.cn/bioinf/BGSVM-NUC/ for academic use.