Exploring an experiment-split method to estimate generalization ability on new data: DeepKme as an example
A large number of predictors have been built on different data sets to predict different post-translational modification sites. However, to the best of our knowledge, most of them give overly optimistic, overfitted estimates of their generalization ability on new data, because cross-validation does not account for the experimental sources of that new data. We therefore propose and explore a new strategy, the experiment-split method, which imitates blinded assessment to address this overfitting problem. The experiment-split method partitions the training and test data according to their experimental sources, so that the held-out test data can be regarded as data from previously unseen experiments, just as genuinely new data would be. To illustrate the experiment-split method in a real application, we built DeepKme, a predictor of lysine methylation sites, and used it to demonstrate how the method works in practice. Comparing the two evaluation strategies, we found that the experiment-split method effectively relieves the overfitting observed under cross-validation, and it may be broadly applicable to identification tasks whose data are contributed by multiple experiments. We believe DeepKme will encourage researchers to think more deeply about the experiment-split method and the overfitting phenomenon, and will advance the study of lysine methylation and related fields.
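For concreteness, the difference between the two evaluation strategies can be sketched as follows. This is only a minimal illustration, not part of DeepKme itself: the toy data, the experiment identifiers, and the use of scikit-learn's GroupKFold are assumptions made for demonstration, under the premise that each site sample can be tagged with the experiment it was observed in.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Toy data: 12 site samples drawn from 3 hypothetical experiments.
X = np.arange(24).reshape(12, 2)                      # feature vectors (placeholder)
y = np.random.randint(0, 2, size=12)                  # binary site labels (placeholder)
experiments = np.array(["exp_A"] * 5 + ["exp_B"] * 4 + ["exp_C"] * 3)

# Conventional cross-validation: samples from the same experiment can land in
# both training and test folds, which can inflate the performance estimate.
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    pass  # train and evaluate the predictor here

# Experiment-split evaluation: each fold holds out whole experiments, so the
# test set mimics new data from an unseen experimental source.
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=experiments):
    held_out = set(experiments[test_idx])
    # No experiment appears on both sides of the split.
    assert held_out.isdisjoint(set(experiments[train_idx]))
```

Under this scheme, the performance measured on the held-out experiments approximates what the predictor would achieve on data produced by a laboratory whose experiments never contributed to training.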