Exploring an experiment-split method to estimate the generalization ability in new data: DeepKme as an example

2021 ◽  
Author(s):  
Guoyang Zou ◽  
Lei Li

A large number of predictors have been built on different data sets to predict different post-translational modification sites. However, to the best of our knowledge, most of them gave an overfitted (overly optimistic) estimate of their generalization ability on new data because of an intrinsic trait of the cross-validation method: it does not consider the experimental sources of the new data. We therefore proposed and explored a new method, the experiment-split method, which imitates blinded assessment to deal with this overfitting problem on new data. The experiment-split method splits the training and test data according to their experimental sources, so that the test data play the role of new data from a different experimental source. To illustrate the method concretely, we applied it to DeepKme, a predictor we built for lysine methylation sites, to demonstrate how it can be used in real scenarios. We compared the cross-validation method with the experiment-split method. The results suggested that the experiment-split method could effectively relieve overfitting compared with the cross-validation method and may be widely applicable to identification tasks whose data come from multiple experiments. We believe DeepKme will encourage researchers to think more deeply about the experiment-split method and the overfitting phenomenon and, of course, advance the study of lysine methylation and similar fields.

2021 ◽  
Vol 17 (12) ◽  
pp. e1009682
Author(s):  
Guoyang Zou ◽  
Yang Zou ◽  
Chenglong Ma ◽  
Jiaojiao Zhao ◽  
Lei Li

Many computational classifiers have been developed to predict different types of post-translational modification (PTM) sites. Their performance is measured using cross-validation or an independent test, in which experimental data from different sources are mixed and randomly split into training and test sets. However, the self-reported performance of most classifiers based on this measure is generally higher than their performance when applied to new experimental data. This suggests that the cross-validation method overestimates the generalization ability of a classifier. Here, we proposed a generalization estimate method, dubbed the experiment-split test, in which the experimental sources for the training set are different from those for the test set, so that the test set simulates data derived from a new experiment. We took the prediction of the lysine methylome (Kme) as an example and developed a deep learning-based Kme site predictor (called DeepKme) with outstanding performance. We assessed the experiment-split test by comparing it with the cross-validation method. We found that the performance measured using the experiment-split test is lower than that measured by cross-validation. As the test data of the experiment-split method were derived from an independent experimental source, this method could reflect the generalization of the predictor. Therefore, we believe that the experiment-split method can be applied to benchmark the practical performance of a given PTM model. DeepKme is freely accessible via https://github.com/guoyangzou/DeepKme.
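The contrast between a pooled random split and the experiment-split test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DeepKme code: the record layout, the `source` field, the toy source names, and both function names are hypothetical, and holding out one experimental source at a time is only one way to realize the idea.

```python
import random
from collections import defaultdict


def random_split(records, test_fraction=0.2, seed=0):
    """Conventional split: pool all experiments, shuffle, then cut."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def experiment_split(records):
    """Yield (held_out_source, train, test), holding out one source at a time.

    No experimental source ever appears on both sides of a split, so the
    test set stands in for data from an experiment the model has not seen.
    """
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)
    for held_out in sorted(by_source):
        test = by_source[held_out]
        train = [r for s, rs in by_source.items() if s != held_out for r in rs]
        yield held_out, train, test


# Toy records; the site ids and source names are made up for illustration.
records = [{"site": f"site_{i}", "source": f"exp_{i % 3}"} for i in range(12)]

for held_out, train, test in experiment_split(records):
    print(held_out, len(train), len(test))
```

With a random split, sites from the same experiment usually land in both the training and the test set, which is the mixing the abstract identifies as the cause of overestimated performance.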


2020 ◽  
Vol 17 ◽  
Author(s):  
Hongwei Liu ◽  
Bin Hu ◽  
Lei Chen ◽  
Lin Lu

Background: Identification of protein subcellular location is an important problem because the subcellular location is highly related to protein function. Determining the locations with biological experiments is fundamental. However, these experiments are costly and time-consuming. The alternative is to design effective computational methods. Objective: To date, several computational methods have been proposed in this regard. However, these methods mainly adopted features derived from the proteins themselves. On the other hand, with the development of network techniques, several embedding algorithms have been proposed that can encode the nodes of a network into feature vectors. Such algorithms connect networks with traditional classification algorithms and thus provide a new way to construct models for the prediction of protein subcellular location. Method: In this study, we analyzed features produced by three network embedding algorithms (DeepWalk, Node2vec and Mashup) that were applied to one or multiple protein networks. The obtained features were learned by one machine learning algorithm (support vector machine or random forest) to construct the model. The cross-validation method was adopted to evaluate all constructed models. Results: Evaluation with the cross-validation method showed that embedding features yielded by Mashup on multiple networks were quite informative for predicting protein subcellular location. The model based on these features was superior to some classic models. Conclusion: Embedding features yielded by a proper and powerful network embedding algorithm were effective for building a model to predict protein subcellular location, providing new pipelines for building more efficient models.


Author(s):  
Jae Young Lee ◽  
Martin Röösli ◽  
Martina S. Ragettli

This study presents a novel method for estimating the heat-attributable fractions (HAF) based on the cross-validated best temperature metric. We analyzed the association of eight temperature metrics (mean, maximum, and minimum temperature; maximum temperature during daytime; minimum temperature during nighttime; and mean, maximum, and minimum apparent temperature) with mortality and used the cross-validation method to select the best model in selected cities of Switzerland and South Korea from May to September of 1995–2015. The HAF estimated using different metrics varied by 2.69–4.09% in eight cities of Switzerland and by 0.61–0.90% in six cities of South Korea. Based on the cross-validation method, mean temperature was estimated to be the best metric, and it gave HAF values of 3.29% for Switzerland and 0.72% for South Korea. Furthermore, estimates of HAF were improved by selecting the best city-specific model for each city, giving 3.34% for Switzerland and 0.78% for South Korea. To the best of our knowledge, this study is the first to examine the uncertainty of HAF estimation originating from the selection of temperature metric and to present HAF estimation based on the cross-validation method.


2015 ◽  
Vol 9 (1) ◽  
pp. 107-114
Author(s):  
Zhou Shengquan ◽  
Zhao Xiaolong ◽  
Yao Zhaoming

To forecast the displacement of deep foundation pit supports, this paper proposes a new method that combines the cross-validation method with the support vector machine (SVM), based on random small samples. Because random small monitoring data sets are difficult to fit and forecast, the cross-validation method and different kernel functions of the support vector machine algorithm are used repeatedly to establish and optimize the displacement prediction model of an underground continuous wall, and validation samples are then used to test the accuracy of the models. The results show that this method meets the precision requirements relatively well and that the Cauchy kernel function performs better than the others. In terms of the accuracy of model fitting and prediction, this method has great advantages and can be applied to practical engineering.


2021 ◽  
Vol 15 (7) ◽  
pp. 3135-3157
Author(s):  
Jan-Hendrik Malles ◽  
Ben Marzeion

Abstract. Negative glacier mass balances in most of Earth's glacierized regions contribute roughly one-quarter to currently observed rates of sea-level rise and have likely contributed an even larger fraction during the 20th century. The distant past and future of glaciers' mass balances, and hence their contribution to sea-level rise, can only be estimated using numerical models. Since, independent of complexity, models always rely on some form of parameterizations and a choice of boundary conditions, a need for optimization arises. In this work, a model for computing monthly mass balances of glaciers on the global scale was forced with nine different data sets of near-surface air temperature and precipitation anomalies, as well as with their mean and median, leading to a total of 11 different forcing data sets. The goal is to better constrain the glaciers' 20th century sea-level budget contribution and its uncertainty. Therefore, five global parameters of the model's mass balance equations were varied systematically, within physically plausible ranges, for each forcing data set. We then identified optimal parameter combinations by cross-validating the model results against in situ annual specific mass balance observations, using three criteria: model bias, temporal correlation, and the ratio between the observed and modeled temporal standard deviation of specific mass balances. These criteria were chosen in order not to trade lower error estimates by means of the root mean squared error (RMSE) for an unrealistic interannual variability. We find that the disagreement between the different optimized model setups (i.e., ensemble members) is often larger than the uncertainties obtained via the leave-one-glacier-out cross-validation, particularly in times and places where few or no validation data are available, such as the first half of the 20th century. 
We show that the reason for this is that in regions where mass balance observations are abundant, the meteorological data are also better constrained, such that the cross-validation procedure only partly captures the uncertainty of the glacier model. For this reason, ensemble spread is introduced as an additional estimate of reconstruction uncertainty, increasing the total uncertainty compared to the model uncertainty merely obtained by the cross-validation. Our ensemble mean estimate indicates a sea-level contribution by global glaciers (outside of the ice sheets; including the Greenland periphery but excluding the Antarctic periphery) for 1901–2018 of 69.2 ± 24.3 mm sea-level equivalent (SLE), or 0.59 ± 0.21 mm SLE yr−1. While our estimates lie within the uncertainty range of most of the previously published global estimates, they agree less with those derived from GRACE data, which only cover the years 2002–2018.
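The three selection criteria named in the abstract (model bias, temporal correlation, and the ratio between the observed and modeled temporal standard deviation) can be written out as follows. This is a hedged sketch rather than the authors' implementation: the function names, the sign convention for bias (modeled minus observed), and the use of population standard deviations are assumptions, and the toy numbers are invented.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    """Population standard deviation (an assumption; the paper may differ)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def validation_criteria(observed, modeled):
    """Return (bias, correlation, std_ratio) for two equally long series."""
    mo, mm = mean(observed), mean(modeled)
    bias = mm - mo  # assumed convention: modeled minus observed
    cov = sum((o - mo) * (m - mm) for o, m in zip(observed, modeled)) / len(observed)
    correlation = cov / (std(observed) * std(modeled))
    std_ratio = std(observed) / std(modeled)  # observed over modeled
    return bias, correlation, std_ratio


# Invented annual specific mass balances for one glacier.
observed = [-0.5, 0.1, -0.9, -0.3]
modeled = [-0.4, 0.0, -0.6, -0.4]
print(validation_criteria(observed, modeled))
```

A standard deviation ratio well above 1 flags a model whose interannual variability is too small, which is the failure mode that minimizing RMSE alone could hide.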


2017 ◽  
Vol 33 (4) ◽  
pp. 543-549 ◽  
Author(s):  
Bernardo Gomes Nörenberg ◽  
Lessandro Coll Faria ◽  
Osvaldo Rettore Neto ◽  
Samuel Beskow ◽  
Alberto Colombo ◽  
...  

Abstract. To develop models representing Christiansen’s Uniformity (CU) and Distribution Uniformity (DU) as a function of wind speed, 32 in-field tests evaluating a mechanical lateral-move irrigation system used in rice production were carried out in southern Rio Grande do Sul, Brazil. These tests were used to generate two third-order polynomial models for estimating CU and DU, which were then validated with a cross-validation approach. The accuracy of the generated models was quantified by the following statistical measures: determination coefficient (R²), reliability and performance index (c), root mean square error (RMSE), and Nash-Sutcliffe coefficient (CNS). Wind direction had no significant influence on CU and DU. The CU values estimated from the cross-validation method were compared to those observed, resulting in R² = 0.44, c = 0.53, RMSE = 1.82%, and CNS = 0.43. Likewise, DU values estimated from the cross-validation method were compared to the observed values, giving R² = 0.41, c = 0.51, RMSE = 2.81%, and CNS = 0.40. The models developed in this study can be useful as a decision-support tool when applying mechanical lateral-move irrigation systems, allowing estimation of CU and DU values with satisfactory precision for wind speeds below 5.5 m s⁻¹. Keywords: In-field tests, Rice, Sprinkler irrigation.
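Three of the accuracy measures listed in the abstract have standard definitions that are easy to state in code. The sketch below is illustrative only: the function names and toy values are assumptions, the determination coefficient is taken here as the squared Pearson correlation, and the performance index c is omitted because its exact formulation is not given in the abstract.

```python
import math


def rmse(observed, estimated):
    """Root mean square error, in the same units as the data."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)


def nash_sutcliffe(observed, estimated):
    """Nash-Sutcliffe coefficient: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def r_squared(observed, estimated):
    """Determination coefficient, assumed here to be the squared Pearson r."""
    mo = sum(observed) / len(observed)
    me = sum(estimated) / len(estimated)
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, estimated))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_e = sum((e - me) ** 2 for e in estimated)
    return cov ** 2 / (var_o * var_e)


# Invented CU values in percent: observed vs. cross-validated estimates.
observed = [80.0, 82.0, 85.0, 88.0]
estimated = [81.0, 82.0, 84.0, 89.0]
print(rmse(observed, estimated), nash_sutcliffe(observed, estimated))
```

Unlike R², the Nash-Sutcliffe coefficient penalizes systematic offsets, so the two can diverge when estimates track the observations but are biased.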


BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Xin Liu ◽  
Liang Wang ◽  
Jian Li ◽  
Junfeng Hu ◽  
Xiao Zhang

Abstract. Background: Malonylation is a recently discovered post-translational modification that is associated with a variety of diseases such as Type 2 Diabetes Mellitus and different types of cancers. Compared with experimental identification of malonylation sites, computational methods are time-effective and comparatively low in cost. Results: In this study, we proposed a novel computational model called Mal-Prec (Malonylation Prediction) for malonylation site prediction through the combination of Principal Component Analysis (PCA) and Support Vector Machine (SVM). One-hot encoding, physio-chemical properties, and composition of k-spaced acid pairs were first used to extract sequence features. PCA was then applied to select optimal feature subsets, while SVM was adopted to predict malonylation sites. Five-fold cross-validation results showed that Mal-Prec can achieve better prediction performance compared with other approaches. AUC (area under the receiver operating characteristic curve) analysis achieved 96.47% and 90.72% on 5-fold cross-validation of independent data sets, respectively. Conclusion: Mal-Prec is a computationally reliable method for identifying malonylation sites in protein sequences. It outperforms existing prediction tools and can serve as a useful tool for identifying and discovering novel malonylation sites in human proteins. Mal-Prec is coded in MATLAB and is publicly available at https://github.com/flyinsky6/Mal-Prec, together with the data sets used in this study.
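For readers unfamiliar with the AUC figure reported above, the following sketch computes it from its pairwise definition: the fraction of (positive, negative) pairs in which the positive site receives the higher score, with ties counted as one half. It is written in Python for illustration and is not taken from the MATLAB implementation of Mal-Prec; the function name and the toy labels and scores are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a modified (positive) site, 0 for a negative site.
    scores: classifier outputs, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Invented labels and scores for four candidate sites.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))  # 0.75: three of four pairs are ranked correctly
```

The quadratic loop is fine for small examples; rank-based formulas give the same value more efficiently on large data sets.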

