Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation

Author(s):  
Taehyeon Kim ◽  
Jaehoon Oh ◽  
Nak Yil Kim ◽  
Sangwook Cho ◽  
Se-Young Yun

Knowledge distillation (KD), which transfers knowledge from a cumbersome teacher model to a lightweight student model, has been investigated as a way to design efficient neural architectures. Generally, the objective function of KD is the Kullback-Leibler (KL) divergence loss between the softened probability distributions of the teacher and student models, controlled by the temperature scaling hyperparameter τ. Despite its widespread use, few studies have discussed how such softening influences generalization. Here, we theoretically show that the KL divergence loss focuses on logit matching as τ increases and on label matching as τ goes to 0, and empirically show that logit matching is, in general, positively correlated with performance improvement. From this observation, we consider an intuitive KD loss function, the mean squared error (MSE) between the logit vectors, so that the student model can directly learn the logits of the teacher model. The MSE loss outperforms the KL divergence loss, which we explain by the difference in penultimate-layer representations induced by the two losses. Furthermore, we show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with a small τ, mitigates label noise. The code to reproduce the experiments is publicly available online at https://github.com/jhoon-oh/kd_data/.
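
Below is a minimal sketch, assuming PyTorch and tensors of shape (batch, num_classes), of the two distillation objectives compared in the paper: the temperature-scaled KL divergence loss and the MSE between logit vectors. It is illustrative only and is not the authors' released code (see the repository linked above for that).

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, tau=4.0):
    # KL divergence between the softened distributions, scaled by tau^2
    # so that gradient magnitudes stay comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    return (tau ** 2) * F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def kd_mse_loss(student_logits, teacher_logits):
    # Direct logit matching: mean squared error between the raw logit vectors.
    return F.mse_loss(student_logits, teacher_logits)

# Usage with random logits (8 examples, 10 classes):
student, teacher = torch.randn(8, 10), torch.randn(8, 10)
print(kd_kl_loss(student, teacher, tau=4.0).item(), kd_mse_loss(student, teacher).item())
```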

2006 ◽  
Vol 25 (1) ◽  
pp. 117-138 ◽  
Author(s):  
Fernanda P. M. Peixe ◽  
Alastair R. Hall ◽  
Kostas Kyriakoulis

2011 ◽  
Vol 60 (2) ◽  
pp. 248-255 ◽  
Author(s):  
Sangmun Shin ◽  
Funda Samanlioglu ◽  
Byung Rae Cho ◽  
Margaret M. Wiecek

2018 ◽  
Vol 10 (12) ◽  
pp. 4863 ◽  
Author(s):  
Chao Huang ◽  
Longpeng Cao ◽  
Nanxin Peng ◽  
Sijia Li ◽  
Jing Zhang ◽  
...  

Photovoltaic (PV) modules convert renewable and sustainable solar energy into electricity. However, the uncertainty of PV power production brings challenges for grid operation. To facilitate the management and scheduling of PV power plants, forecasting is an essential technique. In this paper, a robust multilayer perceptron (MLP) neural network was developed for day-ahead forecasting of hourly PV power. A generic MLP is usually trained by minimizing the mean squared error loss. The mean squared error is sensitive to a few particularly large errors, which can lead to a poor estimator. To tackle this problem, the pseudo-Huber loss function, which combines the best properties of squared loss and absolute loss, was adopted in this paper. The effectiveness and efficiency of the proposed method were verified by benchmarking against a generic MLP network with real PV data. Numerical experiments illustrated that the proposed method performed better than the generic MLP network in terms of root mean squared error (RMSE) and mean absolute error (MAE).
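
A minimal sketch of the pseudo-Huber loss referred to above; the transition parameter delta is an illustrative assumption, not the value used in the paper. Small residuals are penalized approximately quadratically and large residuals approximately linearly, which limits the influence of a few very large errors.

```python
import numpy as np

def pseudo_huber(residuals, delta=1.0):
    # delta^2 * (sqrt(1 + (r/delta)^2) - 1), averaged over the residuals.
    return np.mean(delta ** 2 * (np.sqrt(1.0 + (residuals / delta) ** 2) - 1.0))

# One large error dominates the MSE but affects the pseudo-Huber loss far less:
r = np.array([0.1, -0.2, 0.05, 5.0])
print("MSE:         ", np.mean(r ** 2))
print("pseudo-Huber:", pseudo_huber(r))
```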


2016 ◽  
Vol 5 (1) ◽  
pp. 39 ◽  
Author(s):  
Abbas Najim Salman ◽  
Maymona Ameen

This paper is concerned with a minimax shrinkage estimator that uses a double-stage shrinkage technique to lower the mean squared error when estimating the shape parameter (α) of the generalized Rayleigh distribution in a region (R) around available prior knowledge (α₀) about the actual value of α, taken as an initial estimate, in the case where the scale parameter (λ) is known.

In situations where experimentation is time-consuming or very costly, a double-stage procedure can be used to reduce the expected sample size needed to obtain the estimator.

The proposed estimator is shown to have a smaller mean squared error for certain choices of the shrinkage weight factor ψ(·) and a suitable region R.

Expressions for the bias, mean squared error (MSE), expected sample size E(n | α, R), expected sample size proportion E(n | α, R)/n, probability of avoiding the second sample, and percentage of overall sample saved are derived for the proposed estimator.

Numerical results and conclusions for the expressions above are presented when the considered estimator is a testimator of level of significance Δ.

Comparisons with the minimax estimator and with the most recent studies are made to show the effectiveness of the proposed estimator.
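
The double-stage shrinkage (testimator) rule described above has the following generic form; this is a schematic statement of the technique, not the paper's exact expressions.

```latex
% alpha_1-hat: first-stage estimate, alpha_2-hat: estimate after the second sample,
% psi(.): shrinkage weight factor, R: acceptance region around the prior guess alpha_0.
\[
\hat{\alpha} =
\begin{cases}
\psi(\hat{\alpha}_1)\,\hat{\alpha}_1 + \bigl(1-\psi(\hat{\alpha}_1)\bigr)\alpha_0, & \hat{\alpha}_1 \in R \quad \text{(stop after the first sample)},\\[4pt]
\hat{\alpha}_2, & \hat{\alpha}_1 \notin R \quad \text{(take the second sample)},
\end{cases}
\qquad
\mathrm{MSE}(\hat{\alpha}) = \mathbb{E}\bigl[(\hat{\alpha}-\alpha)^2\bigr].
\]
```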


2020 ◽  
Vol 2020 ◽  
pp. 1-22
Author(s):  
Byung-Kwon Son ◽  
Do-Jin An ◽  
Joon-Ho Lee

In this paper, passive localization of an emitter from noisy angle-of-arrival (AOA) measurements using the Brown DWLS (Distance Weighted Least Squares) algorithm is considered. The accuracy of AOA-based localization is quantified by the mean squared error. Various estimates of the AOA-localization algorithm have been derived previously (Doğançay and Hmam, 2008). The explicit expression for the location estimate from that study is used to obtain an analytic expression for the mean squared error (MSE) of one of these estimates. To validate the derived expression, the MSE obtained from Monte Carlo simulation is compared with the analytically derived MSE.
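
A minimal sketch of the kind of Monte Carlo check described above: simulate noisy AOA measurements, localize by plain (unweighted) least-squares triangulation, and estimate the MSE empirically. The estimator here is a generic pseudolinear one for illustration, not the Brown DWLS weighting analyzed in the paper, and the geometry and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sensors = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])  # sensor positions (m)
emitter = np.array([60.0, 80.0])                              # true emitter location (m)
sigma = np.deg2rad(1.0)                                       # AOA noise std (assumed)

def localize(bearings):
    # Each bearing defines a line through its sensor:
    # sin(theta)*(x - xs) - cos(theta)*(y - ys) = 0; stack and solve by least squares.
    A = np.column_stack([np.sin(bearings), -np.cos(bearings)])
    b = np.sin(bearings) * sensors[:, 0] - np.cos(bearings) * sensors[:, 1]
    est, *_ = np.linalg.lstsq(A, b, rcond=None)
    return est

true_bearings = np.arctan2(emitter[1] - sensors[:, 1], emitter[0] - sensors[:, 0])
squared_errors = []
for _ in range(10000):
    noisy = true_bearings + sigma * rng.standard_normal(len(sensors))
    squared_errors.append(np.sum((localize(noisy) - emitter) ** 2))
print("empirical MSE:", np.mean(squared_errors))  # to be compared with the analytic MSE
```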


2009 ◽  
Vol 106 (3) ◽  
pp. 975-983 ◽  
Author(s):  
Mark Burnley

To determine whether the asymptote of the torque-duration relationship (critical torque) could be estimated from the torque measured at the end of a series of maximal voluntary contractions (MVCs) of the quadriceps, eight healthy men performed eight laboratory tests. Following familiarization, subjects performed two tests in which they were required to perform 60 isometric MVCs over a period of 5 min (3 s contraction, 2 s rest), and five tests involving intermittent isometric contractions at ∼35–60% MVC, each performed to task failure. Critical torque was determined using linear regression of the torque impulse and contraction time during the submaximal tests, and the end-test torque during the MVCs was calculated from the mean of the last six contractions of the test. During the MVCs, voluntary torque declined from 263.9 ± 44.6 to 77.8 ± 17.8 N·m. The end-test torque was not different from the critical torque (77.9 ± 15.9 N·m; 95% paired-sample confidence interval, −6.5 to 6.2 N·m). The root mean squared error of the estimation of critical torque from the end-test torque was 7.1 N·m. Twitch interpolation showed that voluntary activation declined from 90.9 ± 6.5% to 66.9 ± 13.1% (P < 0.001), and the potentiated doublet response declined from 97.7 ± 23.0 to 46.9 ± 6.7 N·m (P < 0.001) during the MVCs, indicating the development of both central and peripheral fatigue. These data indicate that fatigue during 5 min of intermittent isometric MVCs of the quadriceps leads to an end-test torque that closely approximates the critical torque.
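
A minimal sketch of the two torque calculations described above: critical torque as the slope of a linear regression of torque impulse on contraction time across the submaximal tests, and end-test torque as the mean of the last six MVCs. All numbers are illustrative placeholders, not data from the study.

```python
import numpy as np

# Submaximal tests: total torque impulse (N·m·s) vs. time to task failure (s).
contraction_time = np.array([120.0, 180.0, 240.0, 330.0, 480.0])
torque_impulse = np.array([17.4e3, 22.0e3, 26.8e3, 33.7e3, 45.4e3])

slope, intercept = np.polyfit(contraction_time, torque_impulse, 1)
print("critical torque (slope):", round(slope, 1), "N·m")

# MVC protocol: end-test torque is the mean of the last six contractions.
last_six = np.array([79.0, 77.5, 78.2, 76.9, 78.8, 77.3])
print("end-test torque:", round(last_six.mean(), 1), "N·m")
```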


Author(s):  
MOULOUD ADEL ◽  
DANIEL ZUWALA ◽  
MONIQUE RASIGNI ◽  
SALAH BOURENNANE

A noise reduction scheme for digitized mammographic phantom images is presented. The algorithm is based on a direct contrast modification method with an optimal function, obtained by using the mean squared error as a criterion. Computer-simulated images containing objects similar to those observed in the phantom are built to evaluate the performance of the algorithm. Noise reduction results obtained on both simulated and real phantom images show that the developed method may be considered a good preprocessing step toward automating phantom film evaluation by means of image processing.
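
A schematic sketch of using the mean squared error as the selection criterion for a contrast-modification function, in the spirit of the approach above; the power-law transform and the parameter grid are illustrative assumptions, not the paper's optimal function.

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def contrast_transform(image, gamma):
    # Simple power-law contrast modification on intensities in [0, 1].
    return np.clip(image, 0.0, 1.0) ** gamma

def best_parameter(noisy, reference, gammas):
    # Pick the parameter whose output is closest (in MSE) to the clean simulated reference.
    return min(gammas, key=lambda g: mse(contrast_transform(noisy, g), reference))

# Example on a synthetic "phantom-like" image with additive noise:
rng = np.random.default_rng(0)
reference = np.linspace(0.0, 1.0, 64).reshape(8, 8)
noisy = np.clip(reference + 0.05 * rng.standard_normal(reference.shape), 0.0, 1.0)
print(best_parameter(noisy, reference, gammas=[0.6, 0.8, 1.0, 1.2, 1.4]))
```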


2010 ◽  
Vol 1 (4) ◽  
pp. 17-45
Author(s):  
Antons Rebguns ◽  
Diana F. Spears ◽  
Richard Anderson-Sprecher ◽  
Aleksey Kletsov

This paper presents a novel theoretical framework for swarms of agents. Before deploying a swarm for a task, it is advantageous to predict whether a desired percentage of the swarm will succeed. The authors present a framework that uses a small group of expendable “scout” agents to predict the success probability of the entire swarm, thereby preventing many agent losses. The scouts apply one of two formulas for prediction: the standard Bernoulli trials formula or the new Bayesian formula. For experimental evaluation, the framework is applied to simulated agents navigating around obstacles to reach a goal location. Extensive experimental results compare the mean squared error of the predictions of both formulas against ground truth under varying circumstances. The results indicate the accuracy and robustness of the Bayesian approach. The framework also yields an intriguing result, namely that both formulas usually predict better in the presence of (Lennard-Jones) inter-agent forces than when their independence assumptions hold.
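
A minimal sketch contrasting the two styles of prediction described above: a plug-in Bernoulli/binomial estimate versus a Bayesian (Beta-Binomial) predictive with a uniform prior. These are generic versions of the two approaches, not the authors' exact formulas; k of n scouts succeed and the swarm of N agents needs at least m successes.

```python
from scipy import stats

def bernoulli_prediction(k, n, N, m):
    # Plug-in estimate: p_hat = k/n, then P(at least m of N succeed) under Binomial(N, p_hat).
    p_hat = k / n
    return stats.binom.sf(m - 1, N, p_hat)

def bayesian_prediction(k, n, N, m):
    # Uniform prior -> Beta(1 + k, 1 + n - k) posterior -> Beta-Binomial predictive for the swarm.
    return stats.betabinom.sf(m - 1, N, 1 + k, 1 + n - k)

print(bernoulli_prediction(k=4, n=5, N=50, m=40))
print(bayesian_prediction(k=4, n=5, N=50, m=40))
```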


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Jayaraman J. Thiagarajan ◽  
Bindya Venkatesh ◽  
Rushil Anirudh ◽  
Peer-Timo Bremer ◽  
Jim Gaffney ◽  
...  

Predictive models that accurately emulate complex scientific processes can achieve speed-ups over numerical simulators or experiments and at the same time provide surrogates for improving the subsequent analysis. Consequently, there is a recent surge in utilizing modern machine learning methods to build data-driven emulators. In this work, we study an often overlooked, yet important, problem of choosing loss functions while designing such emulators. Popular choices such as the mean squared error or the mean absolute error are based on a symmetric noise assumption and can be unsuitable for heterogeneous data or asymmetric noise distributions. We propose Learn-by-Calibrating, a novel deep learning approach based on interval calibration for designing emulators that can effectively recover the inherent noise structure without any explicit priors. Using a large suite of use cases, we demonstrate the efficacy of our approach in providing high-quality emulators, when compared to widely adopted loss function choices, even in small-data regimes.
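
A small illustration of the symmetric-noise point made above: under skewed (here exponential) noise, the MSE-optimal constant prediction is the mean of the observations and the MAE-optimal one is the median, and both are pulled away from the noise-free target by different amounts. This is purely didactic and unrelated to the Learn-by-Calibrating implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 2.0
noise = rng.exponential(scale=1.0, size=100_000)  # asymmetric noise with mean 1
observations = true_value + noise

print("MSE-optimal constant (mean):  ", observations.mean())      # about 3.0
print("MAE-optimal constant (median):", np.median(observations))  # about 2.69
print("noise-free target:            ", true_value)
```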

