Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation

Author(s):  
Taehyeon Kim ◽  
Jaehoon Oh ◽  
Nak Yil Kim ◽  
Sangwook Cho ◽  
Se-Young Yun

Knowledge distillation (KD), which transfers knowledge from a cumbersome teacher model to a lightweight student model, has been investigated as a way to design efficient neural architectures. Generally, the objective function of KD is the Kullback-Leibler (KL) divergence loss between the softened probability distributions of the teacher and student models, controlled by the temperature scaling hyperparameter τ. Despite its widespread use, few studies have discussed how such softening influences generalization. Here, we theoretically show that the KL divergence loss focuses on logit matching as τ increases and on label matching as τ goes to 0, and empirically show that logit matching is, in general, positively correlated with performance improvement. From this observation, we consider an intuitive KD loss function, the mean squared error (MSE) between the logit vectors, so that the student model can directly learn the logits of the teacher model. The MSE loss outperforms the KL divergence loss, which we explain by the difference in penultimate-layer representations induced by the two losses. Furthermore, we show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with a small τ, mitigates label noise. The code to reproduce the experiments is publicly available online at https://github.com/jhoon-oh/kd_data/.
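
Below is a minimal sketch, assuming PyTorch and tensors of shape (batch, num_classes), of the two distillation objectives compared in the paper: the temperature-scaled KL divergence loss and the MSE between logit vectors. It is illustrative only and is not the authors' released code (see the repository linked above for that).

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, tau=4.0):
    # KL divergence between the softened distributions, scaled by tau^2
    # so that gradient magnitudes stay comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    return (tau ** 2) * F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def kd_mse_loss(student_logits, teacher_logits):
    # Direct logit matching: mean squared error between the raw logit vectors.
    return F.mse_loss(student_logits, teacher_logits)

# Usage with random logits (8 examples, 10 classes):
student, teacher = torch.randn(8, 10), torch.randn(8, 10)
print(kd_kl_loss(student, teacher, tau=4.0).item(), kd_mse_loss(student, teacher).item())
```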

2006 ◽  
Vol 25 (1) ◽  
pp. 117-138 ◽  
Author(s):  
Fernanda P. M. Peixe ◽  
Alastair R. Hall ◽  
Kostas Kyriakoulis

2011 ◽  
Vol 60 (2) ◽  
pp. 248-255 ◽  
Author(s):  
Sangmun Shin ◽  
Funda Samanlioglu ◽  
Byung Rae Cho ◽  
Margaret M. Wiecek

2018 ◽  
Vol 10 (12) ◽  
pp. 4863 ◽  
Author(s):  
Chao Huang ◽  
Longpeng Cao ◽  
Nanxin Peng ◽  
Sijia Li ◽  
Jing Zhang ◽  
...  

Photovoltaic (PV) modules convert renewable and sustainable solar energy into electricity. However, the uncertainty of PV power production brings challenges for grid operation. To facilitate the management and scheduling of PV power plants, forecasting is an essential technique. In this paper, a robust multilayer perceptron (MLP) neural network was developed for day-ahead forecasting of hourly PV power. A generic MLP is usually trained by minimizing the mean squared error loss. The mean squared error is sensitive to a few particularly large errors, which can lead to a poor estimator. To tackle this problem, the pseudo-Huber loss function, which combines the best properties of squared loss and absolute loss, was adopted in this paper. The effectiveness and efficiency of the proposed method were verified by benchmarking against a generic MLP network with real PV data. Numerical experiments illustrated that the proposed method performed better than the generic MLP network in terms of root mean squared error (RMSE) and mean absolute error (MAE).
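
A minimal sketch of the pseudo-Huber loss referred to above; the transition parameter delta is an illustrative assumption, not the value used in the paper. Small residuals are penalized approximately quadratically and large residuals approximately linearly, which limits the influence of a few very large errors.

```python
import numpy as np

def pseudo_huber(residuals, delta=1.0):
    # delta^2 * (sqrt(1 + (r/delta)^2) - 1), averaged over the residuals.
    return np.mean(delta ** 2 * (np.sqrt(1.0 + (residuals / delta) ** 2) - 1.0))

# One large error dominates the MSE but affects the pseudo-Huber loss far less:
r = np.array([0.1, -0.2, 0.05, 5.0])
print("MSE:         ", np.mean(r ** 2))
print("pseudo-Huber:", pseudo_huber(r))
```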


2016 ◽  
Vol 5 (1) ◽  
pp. 39 ◽  
Author(s):  
Abbas Najim Salman ◽  
Maymona Ameen

This paper is concerned with a minimax shrinkage estimator that uses a double-stage shrinkage technique to lower the mean squared error when estimating the shape parameter (α) of the generalized Rayleigh distribution in a region (R) around available prior knowledge (α₀) about the actual value of α, taken as an initial estimate, in the case where the scale parameter (λ) is known.

In situations where experimentation is time-consuming or very costly, a double-stage procedure can be used to reduce the expected sample size needed to obtain the estimator.

The proposed estimator is shown to have a smaller mean squared error for certain choices of the shrinkage weight factor ψ(·) and a suitable region R.

Expressions for the bias, mean squared error (MSE), expected sample size E(n | α, R), expected sample size proportion E(n | α, R)/n, probability of avoiding the second sample, and percentage of overall sample saved are derived for the proposed estimator.

Numerical results and conclusions for the expressions above are presented when the considered estimator is a testimator of level of significance Δ.

Comparisons with the minimax estimator and with the most recent studies are made to show the effectiveness of the proposed estimator.
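
The double-stage shrinkage (testimator) rule described above has the following generic form; this is a schematic statement of the technique, not the paper's exact expressions.

```latex
% alpha_1-hat: first-stage estimate, alpha_2-hat: estimate after the second sample,
% psi(.): shrinkage weight factor, R: acceptance region around the prior guess alpha_0.
\[
\hat{\alpha} =
\begin{cases}
\psi(\hat{\alpha}_1)\,\hat{\alpha}_1 + \bigl(1-\psi(\hat{\alpha}_1)\bigr)\alpha_0, & \hat{\alpha}_1 \in R \quad \text{(stop after the first sample)},\\[4pt]
\hat{\alpha}_2, & \hat{\alpha}_1 \notin R \quad \text{(take the second sample)},
\end{cases}
\qquad
\mathrm{MSE}(\hat{\alpha}) = \mathbb{E}\bigl[(\hat{\alpha}-\alpha)^2\bigr].
\]
```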


2020 ◽  
Vol 2020 ◽  
pp. 1-22
Author(s):  
Byung-Kwon Son ◽  
Do-Jin An ◽  
Joon-Ho Lee

In this paper, passive localization of an emitter from noisy angle-of-arrival (AOA) measurements using the Brown DWLS (Distance Weighted Least Squares) algorithm is considered. The accuracy of AOA-based localization is quantified by the mean squared error. Various estimates of the AOA-localization algorithm have been derived previously (Doğançay and Hmam, 2008). The explicit expression for the location estimate from that study is used to obtain an analytic expression for the mean squared error (MSE) of one of these estimates. To validate the derived expression, the MSE obtained from Monte Carlo simulation is compared with the analytically derived MSE.
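
A minimal sketch of the kind of Monte Carlo check described above: simulate noisy AOA measurements, localize by plain (unweighted) least-squares triangulation, and estimate the MSE empirically. The estimator here is a generic pseudolinear one for illustration, not the Brown DWLS weighting analyzed in the paper, and the geometry and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sensors = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])  # sensor positions (m)
emitter = np.array([60.0, 80.0])                              # true emitter location (m)
sigma = np.deg2rad(1.0)                                       # AOA noise std (assumed)

def localize(bearings):
    # Each bearing defines a line through its sensor:
    # sin(theta)*(x - xs) - cos(theta)*(y - ys) = 0; stack and solve by least squares.
    A = np.column_stack([np.sin(bearings), -np.cos(bearings)])
    b = np.sin(bearings) * sensors[:, 0] - np.cos(bearings) * sensors[:, 1]
    est, *_ = np.linalg.lstsq(A, b, rcond=None)
    return est

true_bearings = np.arctan2(emitter[1] - sensors[:, 1], emitter[0] - sensors[:, 0])
squared_errors = []
for _ in range(10000):
    noisy = true_bearings + sigma * rng.standard_normal(len(sensors))
    squared_errors.append(np.sum((localize(noisy) - emitter) ** 2))
print("empirical MSE:", np.mean(squared_errors))  # to be compared with the analytic MSE
```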


2009 ◽  
Vol 106 (3) ◽  
pp. 975-983 ◽  
Author(s):  
Mark Burnley

To determine whether the asymptote of the torque-duration relationship (critical torque) could be estimated from the torque measured at the end of a series of maximal voluntary contractions (MVCs) of the quadriceps, eight healthy men performed eight laboratory tests. Following familiarization, subjects performed two tests in which they were required to perform 60 isometric MVCs over a period of 5 min (3 s contraction, 2 s rest), and five tests involving intermittent isometric contractions at ∼35–60% MVC, each performed to task failure. Critical torque was determined using linear regression of the torque impulse and contraction time during the submaximal tests, and the end-test torque during the MVCs was calculated from the mean of the last six contractions of the test. During the MVCs, voluntary torque declined from 263.9 ± 44.6 to 77.8 ± 17.8 N·m. The end-test torque was not different from the critical torque (77.9 ± 15.9 N·m; 95% paired-sample confidence interval, −6.5 to 6.2 N·m). The root mean squared error of the estimation of critical torque from the end-test torque was 7.1 N·m. Twitch interpolation showed that voluntary activation declined from 90.9 ± 6.5% to 66.9 ± 13.1% (P < 0.001), and the potentiated doublet response declined from 97.7 ± 23.0 to 46.9 ± 6.7 N·m (P < 0.001) during the MVCs, indicating the development of both central and peripheral fatigue. These data indicate that fatigue during 5 min of intermittent isometric MVCs of the quadriceps leads to an end-test torque that closely approximates the critical torque.
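
A minimal sketch of the two torque calculations described above: critical torque as the slope of a linear regression of torque impulse on contraction time across the submaximal tests, and end-test torque as the mean of the last six MVCs. All numbers are illustrative placeholders, not data from the study.

```python
import numpy as np

# Submaximal tests: total torque impulse (N·m·s) vs. time to task failure (s).
contraction_time = np.array([120.0, 180.0, 240.0, 330.0, 480.0])
torque_impulse = np.array([17.4e3, 22.0e3, 26.8e3, 33.7e3, 45.4e3])

slope, intercept = np.polyfit(contraction_time, torque_impulse, 1)
print("critical torque (slope):", round(slope, 1), "N·m")

# MVC protocol: end-test torque is the mean of the last six contractions.
last_six = np.array([79.0, 77.5, 78.2, 76.9, 78.8, 77.3])
print("end-test torque:", round(last_six.mean(), 1), "N·m")
```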


Author(s):  
MOULOUD ADEL ◽  
DANIEL ZUWALA ◽  
MONIQUE RASIGNI ◽  
SALAH BOURENNANE

A noise reduction scheme for digitized mammographic phantom images is presented. The algorithm is based on a direct contrast modification method with an optimal function, obtained by using the mean squared error as a criterion. Computer-simulated images containing objects similar to those observed in the phantom are built to evaluate the performance of the algorithm. Noise reduction results obtained on both simulated and real phantom images show that the developed method may be considered a good preprocessing step toward automating phantom film evaluation by means of image processing.
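
A schematic sketch of using the mean squared error as the selection criterion for a contrast-modification function, in the spirit of the approach above; the power-law transform and the parameter grid are illustrative assumptions, not the paper's optimal function.

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def contrast_transform(image, gamma):
    # Simple power-law contrast modification on intensities in [0, 1].
    return np.clip(image, 0.0, 1.0) ** gamma

def best_parameter(noisy, reference, gammas):
    # Pick the parameter whose output is closest (in MSE) to the clean simulated reference.
    return min(gammas, key=lambda g: mse(contrast_transform(noisy, g), reference))

# Example on a synthetic "phantom-like" image with additive noise:
rng = np.random.default_rng(0)
reference = np.linspace(0.0, 1.0, 64).reshape(8, 8)
noisy = np.clip(reference + 0.05 * rng.standard_normal(reference.shape), 0.0, 1.0)
print(best_parameter(noisy, reference, gammas=[0.6, 0.8, 1.0, 1.2, 1.4]))
```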


2010 ◽  
Vol 1 (4) ◽  
pp. 17-45
Author(s):  
Antons Rebguns ◽  
Diana F. Spears ◽  
Richard Anderson-Sprecher ◽  
Aleksey Kletsov

This paper presents a novel theoretical framework for swarms of agents. Before deploying a swarm for a task, it is advantageous to predict whether a desired percentage of the swarm will succeed. The authors present a framework that uses a small group of expendable “scout” agents to predict the success probability of the entire swarm, thereby preventing many agent losses. The scouts apply one of two formulas for prediction: the standard Bernoulli trials formula or the new Bayesian formula. For experimental evaluation, the framework is applied to simulated agents navigating around obstacles to reach a goal location. Extensive experimental results compare the mean squared error of the predictions of both formulas against ground truth under varying circumstances. The results indicate the accuracy and robustness of the Bayesian approach. The framework also yields an intriguing result, namely that both formulas usually predict better in the presence of (Lennard-Jones) inter-agent forces than when their independence assumptions hold.
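
A minimal sketch contrasting the two styles of prediction described above: a plug-in Bernoulli/binomial estimate versus a Bayesian (Beta-Binomial) predictive with a uniform prior. These are generic versions of the two approaches, not the authors' exact formulas; k of n scouts succeed and the swarm of N agents needs at least m successes.

```python
from scipy import stats

def bernoulli_prediction(k, n, N, m):
    # Plug-in estimate: p_hat = k/n, then P(at least m of N succeed) under Binomial(N, p_hat).
    p_hat = k / n
    return stats.binom.sf(m - 1, N, p_hat)

def bayesian_prediction(k, n, N, m):
    # Uniform prior -> Beta(1 + k, 1 + n - k) posterior -> Beta-Binomial predictive for the swarm.
    return stats.betabinom.sf(m - 1, N, 1 + k, 1 + n - k)

print(bernoulli_prediction(k=4, n=5, N=50, m=40))
print(bayesian_prediction(k=4, n=5, N=50, m=40))
```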


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Jayaraman J. Thiagarajan ◽  
Bindya Venkatesh ◽  
Rushil Anirudh ◽  
Peer-Timo Bremer ◽  
Jim Gaffney ◽  
...  

Predictive models that accurately emulate complex scientific processes can achieve speed-ups over numerical simulators or experiments and at the same time provide surrogates for improving the subsequent analysis. Consequently, there is a recent surge in utilizing modern machine learning methods to build data-driven emulators. In this work, we study an often overlooked, yet important, problem of choosing loss functions while designing such emulators. Popular choices such as the mean squared error or the mean absolute error are based on a symmetric noise assumption and can be unsuitable for heterogeneous data or asymmetric noise distributions. We propose Learn-by-Calibrating, a novel deep learning approach based on interval calibration for designing emulators that can effectively recover the inherent noise structure without any explicit priors. Using a large suite of use cases, we demonstrate the efficacy of our approach in providing high-quality emulators, when compared to widely adopted loss function choices, even in small-data regimes.
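
A small illustration of the symmetric-noise point made above: under skewed (here exponential) noise, the MSE-optimal constant prediction is the mean of the observations and the MAE-optimal one is the median, and both are pulled away from the noise-free target by different amounts. This is purely didactic and unrelated to the Learn-by-Calibrating implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 2.0
noise = rng.exponential(scale=1.0, size=100_000)  # asymmetric noise with mean 1
observations = true_value + noise

print("MSE-optimal constant (mean):  ", observations.mean())      # about 3.0
print("MAE-optimal constant (median):", np.median(observations))  # about 2.69
print("noise-free target:            ", true_value)
```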

