Null hypothesis significance testing and effect sizes: can we ‘effect’ everything … or … anything?

2020 ◽  
Vol 51 ◽  
pp. 68-77 ◽  
Author(s):  
David P Lovell


2009 ◽
Vol 217 (1) ◽  
pp. 15-26 ◽  
Author(s):  
Geoff Cumming ◽  
Fiona Fidler

Most questions across science call for quantitative answers: ideally, a single best estimate plus information about the precision of that estimate. A confidence interval (CI) expresses both efficiently. Early experimental psychologists sought quantitative answers, but for the last half century psychology has been dominated by the nonquantitative, dichotomous thinking of null hypothesis significance testing (NHST). The authors argue that psychology should rejoin mainstream science by asking better questions – those that demand quantitative answers – and using CIs to answer them. They explain CIs and a range of ways to think about them and use them to interpret data, especially by considering CIs as prediction intervals, which provide information about replication. They explain how to calculate CIs on means, proportions, correlations, and standardized effect sizes, and illustrate symmetric and asymmetric CIs. They also argue that information provided by CIs is more useful than that provided by p values, or by values of Killeen’s p_rep, the probability of replication.
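As an illustrative sketch of the estimation approach the abstract describes, the following computes a symmetric 95% CI on a mean and an asymmetric CI on a correlation via the Fisher z transform. The data, the r value, and the sample sizes are hypothetical; the intervals are large-sample (z-based), so for small samples a t critical value would replace z for the mean.

```python
from statistics import NormalDist, mean, stdev
from math import sqrt, tanh, atanh

def mean_ci(xs, level=0.95):
    """Large-sample (z-based) CI on a mean; symmetric about the estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    m, se = mean(xs), stdev(xs) / sqrt(len(xs))
    return m - z * se, m + z * se

def correlation_ci(r, n, level=0.95):
    """CI on a Pearson r via Fisher's z transform; asymmetric on the r scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    zr, se = atanh(r), 1 / sqrt(n - 3)
    return tanh(zr - z * se), tanh(zr + z * se)

scores = [101, 97, 104, 99, 102, 95, 103, 100, 98, 106]
lo, hi = mean_ci(scores)
print(f"mean CI: [{lo:.2f}, {hi:.2f}]")

lo_r, hi_r = correlation_ci(0.40, 50)
print(f"r CI: [{lo_r:.2f}, {hi_r:.2f}]")  # the lower arm is longer: asymmetric
```

The correlation interval shows the asymmetry the authors illustrate: back-transforming from the z scale stretches the arm nearer zero more than the arm nearer ±1.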


2015 ◽  
Vol 37 (4) ◽  
pp. 449-461 ◽  
Author(s):  
Andreas Ivarsson ◽  
Mark B. Andersen ◽  
Andreas Stenling ◽  
Urban Johnson ◽  
Magnus Lindwall

Null hypothesis significance testing (NHST) is like an immortal horse that some researchers have been trying to beat to death for over 50 years but without any success. In this article we discuss the flaws in NHST, the historical background in relation to both Fisher’s and Neyman and Pearson’s statistical ideas, the common misunderstandings of what p < .05 actually means, and the 2010 APA publication manual’s clear, but most often ignored, instructions to report effect sizes and to interpret what they all mean in the real world. In addition, we discuss how Bayesian statistics can be used to overcome some of the problems with NHST. We then analyze quantitative articles published over the past three years (2012–2014) in two top-rated sport and exercise psychology journals to determine whether we have learned what we should have learned decades ago about our use and meaningful interpretations of statistics.


Author(s):  
Freddy A. Paniagua

Ferguson (2015) observed that the proportion of studies supporting the experimental hypothesis and rejecting the null hypothesis is very high. This paper argues that the reason for this scenario is that researchers in the behavioral sciences have learned that the null hypothesis can always be rejected if one knows the statistical tricks to reject it (e.g., the probability of rejecting the null hypothesis increases with p = 0.05 compared to p = 0.01). Examples of the advancement of science without the need to formulate the null hypothesis are also discussed, as well as alternatives to null hypothesis significance testing (NHST), such as effect sizes, and the importance of distinguishing the statistical significance from the practical significance of results.


2010 ◽  
Vol 3 (2) ◽  
pp. 106-112 ◽  
Author(s):  
Matthew J. Rinella ◽  
Jeremy J. James

Null hypothesis significance testing (NHST) forms the backbone of statistical inference in invasive plant science. Over 95% of research articles in Invasive Plant Science and Management report NHST results such as P-values or statistics closely related to P-values such as least significant differences. Unfortunately, NHST results are less informative than their ubiquity implies. P-values are hard to interpret and are regularly misinterpreted. Also, P-values do not provide estimates of the magnitudes and uncertainties of studied effects, and these effect size estimates are what invasive plant scientists care about most. In this paper, we reanalyze four datasets (two of our own and two of our colleagues’; studies put forth as examples in this paper are used with permission of their authors) to illustrate limitations of NHST. The re-analyses are used to build a case for confidence intervals as preferable alternatives to P-values. Confidence intervals indicate effect sizes, and compared to P-values, confidence intervals provide more complete, intuitively appealing information on what data do/do not indicate.
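The contrast the abstract draws can be sketched as follows: instead of reporting only a P-value for a difference between invaded and control plots, report the estimated difference with a CI, which conveys both magnitude and precision. The plant-cover data are hypothetical, and the interval is a large-sample z-based sketch (a t critical value would be used in practice with groups this small).

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

def diff_ci(a, b, level=0.95):
    """Large-sample CI on the difference of two group means (z-based sketch)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    d = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return d, d - z * se, d + z * se

# Hypothetical percent cover of a native species in invaded vs. control plots
invaded = [12.1, 9.8, 11.4, 10.9, 13.0, 10.2, 11.7, 12.4]
control = [8.9, 9.5, 7.8, 10.1, 8.4, 9.9, 8.1, 9.2]

d, lo, hi = diff_ci(invaded, control)
print(f"difference = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

A reader sees at a glance how large the effect is and how precisely it is estimated; a bare P-value would convey neither.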


2021 ◽  
pp. 174569162097055
Author(s):  
Nick J. Broers

One particular weakness of psychology that was left implicit by Meehl (1978) is the fact that psychological theories tend to be verbal theories, permitting at best ordinal predictions. Such predictions do not enable the high-risk tests that would strengthen our belief in the verisimilitude of theories but instead lead to the practice of null-hypothesis significance testing, a practice Meehl believed to be a major reason for the slow theoretical progress of soft psychology. The rising popularity of meta-analysis has led some to argue that we should move away from significance testing and focus on the size and stability of effects instead. Proponents of this reform assume that a greater emphasis on quantity can help psychology to develop a cumulative body of knowledge. The crucial question in this endeavor is whether the resulting numbers really have theoretical meaning. Psychological science lacks an undisputed, preexisting domain of observations analogous to the observations in the space-time continuum in physics. It is argued that, for this reason, effect sizes do not really exist independently of the adopted research design that led to their manifestation. Consequently, they can have no bearing on the verisimilitude of a theory.


2019 ◽  
Author(s):  
Felipe Romero ◽  
Jan Sprenger

The enduring replication crisis in many scientific disciplines casts doubt on the ability of science to self-correct its findings and to produce reliable knowledge. Amongst a variety of possible methodological, social, and statistical reforms to address the crisis, we focus on replacing null hypothesis significance testing (NHST) with Bayesian inference. On the basis of a simulation study for meta-analytic aggregation of effect sizes, we study the relative advantages of this Bayesian reform, and its interaction with widespread limitations in experimental research. Moving to Bayesian statistics will not solve the replication crisis single-handedly, but it would eliminate important sources of effect size overestimation for the conditions we study.


2021 ◽  
pp. 104973152110082
Author(s):  
Daniel J. Dunleavy ◽  
Jeffrey R. Lacasse

In this article, we offer a primer on “classical” frequentist statistics. In doing so, we aim to (1) provide social workers with a nuanced overview of common statistical concepts and tools, (2) clarify ways in which these ideas have oft been misused or misinterpreted in research and practice, and (3) help social workers better understand what frequentist statistics can and cannot offer. We begin broadly, starting with foundational issues in the philosophy of statistics. Then, we outline the Fisherian and Neyman–Pearson approaches to statistical inference and the practice of null hypothesis significance testing. We then discuss key statistical concepts including α, power, p values, effect sizes, and confidence intervals, exploring several common misconceptions about their use and interpretation. We close by considering some limitations of frequentist statistics and by offering an opinionated discussion on how social workers may promote more fruitful, responsible, and thoughtful statistical practice.


2017 ◽  
Vol 19 (1) ◽  
pp. 70-80 ◽  
Author(s):  
Michael Perdices

There has been controversy over Null Hypothesis Significance Testing (NHST) since the first quarter of the 20th century, and misconceptions about it still abound. The first section of this paper briefly discusses some of the problems and limitations of NHST. Overwhelmingly, the ‘holy grail’ of researchers has been to obtain significant p-values. In 1999 the American Psychological Association (APA) recommended that if NHST was used in data analysis, then researchers should report effect sizes (ESs) and their confidence intervals (CIs) as well as p-values. The APA recommendations are summarised in the next section of the paper. But for neuropsychological rehabilitation clinicians, the primary interest is (or should be) to determine whether or not the effect of an intervention is clinically important, not just statistically significant. In this context, ESs and their CIs provide information relevant to clinicians. The next section of the paper reviews common ESs, and worked examples are provided for the calculation of three commonly used ESs (Cohen’s d, Hedges’ g and Glass’s delta). Web-based resources for calculating other ESs and their CIs are also reviewed.
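A minimal sketch of the three effect sizes the paper works through, using standard textbook formulas and hypothetical post-intervention scores (these are not the paper’s own worked examples):

```python
from statistics import mean, stdev
from math import sqrt

def cohens_d(treat, ctrl):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treat), len(ctrl)
    s_pooled = sqrt(((n1 - 1) * stdev(treat) ** 2 + (n2 - 1) * stdev(ctrl) ** 2)
                    / (n1 + n2 - 2))
    return (mean(treat) - mean(ctrl)) / s_pooled

def hedges_g(treat, ctrl):
    """Hedges' g: Cohen's d with the small-sample bias correction."""
    df = len(treat) + len(ctrl) - 2
    return cohens_d(treat, ctrl) * (1 - 3 / (4 * df - 1))

def glass_delta(treat, ctrl):
    """Glass's delta: mean difference scaled by the control-group SD only."""
    return (mean(treat) - mean(ctrl)) / stdev(ctrl)

post_treatment = [24, 27, 22, 30, 26, 25, 28, 23]  # hypothetical scores
post_control = [20, 22, 19, 24, 21, 18, 23, 20]

print(f"d = {cohens_d(post_treatment, post_control):.2f}")
print(f"g = {hedges_g(post_treatment, post_control):.2f}")
print(f"delta = {glass_delta(post_treatment, post_control):.2f}")
```

Note the design choices the three measures embody: d pools both groups’ variability, g shrinks d slightly to correct small-sample bias, and Glass’s delta uses only the control SD, which is preferred when an intervention may change the variance of the treated group.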



1999 ◽  
Vol 8 (5) ◽  
pp. 291-296 ◽  
Author(s):  
DN Glaser

The current debate about the merits of null hypothesis significance testing, even though provocative, is not particularly novel. The significance testing approach has had defenders and opponents for decades, especially within the social sciences, where reliance on the use of significance testing has historically been heavy. The primary concerns have been (1) the misuse of significance testing, (2) the misinterpretation of P values, and (3) the lack of accompanying statistics, such as effect sizes and confidence intervals, that would provide a broader picture of the researcher's data analysis and interpretation. This article presents the current thinking, both in favor and against, on significance testing, the virtually unanimous support for reporting effect sizes alongside P values, and the overall implications for practice and application.

