Beyond reporting statistical significance: Identifying informative effect sizes to improve scientific communication

2019 ◽  
Vol 28 (4) ◽  
pp. 468-485 ◽  
Author(s):  
Paul HP Hanel ◽  
David MA Mehler

Transparent communication of research is key to fostering understanding within and beyond the scientific community. An increased focus on reporting effect sizes in addition to p value-based significance statements or Bayes Factors may improve scientific communication with the general public. Across three studies (N = 652), we compared subjective informativeness ratings for five effect sizes, the Bayes Factor, and commonly used significance statements. Results showed that Cohen's U3 was rated as most informative. For example, 440 participants (69%) found U3 more informative than Cohen's d, while 95 (15%) found d more informative than U3, and 99 participants (16%) found both effect sizes equally informative. This effect was not moderated by level of education. We therefore suggest that, in general, Cohen's U3 be used when scientific findings are communicated. However, the choice of effect size may vary depending on what a researcher wants to highlight (e.g. differences or similarities).
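For readers unfamiliar with Cohen's U3, the sketch below (a minimal Python illustration, not part of the study) shows its relationship to Cohen's d under the usual normality assumption: U3 = Φ(d), the proportion of the treatment group scoring above the control-group mean.

```python
# Minimal sketch (not from the article): Cohen's d and Cohen's U3
# for two independent groups, assuming roughly normal scores.
import numpy as np
from scipy.stats import norm

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled standard deviation."""
    nt, nc = len(treatment), len(control)
    pooled_var = ((nt - 1) * np.var(treatment, ddof=1) +
                  (nc - 1) * np.var(control, ddof=1)) / (nt + nc - 2)
    return (np.mean(treatment) - np.mean(control)) / np.sqrt(pooled_var)

def cohens_u3(d):
    """Proportion of the treatment group scoring above the control-group mean."""
    return norm.cdf(d)

rng = np.random.default_rng(0)
treat = rng.normal(0.5, 1.0, 200)   # simulated treatment scores
ctrl = rng.normal(0.0, 1.0, 200)    # simulated control scores
d = cohens_d(treat, ctrl)
print(f"d = {d:.2f}, U3 = {cohens_u3(d):.0%}")  # e.g. d around 0.5, U3 around 69%
```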

2018 ◽  
Author(s):  
Paul H. P. Hanel ◽  
David Marc Anton Mehler

Transparent communication of research is key to fostering understanding within and beyond the scientific community. An increased focus on reporting effect sizes in addition to p-value-based significance statements may improve scientific communication with the general public. Across two studies (N = 446), we compared informativeness ratings for five effect sizes, the Bayes Factor, and commonly used significance statements. Results showed that Cohen's U3 was rated as most informative. For example, 77% of participants found it more informative than Cohen's d. We therefore suggest that Cohen's U3 be used when scientific findings are communicated.


2021 ◽  
Author(s):  
Willem M Otte ◽  
Christiaan H Vinkers ◽  
Philippe Habets ◽  
David G P van IJzendoorn ◽  
Joeri K Tijdink

Abstract Objective: To quantitatively map how non-significant outcomes have been reported in randomised controlled trials (RCTs) over the last thirty years. Design: Quantitative analysis of the English full texts of 567,758 RCTs recorded in PubMed (81.5% of all published RCTs). Methods: We determined the exact presence of 505 pre-defined phrases denoting results that do not reach formal statistical significance (P < 0.05) in 567,758 RCT full texts published between 1990 and 2020. Phrase data were modeled with Bayesian linear regression, and evidence for temporal change was obtained through Bayes factor analysis. In a randomly sampled subset, the associated P values were manually extracted. Results: We identified 61,741 phrases indicating close-to-significant results in 49,134 RCTs (8.65%; 95% confidence interval (CI): 8.58–8.73). The overall prevalence of these phrases remained stable over time, with the most prevalent phrases being 'marginally significant' (in 7,735 RCTs), 'all but significant' (7,015), 'a nonsignificant trend' (3,442), 'failed to reach statistical significance' (2,578) and 'a strong trend' (1,700). The strongest evidence for a temporal increase in prevalence was found for 'a numerical trend', 'a positive trend', 'an increasing trend' and 'nominally significant'. The phrases 'all but significant', 'approaches statistical significance', 'did not quite reach statistical significance', 'difference was apparent', 'failed to reach statistical significance' and 'not quite significant' decreased over time. In the randomly sampled subset, the 11,926 identified P values ranged between 0.05 and 0.15 (68.1%; CI: 67.3–69.0; median 0.06). Conclusions: Our results demonstrate that phrases describing marginally significant results are regularly used in RCTs to report P values close to but above the dominant 0.05 cut-off. Their prevalence remained stable over time, despite all efforts to shift the focus from P < 0.05 to reporting effect sizes and corresponding confidence intervals. To improve transparency and enhance responsible interpretation of RCT results, researchers, clinicians, reviewers, and editors need to abandon the focus on formal statistical significance thresholds and stimulate the reporting of exact P values with corresponding effect sizes and confidence intervals.
Significance statement: The power of language to shape how readers interpret biomedical results should not be underestimated. Misreporting and misinterpretation are urgent problems in RCT output, and they may be at least partially related to the statistical paradigm of the 0.05 significance threshold. Clinical researchers may resort to creative phrasing, describing their results as 'almost significant', to convince readers of the value of their work and get their data published. Since 2005 there has been increasing concern that most published research findings are false, and researchers have generally been advised to switch from null hypothesis significance testing to effect sizes, estimation, and cumulation of evidence. Whether this 'new statistics' approach has taken hold should be reflected in the phrases used to describe non-significant RCT results, in particular in changing patterns of phrases describing P values just above 0.05. We searched more than five hundred phrases potentially suited to report or discuss non-significant results in over half a million published RCTs. The overall prevalence of these phrases (10.87%, CI: 10.79–10.96; N: 61,741), with associated P values close to 0.05, was stable over the last three decades, with strong increases or decreases in individual phrases describing near-significant results. The pressure to pass the peer-review barrier may function as an incentive to use effective phrases that mask non-significant results in RCTs; however, this keeps researchers preoccupied with hypothesis testing rather than presenting outcome estimates with their uncertainty. The effect of language on getting RCT results published should ideally be minimal, to steer evidence-based medicine away from overselling research results and unsubstantiated efficacy claims, and to prevent over-reliance on P value cut-offs. Our exhaustive search suggests that presenting RCT findings remains a struggle when P values approach the carved-in-stone threshold of 0.05.
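As an illustration of how such phrase counting over full texts can work (a minimal sketch, not the authors' pipeline; the phrase list and plain-text corpus layout are illustrative assumptions):

```python
# Minimal sketch (not the authors' pipeline): count how many articles contain
# each pre-defined phrase describing near-significant results.
# The phrase list and the plain-text file layout are illustrative assumptions.
import re
from collections import Counter
from pathlib import Path

PHRASES = [
    "marginally significant",
    "all but significant",
    "a nonsignificant trend",
    "failed to reach statistical significance",
    "a strong trend",
]
# One compiled pattern per phrase, case-insensitive, whole-word matches.
PATTERNS = {p: re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE) for p in PHRASES}

def count_phrases(corpus_dir: str) -> Counter:
    """Return, for each phrase, the number of articles containing it at least once."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for phrase, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[phrase] += 1
    return counts

if __name__ == "__main__":
    print(count_phrases("rct_fulltexts"))  # hypothetical corpus directory
```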


1998 ◽  
Vol 21 (2) ◽  
pp. 221-222
Author(s):  
Louis G. Tassinary

Chow (1996) offers a reconceptualization of statistical significance that is reasoned and comprehensive. Despite a somewhat rough presentation, his arguments are compelling and deserve to be taken seriously by the scientific community. It is argued that his characterizations of literal replication, types of research, effect size, and experimental control are in need of revision.


2021 ◽  
Author(s):  
Neil McLatchie ◽  
Manuela Thomae

Thomae and Viki (2013) reported that increased exposure to sexist humour can increase rape proclivity among males, specifically those who score high on measures of Hostile Sexism. Here we report two pre-registered direct replications (N = 530) of Study 2 from Thomae and Viki (2013) and assess replicability via (i) statistical significance, (ii) Bayes factors, (iii) the small-telescope approach, and (iv) an internal meta-analysis across the original and replication studies. The original results were not supported by any of these approaches. Combining the original study and the replications yielded moderate evidence in support of the null over the alternative hypothesis, with a Bayes factor of B = 0.13. In light of the combined evidence, we encourage researchers to exercise caution before claiming that brief exposure to sexist humour increases males' proclivity towards rape, until further pre-registered and open research demonstrates that the effect is reliably reproducible.
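For readers unfamiliar with the small-telescope approach, the sketch below illustrates the idea with hypothetical numbers (the sample sizes and confidence interval are assumptions, not values from the studies): a replication counts against the original finding if it can rule out even the effect size the original design had only 33% power to detect.

```python
# Minimal sketch (illustrative numbers, not the authors' analysis code) of the
# "small telescopes" replication criterion.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Effect size (Cohen's d) detectable with 33% power at the original study's
# assumed per-group sample size of 40.
d33 = analysis.solve_power(effect_size=None, nobs1=40, alpha=0.05, power=0.33)
print(f"d detectable with 33% power in the original design: {d33:.2f}")

# If the upper bound of the replication's 95% CI for d falls below d33,
# the replication suggests the true effect is too small for the original
# study to have detected it (a small-telescope failure to replicate).
replication_ci = (-0.10, 0.15)  # hypothetical replication CI for d
print("Replication rules out a detectable effect:", replication_ci[1] < d33)
```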


Author(s):  
H. S. Styn ◽  
S. M. Ellis

Determining the significance of differences in means and of relationships between variables is important in many empirical studies. Usually only statistical significance is reported, which does not necessarily indicate an important (practically significant) difference or relationship. In studies based on probability samples, effect size indices should be reported in addition to statistical significance tests in order to comment on practical significance. Where complete populations or convenience samples are used, the determination of statistical significance is, strictly speaking, no longer relevant, while effect size indices can still be used as a basis for judging significance. This article focuses on the use of effect size indices to establish practical significance. It also shows how these indices are utilized in a few fields of statistical application and how they receive attention in the statistical literature and in computer packages. The use of effect sizes is illustrated with a few examples from the research literature.
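As a minimal illustration (not drawn from the article) of judging practical significance with an effect size index for a relationship between variables, the sketch below computes Pearson's r on simulated data and labels it against Cohen's conventional benchmarks of 0.1, 0.3, and 0.5.

```python
# Minimal sketch (not from the article): practical significance of a
# relationship via Pearson's r, using Cohen's conventional benchmarks.
import numpy as np
from scipy.stats import pearsonr

def practical_significance(r: float) -> str:
    """Label |r| against Cohen's guidelines (0.1 small, 0.3 medium, 0.5 large)."""
    r = abs(r)
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "medium"
    if r >= 0.1:
        return "small"
    return "negligible"

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.35 * x + rng.normal(size=1000)   # simulated related variable
r, p = pearsonr(x, y)
print(f"r = {r:.2f} (p = {p:.3g}): {practical_significance(r)} effect")
```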


2020 ◽  
Author(s):  
Zoltan Dienes

Obtaining evidence that something does not exist requires knowing how big it would be were it to exist. Testing a theory that predicts an effect thus entails specifying the range of effect sizes consistent with the theory, in order to know when the evidence counts against the theory. Indeed, a theoretically relevant effect size must be specified for power calculations, equivalence testing, and Bayes factors in order that the inferential statistics test the theory. Specifying relevant effect sizes for power, or the equivalence region for equivalence testing, or the scale factor for Bayes factors, is necessary for many journal formats, such as registered reports, and should be necessary for all articles that use hypothesis testing. Yet there is little systematic advice on how to approach this problem. This article offers some principles and practical advice for specifying theoretically relevant effect sizes for hypothesis testing.
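As one concrete illustration of the point (a sketch with assumed numbers, not an example from the article), specifying the smallest theoretically relevant effect size immediately pins down the rest of the design: the sample size in a power analysis, the equivalence region, or the prior scale for a Bayes factor.

```python
# Minimal sketch (illustrative numbers, not from the article): once a
# theoretically relevant effect size is specified, it drives the design.
from statsmodels.stats.power import TTestIndPower

d_relevant = 0.4   # smallest effect size the theory plausibly predicts (assumed)
n_per_group = TTestIndPower().solve_power(effect_size=d_relevant,
                                          alpha=0.05, power=0.9)
print(f"n per group for 90% power at d = {d_relevant}: {n_per_group:.0f}")

# The same value can anchor an equivalence region of (-0.4, 0.4) for
# equivalence testing, or the scale of the prior on effect size when
# computing a Bayes factor.
```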


2017 ◽  
Vol 38 (5) ◽  
pp. 551-557 ◽  
Author(s):  
Hiok Yang Chan ◽  
Jerry Yongqiang Chen ◽  
Suraya Zainul-Abidin ◽  
Hao Ying ◽  
Kevin Koo ◽  
...  

Background: The American Orthopaedic Foot &amp; Ankle Society (AOFAS) score is one of the most commonly used and adapted outcome scales in hallux valgus surgery. However, the AOFAS score is predominantly physician based rather than patient based. Although statistical significance may be straightforward to derive, it may not reflect the true subjective benefit experienced by the patient. There is a paucity of literature defining the minimal clinically important difference (MCID) for the AOFAS score in hallux valgus surgery, although it could have a great impact on the accuracy of analyzing surgical outcomes. Hence, the primary aim of this study was to define the MCID for the AOFAS score in these patients, and the secondary aim was to correlate patient demographics with the MCID. Methods: We conducted a retrospective cross-sectional study. A total of 446 patients were reviewed preoperatively and followed up for 2 years. An anchor question was asked 2 years postoperation: "How would you rate the overall results of your treatment for your foot and ankle condition?" (excellent, very good, good, fair, poor, terrible). The MCID was derived using 4 methods, 3 from an anchor-based approach and 1 from a distribution-based approach. The anchor-based approaches were (1) the mean difference in 2-year AOFAS scores of patients who answered "good" versus "fair" on the anchor question; (2) the mean change in AOFAS score from preoperation to 2-year follow-up in patients who answered "good"; and (3) the receiver operating characteristic (ROC) curve method, where the area under the curve (AUC) represented the likelihood that the scoring system would accurately discriminate these 2 groups of patients. The distribution-based approach used to calculate the MCID was the effect size method. There were 405 (90.8%) females and 41 (9.2%) males. Mean age was 51.2 (standard deviation [SD] = 13) years, and mean preoperative BMI was 24.2 (SD = 4.1). Results: The mean preoperative AOFAS score was 55.6 (SD = 16.8), with significant improvement to 85.7 (SD = 14.4) at 2 years (P < .001). There were no statistical differences between the demographics or preoperative AOFAS scores of patients with good versus fair satisfaction levels. At 2 years, patients with good satisfaction had higher AOFAS scores than those with fair satisfaction (83.9 vs 78.1, P < .001) and a higher mean change (30.2 vs 22.3, P = .015). The mean change in AOFAS score in patients with good satisfaction was 30.2 (SD = 19.8). The mean difference between good and fair satisfaction was 7.9. Using ROC analysis, the cut-off point was 29.0, with an area under the curve (AUC) of 0.62. The effect size method derived an MCID of 8.4 with a moderate effect size of 0.5. Multiple linear regression demonstrated that increasing age (β = −0.129, CI = −0.245, −0.013, P = .030) and a higher preoperative AOFAS score (β = −0.874, CI = −0.644, −0.081, P < .001) significantly decreased the amount of change in the AOFAS score. Conclusion: The MCID of the AOFAS score in hallux valgus surgery ranged from 7.9 to 30.2. The MCID can help establish clinical improvement from the patient's perspective and aid in interpreting results from clinical trials and other studies. Level of Evidence: Level III, retrospective comparative series.
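The sketch below illustrates two of the approaches described above on simulated data (it is not the study's code; only the preoperative SD of 16.8 is taken from the abstract): the distribution-based effect size method (half the baseline SD) and an anchor-based ROC cut-off.

```python
# Minimal sketch (not the study's code): two common ways to derive an MCID
# from change scores and a dichotomised anchor question. Data are simulated.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
change = rng.normal(30, 20, 446)                       # simulated 2-year change in AOFAS
satisfied = (change + rng.normal(0, 15, 446)) > 20     # simulated anchor ("good" vs "fair")

# Distribution-based approach: half the SD of the baseline score (effect size method).
baseline_sd = 16.8                                     # preoperative SD reported in the abstract
mcid_distribution = 0.5 * baseline_sd
print(f"distribution-based MCID ~ {mcid_distribution:.1f}")

# Anchor-based ROC approach: the change score that best separates satisfied
# from unsatisfied patients (cut-off maximising Youden's J = TPR - FPR).
fpr, tpr, thresholds = roc_curve(satisfied, change)
mcid_roc = thresholds[np.argmax(tpr - fpr)]
print(f"ROC-based MCID ~ {mcid_roc:.1f}")
```

Note that half the reported baseline SD (0.5 × 16.8 = 8.4) reproduces the distribution-based MCID of 8.4 given in the abstract.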


1990 ◽  
Vol 24 (3) ◽  
pp. 405-415 ◽  
Author(s):  
Nathaniel McConaghy

Meta-analysis replaced statistical significance with effect size in the hope of resolving controversy concerning the evaluation of treatment effects. Statistical significance measured the reliability of the treatment effect, not its efficacy, and was strongly influenced by the number of subjects investigated. Effect size, as originally assessed, eliminated this influence, but by standardizing the size of the treatment effect it could also distort it. Meta-analyses that combine the results of studies employing different subject types, outcome measures, treatment aims, no-treatment rather than placebo controls, or therapists with varying experience can be misleading. To ensure these variables are discussed, meta-analyses should be used as an aid to, rather than a substitute for, literature review. While meta-analyses produce contradictory findings, it seems unwise to rely on the conclusions of an individual analysis. Their consistent finding that placebo treatments obtain markedly higher effect sizes than no treatment will, it is hoped, render the use of untreated control groups obsolete.
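For readers unfamiliar with the mechanics the abstract alludes to, here is a minimal sketch (illustrative numbers only, not from the article) of the basic meta-analytic step: converting each study's result into a standardized mean difference and pooling with inverse-variance weights.

```python
# Minimal sketch (illustrative numbers): fixed-effect inverse-variance pooling
# of standardized mean differences from three hypothetical studies.
import numpy as np

# (Cohen's d, per-group sample size) for three hypothetical studies.
studies = [(0.40, 30), (0.25, 80), (0.60, 20)]

effects, weights = [], []
for d, n in studies:
    # Approximate sampling variance of d for two equal groups of size n.
    var_d = (2 * n) / (n * n) + d**2 / (2 * (2 * n))
    effects.append(d)
    weights.append(1 / var_d)

pooled = np.average(effects, weights=weights)
se = 1 / np.sqrt(sum(weights))
print(f"pooled d = {pooled:.2f} "
      f"(95% CI {pooled - 1.96*se:.2f} to {pooled + 1.96*se:.2f})")
```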


2005 ◽  
Vol 77 (1) ◽  
pp. 45-76 ◽  
Author(s):  
Lee-Ann C. Hayek ◽  
W. Ronald Heyer

Several analytic techniques have been used to determine sexual dimorphism in vertebrate morphological measurement data with no emergent consensus on which technique is superior. A further confounding problem for frog data is the existence of considerable measurement error. To determine dimorphism, we examine a single hypothesis (H0: equal means) for two groups (females and males). We demonstrate that frog measurement data meet assumptions for clearly defined statistical hypothesis testing with statistical linear models rather than those of exploratory multivariate techniques such as principal components, correlation or correspondence analysis. In order to distinguish biological from statistical significance of hypotheses, we propose a new protocol that incorporates measurement error and effect size. Measurement error is evaluated with a novel measurement error index. Effect size, widely used in the behavioral sciences and in meta-analysis studies in biology, proves to be the most useful single metric to evaluate whether statistically significant results are biologically meaningful. Definitions for a range of small, medium, and large effect sizes specifically for frog measurement data are provided. Examples with measurement data for species of the frog genus Leptodactylus are presented. The new protocol is recommended not only to evaluate sexual dimorphism for frog data but for any animal measurement data for which the measurement error index and observed or a priori effect sizes can be calculated.


2019 ◽  
Vol 15 (5) ◽  
pp. 20190174 ◽  
Author(s):  
Lewis G. Halsey

The p-value has long been the figurehead of statistical analysis in biology, but its position is under threat. p is now widely recognized as providing quite limited information about our data, and as being easily misinterpreted. Many biologists are aware of p's frailties, but are less clear about how they might change the way they analyse their data in response. This article highlights and summarizes four broad statistical approaches that augment or replace the p-value, and that are relatively straightforward to apply. First, you can augment your p-value with information about how confident you are in it, how likely it is that you will get a similar p-value in a replicate study, or the probability that a statistically significant finding is in fact a false positive. Second, you can enhance the information provided by frequentist statistics with a focus on effect sizes and a quantified confidence that those effect sizes are accurate. Third, you can augment or substitute p-values with the Bayes factor to inform on the relative levels of evidence for the null and alternative hypotheses; this approach is particularly appropriate for studies where you wish to keep collecting data until clear evidence for or against your hypothesis has accrued. Finally, specifically where you are using multiple variables to predict an outcome through model building, Akaike information criteria can take the place of the p-value, providing quantified information on which model is best. Hopefully, this quick-and-easy guide to some simple yet powerful statistical options will support biologists in adopting new approaches where they feel that the p-value alone is not doing their data justice.
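As a brief illustration of the last of these options (a sketch on simulated data, not taken from the article), candidate regression models can be compared by AIC rather than by the p-values of their coefficients.

```python
# Minimal sketch (simulated data, not from the article): comparing candidate
# models with AIC instead of p-values, using statsmodels OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 + rng.normal(size=n)      # x2 is irrelevant by construction

candidates = {
    "intercept only": np.ones((n, 1)),
    "x1":             sm.add_constant(x1),
    "x1 + x2":        sm.add_constant(np.column_stack([x1, x2])),
}

for name, X in candidates.items():
    model = sm.OLS(y, X).fit()
    print(f"{name:15s} AIC = {model.aic:.1f}")
# The candidate with the lowest AIC is the best supported of the set.
```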

