Abstract
Objective
To quantitatively map how non-significant outcomes are reported in
randomised controlled trials (RCTs) over the last thirty years.
Design
Quantitative analysis of the English full texts of 567,758 RCTs recorded in
PubMed (81.5% of all published RCTs).
Methods
We searched the full texts of 567,758 RCTs published between 1990 and 2020
for exact matches of 505 pre-defined phrases denoting results that do not
reach formal statistical significance (P < 0.05). Phrase data were modelled
with Bayesian linear regression, and evidence for temporal change was
obtained through Bayes-factor analysis. In a randomly sampled subset, the
associated P values were manually extracted.
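The phrase-prevalence step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the example phrases are a small subset of the 505 used in the study, and the Wilson score interval is an assumption, since the paper does not state how its confidence intervals were computed.

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a proportion k/n (assumed CI method)."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# A few of the study's phrases; the full list contains 505 entries.
PHRASES = ["marginally significant", "a nonsignificant trend", "a strong trend"]

def count_prevalence(texts, phrases=PHRASES):
    """Count texts containing at least one exact (case-insensitive) phrase
    match, and return the count with a 95% CI on the proportion."""
    hits = sum(1 for t in texts if any(p in t.lower() for p in phrases))
    lo, hi = wilson_ci(hits, len(texts))
    return hits, lo, hi
```

Applied to the full corpus, the proportion of texts with at least one hit corresponds to the per-RCT prevalence reported in the Results.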
Results
We identified 61,741 phrases indicating close-to-significant results in
49,134 (8.65%; 95% confidence interval (CI): 8.58–8.73) RCTs. The overall
prevalence of these phrases remained stable over time, with the most
prevalent phrases being ‘marginally significant’ (in 7,735 RCTs), ‘all but
significant’ (7,015), ‘a nonsignificant trend’ (3,442), ‘failed to reach
statistical significance’ (2,578) and ‘a strong trend’ (1,700). The
strongest evidence for a temporal prevalence increase was found for ‘a
numerical trend’, ‘a positive trend’, ‘an increasing trend’ and ‘nominally
significant’. The phrases ‘all but significant’, ‘approaches statistical
significance’, ‘did not quite reach statistical significance’, ‘difference
was apparent’, ‘failed to reach statistical significance’ and ‘not quite
significant’ decreased over time. In the randomly sampled subset, 68.1%
(CI: 67.3–69.0) of the 11,926 identified P values lay between 0.05 and 0.15
(median 0.06).
Conclusions
Our results demonstrate that phrases describing marginally significant
results are regularly used in RCTs to report P values close to but above the
dominant 0.05 cut-off. The phrase prevalence remained stable over time,
despite all efforts to change the focus from P < 0.05 to reporting effect
sizes and corresponding confidence intervals. To improve transparency and
enhance responsible interpretation of RCT results, researchers, clinicians,
reviewers, and editors need to abandon the focus on formal statistical
significance thresholds and promote the reporting of exact P values with
corresponding effect sizes and confidence intervals.
Significance statement
The power of language to shape how readers interpret biomedical results
should not be underestimated. Misreporting and
misinterpretation are urgent problems in RCT output. This may be at least
partially related to the statistical paradigm of the 0.05 significance
threshold. Clinical researchers sometimes resort to inventive strategies –
describing their results as ‘almost significant’ – to get their data
published. This phrasing may convince
readers of the value of their work. Since 2005 there has been increasing
concern that most current published research findings are false, and it has
been generally advised to switch from null hypothesis significance testing
to effect sizes, estimation, and cumulation of evidence. Whether this ‘new
statistics’ approach has worked out should be reflected in the phrases
describing non-significant results of RCTs, in particular in changing
patterns of P values just above 0.05.
We searched over half a million published RCTs for more than five hundred
phrases potentially suited to reporting or discussing non-significant results.
A stable overall prevalence of these phrases (10.87%, CI: 10.79–10.96; N:
61,741), with associated P values close to 0.05, was found in the last three
decades, with strong increases or decreases in individual phrases describing
these near-significant results. The pressure to pass the scientific
peer-review barrier may incentivise the use of effective phrases to mask
non-significant results in RCTs. However, this keeps researchers
preoccupied with hypothesis testing rather than presenting outcome
estimations with uncertainty. The effect of language on getting RCT results
published should ideally be minimal, to steer evidence-based medicine away
from overselling of research results and unsubstantiated claims about
efficacy, and to prevent over-reliance on P value cut-offs.
Our exhaustive search suggests that presenting RCT findings remains a
struggle when P values approach the carved-in-stone threshold of
0.05.