Beyond p-values: Utilizing Multiple Estimates to Evaluate Evidence

2017 ◽  
Author(s):  
Kathrene D Valentine ◽  
Erin Michelle Buchanan ◽  
John E. Scofield ◽  
Marshall T. Beauchamp

Null hypothesis significance testing (NHST) is frequently cited as a threat to the validity and reproducibility of the social sciences. While many have suggested altering the *p*-value at which we deem an effect significant, we believe this suggestion is short-sighted: alternative procedures (i.e., Bayesian analyses and Observation Oriented Modeling) can be more powerful and meaningful to our discipline. However, these methodologies are less frequently utilized and are rarely discussed in combination with NHST. Herein, we compare the possible interpretations of three analyses (ANOVA, Bayes Factor, and an Ordinal Pattern Analysis) in various data environments using a simulation study. The simulation generated 20,000 unique datasets, varying sample size (*N*s of 10, 30, 100, 500, 1000) and effect size (*d*s of 0.10, 0.20, 0.50, 0.80). Through this simulation, we find that changing the threshold at which *p*-values are considered significant has little to no effect on conclusions. Further, we find that evaluating multiple estimates as evidence of an effect can allow for a more robust and nuanced report of findings. These findings suggest the need to redefine evidentiary value and reporting practices.
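The simulation design described above lends itself to a compact illustration. The sketch below is not the authors' code: it simulates one two-group dataset per (N, d) cell and computes both a t-test p-value and a BIC-approximation Bayes factor (Wagenmakers, 2007) as a stand-in for the Bayes factors the study used; the Ordinal Pattern Analysis step is omitted.

```python
# Minimal sketch (not the authors' code) of one cell of the simulation:
# draw two groups with a known standardized effect d, then compute both a
# p-value and an approximate Bayes factor so the two criteria can be compared.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def bf01_bic(t, n_total):
    """BF01 via the BIC approximation (Wagenmakers, 2007)."""
    df = n_total - 2
    r2 = t**2 / (t**2 + df)                 # t statistic -> R^2 of group effect
    delta_bic = n_total * np.log(1 - r2) + np.log(n_total)  # BIC(H1) - BIC(H0)
    return np.exp(delta_bic / 2)            # > 1 favours H0, < 1 favours H1

for n in (10, 30, 100, 500, 1000):          # per-group sample sizes
    for d in (0.10, 0.20, 0.50, 0.80):      # true standardized effect sizes
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(d, 1.0, n)
        t, p = stats.ttest_ind(treated, control)
        print(f"N={n:4d} d={d:.2f}  p={p:.4f}  BF01={bf01_bic(t, 2 * n):.2f}")
```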

2014 ◽  
Author(s):  
Joost de Winter ◽  
Dimitra Dodou

It is known that statistically significant results are more likely to be published than results that are not statistically significant. However, it is unclear whether negative results are disappearing from papers, and whether there exists a ‘hierarchy of sciences’ in which the social sciences publish more positive results than the physical sciences. Using Scopus, we searched the abstracts of papers published between 1990 and 2014 and calculated the percentage of papers reporting marginally positive results (i.e., p-values between 0.040 and 0.049) versus the percentage reporting marginally negative results (i.e., p-values between 0.051 and 0.060). The results indicate that negative results are not disappearing but have actually become 4.3 times more prevalent since 1990; positive results, however, have become 13.9 times more prevalent over the same period. We found no consistent support for a ‘hierarchy of sciences’. However, we did find large differences in reporting practices between disciplines, with p-values being reported 60.6 times more frequently in the biological sciences than in the physical sciences. We argue that the observed longitudinal trends may be driven by negative factors, such as an increase in questionable research practices, but also by positive factors, such as an increasingly quantitative research focus.
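The classification the authors performed via Scopus phrase queries can be mimicked on raw abstract text. The regex and windows below follow the thresholds named in the abstract, but the pipeline itself is an illustration, not the study's actual search.

```python
# Illustrative sketch (not the authors' Scopus pipeline): extract reported
# p-values from abstract text and classify them into the "marginally
# positive" (0.040-0.049) and "marginally negative" (0.051-0.060) windows.
import re

P_PATTERN = re.compile(r"p\s*[=<>]\s*(0?\.\d+)", re.IGNORECASE)

def classify(abstract: str) -> dict:
    hits = {"marginally_positive": 0, "marginally_negative": 0}
    for match in P_PATTERN.finditer(abstract):
        p = float(match.group(1))
        if 0.040 <= p <= 0.049:
            hits["marginally_positive"] += 1
        elif 0.051 <= p <= 0.060:
            hits["marginally_negative"] += 1
    return hits

print(classify("The effect was significant (p = 0.042), but the interaction "
               "was not (p = 0.055)."))
# {'marginally_positive': 1, 'marginally_negative': 1}
```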


2021 ◽  
Author(s):  
Willem M Otte ◽  
Christiaan H Vinkers ◽  
Philippe Habets ◽  
David G P van IJzendoorn ◽  
Joeri K Tijdink

Abstract
Objective: To quantitatively map how non-significant outcomes have been reported in randomised controlled trials (RCTs) over the last thirty years.
Design: Quantitative analysis of the English full texts of 567,758 RCTs recorded in PubMed (81.5% of all published RCTs).
Methods: We determined the exact presence of 505 pre-defined phrases denoting results that do not reach formal statistical significance (P < 0.05) in 567,758 RCT full texts published between 1990 and 2020. Phrase data were modeled with Bayesian linear regression, and evidence for temporal change was obtained through Bayes-factor analysis. In a randomly sampled subset, the associated P values were manually extracted.
Results: We identified 61,741 phrases indicating close-to-significant results in 49,134 RCTs (8.65%; 95% confidence interval (CI): 8.58–8.73). The overall prevalence of these phrases remained stable over time, with the most prevalent being ‘marginally significant’ (in 7,735 RCTs), ‘all but significant’ (7,015), ‘a nonsignificant trend’ (3,442), ‘failed to reach statistical significance’ (2,578) and ‘a strong trend’ (1,700). The strongest evidence for a temporal increase in prevalence was found for ‘a numerical trend’, ‘a positive trend’, ‘an increasing trend’ and ‘nominally significant’. The phrases ‘all but significant’, ‘approaches statistical significance’, ‘did not quite reach statistical significance’, ‘difference was apparent’, ‘failed to reach statistical significance’ and ‘not quite significant’ decreased over time. In the randomly sampled subset, the 11,926 identified P values ranged between 0.05 and 0.15 (68.1%; CI: 67.3–69.0; median 0.06).
Conclusions: Our results demonstrate that phrases describing marginally significant results are regularly used in RCTs to report P values close to but above the dominant 0.05 cut-off. Phrase prevalence remained stable over time, despite all efforts to shift the focus from P < 0.05 to reporting effect sizes and corresponding confidence intervals. To improve transparency and encourage responsible interpretation of RCT results, researchers, clinicians, reviewers, and editors need to abandon the focus on formal statistical significance thresholds and instead report exact P values with corresponding effect sizes and confidence intervals.
Significance statement: The power of language to shape how readers interpret biomedical results should not be underestimated. Misreporting and misinterpretation are urgent problems in RCT output, and may be at least partially related to the statistical paradigm of the 0.05 significance threshold. Clinical researchers sometimes resort to inventive phrasing, describing their results as ‘almost significant’, to get their data published, and such phrasing may convince readers of the value of the work. Since 2005 there has been increasing concern that most published research findings are false, and it has been widely advised to switch from null hypothesis significance testing to effect sizes, estimation, and the cumulation of evidence. Whether this ‘new statistics’ approach has worked should be reflected in the phrases used to describe non-significant RCT results, in particular in changing patterns of phrases describing P values just above 0.05. We searched for more than five hundred phrases potentially suited to report or discuss non-significant results in over half a million published RCTs. We found a stable overall prevalence of these phrases over the last three decades (10.87%, CI: 10.79–10.96; N: 61,741), with associated P values close to 0.05, alongside strong increases or decreases in individual phrases describing near-significant results. The pressure to pass the scientific peer-review barrier may act as an incentive to use effective phrases that mask non-significant results in RCTs; however, this keeps researchers preoccupied with hypothesis testing rather than with presenting outcome estimates and their uncertainty. The effect of language on getting RCT results published should ideally be minimal, to steer evidence-based medicine away from the overselling of research results and unsubstantiated efficacy claims, and to prevent over-reliance on P value cut-offs. Our exhaustive search suggests that presenting RCT findings remains a struggle when P values approach the carved-in-stone threshold of 0.05.
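As an illustration of the core counting step (not the authors' pipeline), the sketch below scans article text for a handful of the 505 pre-defined phrases and tallies, per publication year, how many articles contain at least one.

```python
# Hedged sketch: count articles per year containing at least one of a few
# near-significance phrases (5 of the 505 the study pre-defined).
from collections import Counter

PHRASES = (
    "marginally significant",
    "all but significant",
    "a nonsignificant trend",
    "failed to reach statistical significance",
    "a strong trend",
)

def contains_phrase(full_text: str) -> bool:
    text = full_text.lower()
    return any(phrase in text for phrase in PHRASES)

articles = [  # (year, text) -- toy stand-ins for parsed RCT full texts
    (1995, "The difference was marginally significant (P = 0.06)."),
    (2019, "The treatment effect failed to reach statistical significance."),
    (2019, "The primary endpoint was met (P = 0.01)."),
]
prevalence_by_year = Counter(year for year, text in articles
                             if contains_phrase(text))
print(prevalence_by_year)   # Counter({1995: 1, 2019: 1})
```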


2021 ◽  
Vol 18 (1) ◽  
pp. 73-99
Author(s):  
Petr Soukup

The P value was introduced as a measure for evaluating the results of statistical tests. The basic concept originated in the 1920s, and after the Second World War its use expanded significantly. For roughly the last three decades there has been intense discussion of the problematic features of the P value and its use in science, and voices calling for the P value to be abandoned are growing louder. In addition, alternative procedures have been suggested that could replace or supplement the P value: statisticians have tried to devise an indicator similar to the P value but without its weaknesses, and many such options exist. Besides alternatives within the classical statistical testing paradigm, the use of an alternative statistical approach, so-called Bayesian statistics, is increasingly being discussed. A moderate recommendation along these lines is to use the Bayes factor, essentially the analogue of the P value in the Bayesian world. The aim of this article is to present the Bayes factor in detail, to describe its similarities to and differences from the P value, and to discuss the possibilities for its calculation. In addition to computational procedures, a detailed discussion of the weaknesses of the Bayes factor is also included.
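A worked example (ours, not the article's) makes the P value/Bayes factor contrast concrete. For a binomial test with a uniform prior on θ under H1, both quantities have closed forms, and data that look borderline-significant by NHST can be nearly uninformative by the Bayes factor.

```python
# Worked example: an exact Bayes factor for a binomial test.
# H0: theta = 0.5 versus H1: theta ~ Beta(1, 1) (uniform prior).
from math import comb
from scipy import stats

n, k = 100, 60                                # 60 successes in 100 trials

p_value = stats.binomtest(k, n, 0.5).pvalue   # two-sided exact test

# Marginal likelihoods: under H0 the data probability is Binomial(n, 0.5);
# under H1, integrating the binomial likelihood over the uniform prior
# gives exactly 1 / (n + 1) for every k.
m0 = comb(n, k) * 0.5**n
m1 = 1 / (n + 1)
bf01 = m0 / m1                                # evidence for H0 over H1

print(f"p = {p_value:.3f}, BF01 = {bf01:.2f}")
# p ~= 0.057 (borderline by NHST standards), yet BF01 ~= 1.1: the Bayes
# factor calls the same data essentially uninformative either way.
```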


2020 ◽  
Vol 80 (ET.2020) ◽  
pp. 1-11
Author(s):  
Majid Farag Hichim

This study was conducted to estimate the stopping distance needed to halt a vehicle at different speeds. To improve the quality of the evaluation, the age and gender of the drivers, which affect reaction times (RTs), were taken into consideration. RTs were measured in a simulated driving environment and the results were statistically analyzed using the Statistical Package for the Social Sciences (SPSS). The results indicate that participants' age and gender had a significant relationship with RT (p < 0.01 and p = 0.022, respectively). An analysis of variance (ANOVA) was then conducted to compare the effects of the predictors (age, gender, and vehicle speed) on stopping distance. The overall regression model showed that these predictors had a highly significant effect on stopping distance, F(3, 46) = 777.05, p < 0.01, R² = 0.98. However, gender on its own was not significant (p > 0.24).
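For readers outside SPSS, the same model can be expressed as an ordinary least-squares regression whose summary includes the overall F-test the abstract reports. The sketch below uses hypothetical column names and toy numbers, since the study's data are not reproduced here.

```python
# Hedged sketch of the reported regression, re-expressed outside SPSS:
# stopping distance regressed on driver age, gender, and vehicle speed.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({                 # toy stand-in; not the study's data
    "distance": [12.0, 14.5, 25.0, 28.2, 47.5, 52.1, 46.0, 13.1],
    "age":      [22,   60,   25,   58,   30,   65,   41,   35],
    "gender":   ["m",  "f",  "f",  "m",  "m",  "f",  "f",  "m"],
    "speed":    [30,   30,   50,   50,   70,   70,   70,   30],
})

model = smf.ols("distance ~ age + C(gender) + speed", data=df).fit()
print(model.summary())   # overall F-test, R-squared, per-predictor p-values
```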


2020 ◽  
Vol 79 (Suppl 1) ◽  
pp. 1048.2-1048
Author(s):  
S. Herrera ◽  
J. C. Diaz-Coronado ◽  
D. Rojas-Gualdrón ◽  
L. Betancur-Vasquez ◽  
D. Gonzalez-Hurtado ◽  
...  

Background: The clinical manifestations of systemic lupus erythematosus (SLE), and their severity, vary according to age, ethnicity, and socioeconomic status. Both Hispanic and Afro-American patients have a higher incidence and a more severe presentation when compared to Caucasian patients with SLE.
Objectives: To analyze the clinical and immunological characteristics associated with time to severe renal involvement in patients with systemic lupus erythematosus in a Colombian cohort followed for one year, between January 2015 and December 2018.
Methods: A retrospective follow-up study based on clinical records. We included patients with an SLE diagnosis fulfilling either the 1987 American College of Rheumatology classification criteria or the 2011 Systemic Lupus International Collaborating Clinics (SLICC) classification criteria, together with a diagnosis of lupus nephritis according to the Wallace and Dubois criteria. Patients who did not have at least two follow-up measurements, or who had a cause of nephritis other than lupus, were excluded. The main outcome was time from diagnosis to severe renal involvement, defined as creatinine clearance ≤ 50 ml/min, 24-hour proteinuria ≥ 3.5 grams, or end-stage renal disease. We analyzed clinical and immunological characteristics. Descriptive statistics from the first evaluation are reported as frequencies and percentages for categorical variables and as medians and interquartile ranges (IQR) for quantitative variables. Age- and sex-adjusted survival functions and hazard ratios (HR) with 95% confidence intervals and p-values were estimated using parametric Weibull models for interval-censored data. P values < 0.05 were considered statistically significant.
Results: 548 patients were analyzed: 67 were left-censored as they presented renal involvement at entry, 6 were interval-censored as the outcome occurred between study visits, and 475 were right-censored as involvement was not registered during follow-up. 529 (96.5%) patients were female; the median age at entry was 46 years (IQR = 23) and the median age at diagnosis was 29.5 years (IQR = 20.6). 67% were Mestizo, 13% Caucasian, and 0.3% Afro-Colombian. The age- and sex-adjusted variables associated with time to severe lupus nephritis were high blood pressure, HR = 3.5 (95% CI 2.2–5.6; p < 0.001), and anti-Ro (per unit increase), HR = 1.002 (95% CI 1.001–1.004; p = 0.04). Figure 1 shows the age- and sex-adjusted survival function.
Conclusion: In our cohort, severe lupus nephritis appeared in fewer than 15% of patients at 10 years. Both high blood pressure and elevated anti-Ro titres were associated with a higher rate of onset of severe lupus nephritis, as has been seen with some anti-Ro polymorphisms.
Disclosure of Interests: Sebastian Herrera: speakers bureau (academic conference); Juan Camilo Diaz-Coronado: none declared; Diego Rojas-Gualdrón: none declared; Laura Betancur-Vasquez: none declared; Daniel Gonzalez-Hurtado: none declared; Juanita Gonzalez-Arango: none declared; Laura Uribe-Arango: none declared; Maria Fernanda Saavedra Chacón: none declared; Jorge Lacouture-Fierro: none declared; Santiago Monsalve: none declared; Sebastian Guerra-Zarama: none declared; Juan David Lopez: none declared; Juan David Serna: none declared; Julian Barbosa: none declared; Ana Sierra: none declared; Deicy Hernandez-Parra: none declared; Ricardo Pineda Tamayo: none declared
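The modeling step the abstract names, a parametric Weibull model for interval-censored data, can be sketched with the lifelines library; the cohort's records, software, and covariate coding are unknown, so everything below is synthetic and illustrative.

```python
# Hedged sketch with synthetic data (not the cohort's records): an
# interval-censored Weibull regression, fit here with lifelines.
import numpy as np
import pandas as pd
from lifelines import WeibullAFTFitter

rng = np.random.default_rng(7)
n = 300
htn = rng.integers(0, 2, n)                    # high blood pressure (yes/no)
true_t = rng.weibull(1.5, n) * 12 * np.exp(-0.7 * htn)  # htn hastens onset

# Annual visits up to year 10: the event time is only known to lie between
# two visits (interval-censored) or beyond the last visit (right-censored).
observed = true_t <= 10
lb = np.where(observed, np.clip(np.floor(true_t), 0.01, None), 10.0)
ub = np.where(observed, np.floor(true_t) + 1.0, np.inf)

df = pd.DataFrame({"lb": lb, "ub": ub, "htn": htn})
aft = WeibullAFTFitter()
aft.fit_interval_censoring(df, lower_bound_col="lb", upper_bound_col="ub")
aft.print_summary()   # coefficients with 95% CIs; negative coef on htn
                      # means shorter time to severe renal involvement
```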


PEDIATRICS ◽  
1996 ◽  
Vol 98 (6) ◽  
pp. A22-A22
Author(s):  
Student

When we are told that "there's no evidence that A causes B," we should first ask whether absence of evidence means simply that there is no information at all. If there are data, we should look for quantification of the association rather than just a P value. Where risks are small, P values may well mislead: confidence intervals are likely to be wide, indicating considerable uncertainty.
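The note's point about small risks can be made numerically; the counts below are invented for illustration.

```python
# Tiny worked example (numbers invented): with rare events, a non-significant
# p-value can coexist with a confidence interval so wide that the data are
# compatible with anything from strong protection to large harm.
import numpy as np
from scipy import stats

a, n1 = 2, 1000     # cases among exposed
b, n2 = 1, 1000     # cases among unexposed

rr = (a / n1) / (b / n2)                         # relative risk = 2.0
se_log_rr = np.sqrt(1/a - 1/n1 + 1/b - 1/n2)     # Wald SE on the log scale
lo, hi = np.exp(np.log(rr) + np.array([-1.96, 1.96]) * se_log_rr)

_, p = stats.fisher_exact([[a, n1 - a], [b, n2 - b]])
print(f"RR = {rr:.1f}, 95% CI {lo:.2f} to {hi:.1f}, p = {p:.2f}")
# RR = 2.0, 95% CI 0.18 to 22.02, p = 1.00 -- "no evidence of an effect"
# here means almost no information, not evidence of no effect.
```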

