Estimating the Reproducibility of Psychological Science

Author(s):  
Brian A. Nosek ◽  
Johanna Cohoon ◽  
Mallory Kidwell ◽  
Jeffrey Robert Spies

Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
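
One of the replication criteria named in this abstract is whether the original effect size falls inside the 95% confidence interval of the replication effect. The sketch below is an illustration of that check only, not the project's analysis code; it treats effects as correlations and compares them on the Fisher z scale, and the function name and example numbers are hypothetical.

```python
# Illustrative check: is the original effect inside the replication's 95% CI?
# Effects are treated as correlations and compared on the Fisher z scale.
import math

def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))

def original_within_replication_ci(r_orig, r_rep, n_rep, level_z=1.96):
    """True if the original correlation lies in the replication's 95% CI."""
    z_rep = fisher_z(r_rep)
    se_rep = 1.0 / math.sqrt(n_rep - 3)            # large-sample SE of Fisher z
    lo, hi = z_rep - level_z * se_rep, z_rep + level_z * se_rep
    return lo <= fisher_z(r_orig) <= hi

# Hypothetical example: a strong original effect, a weaker replication
print(original_within_replication_ci(r_orig=0.50, r_rep=0.21, n_rep=120))
```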

2020 ◽  
pp. 1-9
Author(s):  
Devin S. Kielur ◽  
Cameron J. Powden

Context: Impaired dorsiflexion range of motion (DFROM) has been established as a predictor of lower-extremity injury. Compression tissue flossing (CTF) may address tissue restrictions associated with impaired DFROM; however, a consensus on these effects has yet to be reached. Objectives: To summarize the available literature regarding CTF on DFROM in physically active individuals. Evidence Acquisition: PubMed and EBSCOhost (CINAHL, MEDLINE, and SPORTDiscus) were searched from 1965 to July 2019 for related articles using combination terms related to CTF and DFROM. Articles were included if they measured the immediate effects of CTF on DFROM. Methodological quality was assessed using the Physiotherapy Evidence Database scale. The level of evidence was assessed using the Strength of Recommendation Taxonomy. The magnitude of CTF effects, both from pre-CTF to post-CTF and compared with a control of range of motion activities only, was examined using Hedges g effect sizes and 95% confidence intervals. Random-effects meta-analysis was performed to synthesize DFROM changes. Evidence Synthesis: A total of 6 studies were included in the analysis. The average Physiotherapy Evidence Database score was 60% (range = 30%–80%), with 4 of the 6 studies considered high quality and 2 considered low quality. Meta-analysis indicated no DFROM improvements for CTF compared with range of motion activities only (effect size = 0.124; 95% confidence interval, −0.137 to 0.384; P = .352) and moderate improvements from pre-CTF to post-CTF (effect size = 0.455; 95% confidence interval, 0.022 to 0.889; P = .040). Conclusions: There is grade B evidence to suggest that CTF may have no effect on DFROM when compared with a control of range of motion activities only and results in moderate improvements from pre-CTF to post-CTF. This suggests that DFROM improvements were most likely due to the exercises completed rather than the band application.
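
For readers unfamiliar with the quantities named here, the following is a minimal sketch of a Hedges g calculation and a DerSimonian-Laird random-effects pool. It is an illustration of the standard formulas only; the inputs are placeholders, not data from this review.

```python
# Minimal sketch: Hedges g for a two-group comparison and a
# DerSimonian-Laird random-effects pooled estimate with a 95% CI.
import math

def hedges_g(mean1, mean2, sd1, sd2, n1, n2):
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / sd_pooled
    j = 1 - 3 / (4 * (n1 + n2) - 9)                # small-sample correction
    return j * d

def random_effects_pool(effects, variances):
    """DerSimonian-Laird pooled effect and 95% confidence interval."""
    w = [1 / v for v in variances]
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)  # between-study variance
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

# Placeholder study-level values, not the review's data
print(hedges_g(12.0, 10.5, 3.0, 3.2, 20, 20))
print(random_effects_pool([0.40, 0.15, 0.62, 0.30], [0.10, 0.08, 0.15, 0.09]))
```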


2018 ◽  
Vol 22 (4) ◽  
pp. 469-476 ◽  
Author(s):  
Ian J. Davidson

The reporting and interpretation of effect sizes is often promoted as a panacea for the ramifications of institutionalized statistical rituals associated with the null-hypothesis significance test. Mechanical objectivity—conflating the use of a method with the obtainment of truth—is a useful theoretical tool for understanding the possible failure of effect size reporting (Porter, 1995). This article helps elucidate the ouroboros of psychological methodology: the cycle of improved tools to produce trustworthy knowledge, leading to their institutionalization and adoption as forms of thinking, leading to methodologists eventually admonishing researchers for relying too heavily on rituals, finally leading to the production of new, improved quantitative tools that may follow the same circular path. Despite many critiques and warnings, research psychologists’ superficial adoption of effect sizes might preclude expert interpretation, much as happened with the null-hypothesis significance test as it was widely received. One solution to this situation is bottom-up: promoting a balance of mechanical objectivity and expertise in the teaching of methods and research. This would require the acceptance and encouragement of expert interpretation within psychological science.


2018 ◽  
Author(s):  
Robert Calin-Jageman

This paper has now been published in the Journal of Undergraduate Neuroscience Education: http://www.funjournal.org/wp-content/uploads/2018/04/june-16-e21.pdf?x91298. See also this record on PubMed and PubMed Central: https://www.ncbi.nlm.nih.gov/pubmed/30057503. An ongoing reform in statistical practice is to report and interpret effect sizes. This paper provides a short tutorial on effect sizes and some tips on how to help your students think in terms of effect sizes when analyzing data. An effect size is just a quantitative answer to a research question. Effect sizes should always be accompanied by a confidence interval or some other means of expressing uncertainty in generalizing from the sample to the population. Effect sizes are best interpreted in raw scores but can also be expressed in standardized terms; several popular standardized effect-size measures are explained and compared. Training your students to report and interpret effect sizes can help them become better scientists: it will help them think critically about the practical significance of their results, make uncertainty salient, foster better planning for subsequent experiments, encourage meta-analytic thinking, and focus their efforts on optimizing measurement. You can help your students start to think in effect sizes by giving them tools to visualize and translate between different effect size measures, and by tasking them to build a ‘library’ of effect sizes in a research field of interest.
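
As a concrete illustration of the kind of "translation between effect size measures" the tutorial describes, here is a small, hedged sketch: the d-to-r conversion assumes two groups of roughly equal size, and the confidence interval for r uses the Fisher z approximation. The helpers and numbers are for demonstration only, not the tutorial's own materials.

```python
# Helpers for translating between effect size measures and attaching a CI.
import math

def d_to_r(d):
    return d / math.sqrt(d**2 + 4)                 # assumes roughly equal group sizes

def r_to_d(r):
    return 2 * r / math.sqrt(1 - r**2)

def r_confidence_interval(r, n, level_z=1.96):
    """95% CI for a correlation via the Fisher z approximation."""
    z = 0.5 * math.log((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)
    back = lambda z_: (math.exp(2 * z_) - 1) / (math.exp(2 * z_) + 1)
    return back(z - level_z * se), back(z + level_z * se)

print(d_to_r(0.5))                                 # roughly 0.24
print(r_confidence_interval(0.24, n=100))
```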


1994 ◽  
Vol 5 (6) ◽  
pp. 329-334 ◽  
Author(s):  
Robert Rosenthal ◽  
Donald B. Rubin

We introduce a new, readily computed statistic, the counternull value of an obtained effect size, which is the nonnull magnitude of effect size that is supported by exactly the same amount of evidence as supports the null value of the effect size. In other words, if the counternull value were taken as the null hypothesis, the resulting p value would be the same as the obtained p value for the actual null hypothesis. Reporting the counternull, in addition to the p value, virtually eliminates two common errors: (a) equating failure to reject the null with the estimation of the effect size as equal to zero, and (b) taking the rejection of a null hypothesis on the basis of a significant p value to imply a scientifically important finding. In many common situations with a one-degree-of-freedom effect size, the value of the counternull is simply twice the magnitude of the obtained effect size, but the counternull is defined in general, even with multi-degree-of-freedom effect sizes, and therefore can be applied when a confidence interval cannot be. The use of the counternull can be especially useful in meta-analyses when evaluating the scientific importance of summary effect sizes.
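
The "twice the obtained effect size" case described above is easy to compute directly. The sketch below illustrates that one-degree-of-freedom case with a null value of zero; the correlation variant doubles on the Fisher z scale, where the sampling distribution is approximately symmetric. Function names and example values are illustrative, not from the article.

```python
# Counternull for one-degree-of-freedom effect sizes with a null of zero.
import math

def counternull_d(d_obs, d_null=0.0):
    # For d-like effect sizes the counternull is 2 * observed - null.
    return 2 * d_obs - d_null

def counternull_r(r_obs, r_null=0.0):
    # For correlations, double on the Fisher z scale and transform back.
    z = lambda r: 0.5 * math.log((1 + r) / (1 - r))
    back = lambda z_: (math.exp(2 * z_) - 1) / (math.exp(2 * z_) + 1)
    return back(2 * z(r_obs) - z(r_null))

print(counternull_d(0.30))   # 0.60: supported by the same evidence as d = 0
print(counternull_r(0.30))   # counternull correlation, via Fisher z
```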


Author(s):  
David J. Miller ◽  
James T. Nguyen ◽  
Matteo Bottai

Artificial effect-size magnification (ESM) may occur in underpowered studies, where effects are reported only because they or their associated p-values have passed some threshold. Ioannidis (2008, Epidemiology 19: 640–648) and Gelman and Carlin (2014, Perspectives on Psychological Science 9: 641–651) have suggested that the plausibility of findings for a specific study can be evaluated by computation of ESM, which requires statistical simulation. In this article, we present a new command called emagnification that allows straightforward implementation of such simulations in Stata. The command automates these simulations for epidemiological studies and enables the user to assess ESM routinely for published studies using user-selected, study-specific inputs that are commonly reported in the published literature. The intention of the command is to allow a wider community to use ESM as a tool for evaluating the reliability of reported effect sizes and to put an observed statistically significant effect size into a fuller context with respect to potential implications for study conclusions.
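
The emagnification command itself is a Stata program; the sketch below only illustrates the underlying simulation idea, in the spirit of Gelman and Carlin's type M (magnitude) error, and is not the command's implementation. It draws estimates around an assumed true effect, keeps those that cross a significance threshold, and compares their average magnitude with the truth. The function name and inputs are hypothetical.

```python
# Simulation sketch of effect-size magnification under a significance filter.
import random
import statistics

def effect_size_magnification(true_effect, se, alpha_z=1.96, n_sims=100_000, seed=1):
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)           # hypothetical study estimate
        if abs(est) / se > alpha_z:                # passes the significance filter
            significant.append(abs(est))
    return statistics.mean(significant) / abs(true_effect)

# Hypothetical underpowered setting: small true effect, large standard error
print(effect_size_magnification(true_effect=0.1, se=0.2))   # magnification > 1
```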


1983 ◽  
Vol 8 (2) ◽  
pp. 93-101 ◽  
Author(s):  
Helena Chmura Kraemer

Approximations to the distribution of a common form of effect size are presented. Single sample tests, confidence interval formulation, tests of homogeneity and pooling procedures are based on these approximations. Caveats are presented concerning statistical procedures as applied to sample effect sizes commonly used in meta-analysis.
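
For orientation only, here is a textbook large-sample approximation for the sampling variance of a standardized mean difference, used to form a confidence interval. It illustrates the general idea of CI formulation for a sample effect size; it is not necessarily the specific approximation developed in this article, and the inputs are placeholders.

```python
# Large-sample CI for a standardized mean difference (textbook approximation).
import math

def d_confidence_interval(d, n1, n2, level_z=1.96):
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    se = math.sqrt(var_d)
    return d - level_z * se, d + level_z * se

print(d_confidence_interval(0.5, n1=30, n2=30))
```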


2021 ◽  
Author(s):  
Farid Anvari ◽  
Rogier Kievit ◽  
Daniel Lakens ◽  
Andrew K Przybylski ◽  
Leonid Tiokhin ◽  
...  

Psychological researchers currently lack guidance for how to evaluate the practical relevance of observed effect sizes, i.e. whether a finding will have impact when translated to a different context of application. Although psychologists have recently highlighted theoretical justifications for why small effect sizes might be practically relevant, such justifications are simplistic and fail to provide the information necessary for evaluation and falsification. Claims about whether an observed effect size is practically relevant need to consider both the mechanisms amplifying and counteracting practical relevance, as well as the assumptions underlying each mechanism at play. To provide guidance for systematically evaluating whether an observed effect size is practically relevant, we present examples of widely applicable mechanisms and the key assumptions needed for justifying whether an observed effect size can be expected to generalize to different contexts. Routine use of these mechanisms to justify claims about practical relevance has the potential to make researchers’ claims about generalizability substantially more transparent. This transparency can help move psychological science towards a more rigorous assessment of when psychological findings can be applied in the world.


2017 ◽  
Vol 28 (12) ◽  
pp. 1871-1871

Original article: Giner-Sorolla, R., & Chapman, H. A. (2017). Beyond purity: Moral disgust toward bad character. Psychological Science, 28, 80–91. doi:10.1177/0956797616673193. In this article, some effect sizes in the Results section for Study 1 were reported incorrectly and are now being corrected. In the section titled Manipulation Checks: Act and Character Ratings, we reported a d value of 0.32 for the one-sample t test comparing participants’ act ratings with the midpoint of the scale; the correct value is 0.30. The sentence should read as follows: Follow-up one-sample t tests using the midpoint of the scale as a test value (because participants compared John with Robert) indicated that the cat beater’s actions were judged to be less wrong than the woman beater’s actions, t(86) = −2.82, p = .006, d = 0.30. In the section titled Emotion Ratings, we reported a d value of 0.42 for the paired-samples t test comparing relative ratings of facial disgust and facial anger; the correct value is 0.34. In addition, the effect-size statistic is dz rather than d. The sentence should read as follows: As predicted, a paired-samples t test indicated that relative facial-disgust ratings (M = 4.36, SE = 0.21) were significantly different from relative facial-anger ratings (M = 3.63, SE = 0.20), t(86) = −3.12, p = .002, dz = 0.34; this indicates that the cat-beater and woman-beater scenarios differentially evoked disgust and anger. Later in that section, we reported a d value of 0.21 for the one-sample t test comparing ratings of facial disgust with the midpoint of the scale; the correct value is 0.20. In the same sentence, we reported a d value of 0.21 for the one-sample t test comparing ratings of facial anger with the midpoint of the scale; the correct value is 0.19. The sentence should read as follows: Follow-up one-sample t tests against the midpoint of the scale showed trends in the predicted directions, with higher disgust for the cat beater compared with the woman beater, t(86) = 1.7, p = .088, d = 0.20, and higher anger for the woman beater compared with the cat beater, t(86) = −1.82, p = .072, d = 0.19 (see Fig. 1). These errors do not affect the significance of the results or the overall conclusions for Study 1.


2020 ◽  
Author(s):  
Eric Jamaal Cooks ◽  
Scott Parrott ◽  
Danielle Deavours

Interpretations of effect size are typically made either through comparison against previous studies or against established benchmarks. This study examines the distribution of published effects among studies with and without preregistration in a set of 22 communication journals. Building from previous research in psychological science, 440 effects were randomly drawn from past publications without preregistration and compared against 35 effects from preregistered studies, and against Cohen’s conventions for effect size. Reported effects from studies without preregistration (median r = .33) were larger than those from studies with a preregistration plan (median r = .24). The magnitude of effects from studies without preregistration was greater across conventions for “small” and “large” effects. Differences were also found based on communication subdiscipline. These findings suggest that studies without preregistration may overestimate population effects and that global conventions may not be applicable in communication science.


1994 ◽  
Vol 21 (1) ◽  
pp. 150-175 ◽  
Author(s):  
MARTIN L. LALUMIÈRE ◽  
VERNON L. QUINSEY

The authors examined how well identified rapists could be discriminated from non-sex offenders using phallometric assessments, what variables might moderate this discrimination, and whether rapists respond more to descriptions of rape than to consenting sex. Eleven primary and five secondary phallometric studies involving 415 rapists and 192 non-sex offenders were examined using meta-analytic techniques. Study effect sizes averaged 0.82 (95% confidence interval, 0.16 to 1.49). Only stimulus set was a statistically significant moderator of effect size: stimulus sets that contained more graphic rape descriptions produced better discrimination between rapists and non-sex offenders. There was a trend for stimulus sets that contained more exemplars of rape descriptions to achieve better discrimination. Also, rapists responded more to rape cues than to consenting-sex cues in 9 of the 16 data sets and in all 8 of those using the more effective stimulus sets.

