Outlier Exclusion Procedures Must Be Blind to the Researcher’s Hypothesis

2021 ◽  
Author(s):  
Quentin André

When researchers choose to identify and exclude outliers from their data, should they do so across all the data, or within experimental conditions? A survey of recent papers published in the Journal of Experimental Psychology: General shows that both methods are widely used, and common data visualization techniques suggest that outliers should be excluded at the condition level. However, I highlight in the present paper that removing outliers by condition runs against the logic of hypothesis testing, and that this practice leads to unacceptable increases in false-positive rates. I demonstrate that this conclusion holds true across a variety of statistical tests, exclusion criteria and cutoffs, sample sizes, and data types, and show in simulated experiments and in a re-analysis of existing data that by-condition exclusions can result in false-positive rates as high as 43%. Finally, I demonstrate that by-condition exclusions are a specific case of a more general issue: any outlier exclusion procedure that is not blind to the hypothesis that researchers want to test may result in inflated Type I errors. I conclude by offering best practices and recommendations for excluding outliers.
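The core contrast can be reproduced in a few lines. The sketch below is a generic illustration of the problem, not André's actual simulation code: both conditions are drawn from the same population, so any significant t-test is a false positive, and observations beyond an assumed 2-SD cutoff are excluded either within each condition or relative to the pooled data. The function name, cutoff, and choice of test are my own illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(by_condition, n_sims=2000, n=50, cutoff=2.0):
    """Fraction of null experiments declared significant at alpha = .05
    after excluding points more than `cutoff` SDs from the relevant mean."""
    hits = 0
    for _ in range(n_sims):
        # Both conditions come from the same population: H0 is true.
        a = rng.normal(size=n)
        b = rng.normal(size=n)
        if by_condition:
            # Exclusion within each condition separately
            a = a[np.abs(a - a.mean()) <= cutoff * a.std()]
            b = b[np.abs(b - b.mean()) <= cutoff * b.std()]
        else:
            # Exclusion relative to the pooled mean and SD (hypothesis-blind)
            pooled = np.concatenate([a, b])
            m, s = pooled.mean(), pooled.std()
            a = a[np.abs(a - m) <= cutoff * s]
            b = b[np.abs(b - m) <= cutoff * s]
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / n_sims
```

Comparing `false_positive_rate(True)` against `false_positive_rate(False)` illustrates how trimming within conditions can push the false-positive rate above the nominal 5%, whereas pooled (hypothesis-blind) trimming does not.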

2015 ◽  
Vol 23 (2) ◽  
pp. 306-312 ◽  
Author(s):  
Annie Franco ◽  
Neil Malhotra ◽  
Gabor Simonovits

The accuracy of published findings is compromised when researchers fail to report and adjust for multiple testing. Preregistration of studies and the requirement of preanalysis plans for publication are two proposed solutions to combat this problem. Some have raised concerns that such changes in research practice may hinder inductive learning. However, without knowing the extent of underreporting, it is difficult to assess the costs and benefits of institutional reforms. This paper examines published survey experiments conducted as part of the Time-sharing Experiments in the Social Sciences program, where the questionnaires are made publicly available, allowing us to compare planned design features against what is reported in published research. We find that: (1) 30% of papers report fewer experimental conditions in the published paper than in the questionnaire; (2) roughly 60% of papers report fewer outcome variables than are listed in the questionnaire; and (3) about 80% of papers fail to report all experimental conditions and outcomes. These findings suggest that published statistical tests understate the probability of type I errors.


2020 ◽  
Author(s):  
Jeff Miller

Contrary to the warning of Miller (1988), Rousselet and Wilcox (2020) argued that it is better to summarize each participant’s single-trial reaction times (RTs) in a given condition with the median than with the mean when comparing the central tendencies of RT distributions across experimental conditions. They acknowledged that median RTs can produce inflated Type I error rates when conditions differ in the number of trials tested, consistent with Miller’s warning, but they showed that the bias responsible for this error rate inflation could be eliminated with a bootstrap bias correction technique. The present simulations extend their analysis by examining the power of bias-corrected medians to detect true experimental effects and by comparing this power with the power of analyses using means and regular medians. Unfortunately, although bias-corrected medians solve the problem of inflated Type I error rates, their power is lower than that of means or regular medians in many realistic situations. In addition, even when conditions do not differ in the number of trials tested, the power of tests (e.g., t-tests) is generally lower using medians rather than means as the summary measures. Thus, the present simulations demonstrate that summary means will often provide the most powerful test for differences between conditions, and they show what aspects of the RT distributions determine the size of the power advantage for means.
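The bias-correction idea discussed above can be sketched directly: estimate the median's bias from bootstrap resamples and subtract it. This is a generic bootstrap bias correction, not Rousselet and Wilcox's exact procedure, and the simulated RT distribution (normal plus exponential tail, an ex-Gaussian-like shape) is an assumed stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)

def bias_corrected_median(x, n_boot=2000):
    """Bootstrap bias correction for the sample median:
    corrected = median - bias, where bias is estimated as
    mean(bootstrap medians) - sample median."""
    x = np.asarray(x)
    med = np.median(x)
    boot_meds = np.median(rng.choice(x, size=(n_boot, x.size)), axis=1)
    return 2 * med - boot_meds.mean()

# Skewed single-trial RTs for one participant in one condition
rts = rng.normal(400, 50, size=30) + rng.exponential(100, size=30)
```

With few trials per condition, the raw sample median of a skewed distribution is biased upward; the correction matters precisely when conditions differ in trial counts, which is the case the abstract discusses.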


2015 ◽  
Vol 2015 ◽  
pp. 1-5
Author(s):  
Wararit Panichkitkosolkul

An asymptotic test and an approximate test for the reciprocal of a normal mean with a known coefficient of variation were proposed in this paper. The asymptotic test was based on the expectation and variance of the estimator of the reciprocal of a normal mean. The approximate test used the approximate expectation and variance of the estimator by Taylor series expansion. A Monte Carlo simulation study was conducted to compare the performance of the two statistical tests. Simulation results showed that the two proposed tests performed well in terms of empirical type I errors and power. Nevertheless, the approximate test was easier to compute than the asymptotic test.
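The Taylor-series step behind the approximate test can be illustrated numerically: a first-order (delta-method) expansion of 1/x̄ gives Var(1/X̄) ≈ τ²/(nμ²) when the coefficient of variation τ is known. The sketch below checks this approximation by simulation; the parameter values are arbitrary, and the paper's full test statistics are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, cv, n = 5.0, 0.2, 100        # true mean, known coefficient of variation, sample size
sigma = cv * mu

# First-order Taylor (delta-method) variance of 1/x-bar:
# Var(1/X̄) ≈ (1/μ²)² · Var(X̄) = (1/μ²)² · σ²/n = cv² / (n·μ²)
delta_var = cv**2 / (n * mu**2)

# Empirical variance of 1/x-bar over many simulated samples
recip = 1.0 / rng.normal(mu, sigma, size=(20000, n)).mean(axis=1)
emp_var = recip.var()
```

For moderate n and small cv, the empirical and delta-method variances agree closely, which is why the simpler approximate test can perform on par with the asymptotic one.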


2017 ◽  
Author(s):  
Jesse E D Miller ◽  
Anthony Ives ◽  
Ellen Damschen

1. Plant functional traits are increasingly being used to infer mechanisms of community assembly and predict global change impacts. Of the several approaches used to analyze trait-environment relationships, one of the most popular is community-weighted means (CWM), in which species trait values are averaged at the site level. Other approaches that do not require averaging are being developed, including multilevel models (MLM, also called generalized linear mixed models). However, the relative strengths and weaknesses of these methods have not been extensively compared.
2. We investigated three statistical models for trait-environment associations: CWM, an MLM in which traits were not included as fixed effects (MLM1), and an MLM with traits as fixed effects (MLM2). We analyzed a real plant community dataset to investigate associations between two traits and one environmental variable. We then analyzed permutations of the dataset to investigate sources of type I errors, and performed a simulation study to compare the statistical power of the methods.
3. In the analysis of real data, CWM gave highly significant associations for both traits, while MLM1 and MLM2 did not. Using P-values derived by simulating data from the fitted MLM2, none of the models gave significant associations, showing that CWM had inflated type I errors (false positives). In the permutation tests, MLM2 performed the best of the three approaches. MLM2 still had inflated type I error rates in some situations, but this could be corrected using bootstrapping. The simulation study showed that MLM2 always had power as good as or better than CWM. These simulations also confirmed the causes of type I errors identified in the permutation study.
4. The MLM that includes main effects of traits (MLM2) is the best method for identifying trait-environment associations in community assembly, with better type I error control and greater power. Analyses that regress CWMs on continuous environmental variables are not reliable because they are likely to produce type I errors.
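For readers unfamiliar with the CWM approach the abstract critiques, a minimal sketch follows: trait values are abundance-weighted at each site, and the resulting site-level means are regressed on an environmental variable. All data below are randomly generated placeholders, and this is not the authors' analysis code; it only shows the computation whose P-values the paper finds unreliable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

n_sites, n_species = 30, 50
abundance = rng.poisson(2.0, size=(n_sites, n_species))  # site-by-species counts
trait = rng.normal(size=n_species)                       # one trait value per species
env = rng.normal(size=n_sites)                           # one environmental variable per site

# Community-weighted mean trait at each site
cwm = (abundance * trait).sum(axis=1) / abundance.sum(axis=1)

# The CWM approach: regress site-level CWMs on the environment
res = stats.linregress(env, cwm)
```

The problem the paper identifies is not with this arithmetic but with treating sites as independent replicates: species-level dependence inflates the type I error of the regression P-value.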


2021 ◽  
Vol 23 (3) ◽  
Author(s):  
Estelle Chasseloup ◽  
Adrien Tessier ◽  
Mats O. Karlsson

Longitudinal pharmacometric models offer many advantages in the analysis of clinical trial data, but potentially inflated type I error and biased drug effect estimates, as a consequence of model misspecification and multiple testing, are their main drawbacks. In this work, we used real data to compare these aspects for a standard approach (STD) and a new one using mixture models, called individual model averaging (IMA). Placebo arm data sets were obtained from three clinical studies assessing ADAS-Cog scores, Likert pain scores, and seizure frequency. By randomly (1:1) assigning patients in the above data sets to “treatment” or “placebo,” we created data sets where any significant drug effect was known to be a false positive. By repeating the process of random assignment and analysis for a significant drug effect many times (N = 1000) for each of the 40 to 66 placebo-drug model combinations, statistics of the type I error and drug effect bias were obtained. Across all models and the three data types, the type I error (5th, 25th, 50th, 75th, and 95th percentiles) was 4.1%, 11.4%, 40.6%, 100.0%, and 100.0% for STD, and 1.6%, 3.5%, 4.3%, 5.0%, and 6.0% for IMA. IMA showed no bias in the drug effect estimates, whereas in STD bias was frequently present. In conclusion, STD is associated with inflated type I error and a risk of biased drug effect estimates, whereas IMA demonstrated controlled type I error and no bias.
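The null-relabeling design used to obtain known false positives can be sketched generically: split a single placebo arm at random into sham "drug" and "placebo" groups, test for a difference, and repeat. The t-test below is a simple stand-in for the longitudinal pharmacometric models actually fitted in the study; the function name and defaults are my own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def null_relabel_type_i(scores, n_reps=1000, alpha=0.05):
    """Randomly split one placebo arm into sham 'drug' and 'placebo' groups;
    any significant difference is a known false positive. Returns the
    empirical type I error over n_reps random relabelings."""
    scores = np.asarray(scores)
    half = scores.size // 2
    hits = 0
    for _ in range(n_reps):
        perm = rng.permutation(scores)
        if stats.ttest_ind(perm[:half], perm[half:]).pvalue < alpha:
            hits += 1
    return hits / n_reps
```

A well-calibrated analysis should return a rate near the nominal alpha; rates far above it, as reported for STD, indicate inflated type I error.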


2017 ◽  
Author(s):  
Olivier Naret ◽  
Nimisha Chaturvedi ◽  
Istvan Bartha ◽  
Christian Hammer ◽  
Jacques Fellay

Studies of host genetic determinants of pathogen sequence variation can identify sites of genomic conflicts, by highlighting variants that are implicated in immune response on the host side and adaptive escape on the pathogen side. However, systematic genetic differences in host and pathogen populations can lead to inflated type I (false positive) and type II (false negative) error rates in genome-wide association analyses. Here, we demonstrate through simulation that correcting for both host and pathogen stratification reduces spurious signals and increases power to detect real associations in a variety of tested scenarios. We confirm the validity of the simulations by showing comparable results in an analysis of paired human and HIV genomes.


2022 ◽  
Vol 29 (1) ◽  
pp. 1-70
Author(s):  
Radu-Daniel Vatavu ◽  
Jacob O. Wobbrock

We clarify fundamental aspects of end-user elicitation, enabling such studies to be run and analyzed with confidence, correctness, and scientific rigor. To this end, our contributions are multifold. We introduce a formal model of end-user elicitation in HCI and identify three types of agreement analysis: expert, codebook, and computer. We show that agreement is a mathematical tolerance relation generating a tolerance space over the set of elicited proposals. We review current measures of agreement and show that all can be computed from an agreement graph. In response to recent criticisms, we show that chance agreement represents an issue solely for inter-rater reliability studies and not for end-user elicitation, where it is opposed by chance disagreement. We conduct extensive simulations of 16 statistical tests for agreement rates, and report their Type I errors and power. Based on our findings, we provide recommendations for practitioners and introduce a five-level hierarchy for elicitation studies.


Author(s):  
Mariusz Maziarz ◽  
Adrian Stencel

Rationale, aims, and objectives: The current strategy of searching for an effective drug to treat COVID-19 relies mainly on repurposing existing therapies developed to target other diseases. There are currently more than four thousand active studies assessing the efficacy of existing drugs as therapies for COVID-19. The number of ongoing trials and the urgent need for a treatment pose the risk that false-positive results will be incorrectly interpreted as evidence of treatments’ efficacy and grounds for drug approval. Our purpose is to assess the risk of false-positive outcomes by analyzing the mechanistic evidence for the efficacy of exemplary candidates for repurposing, estimate the false discovery rate, and discuss solutions to the problem of excessive hypothesis testing.
Methods: We estimate the expected number of false-positive results and the probability of at least one false-positive result under the assumption that all tested compounds have no effect on the course of the disease. We then relax this assumption and analyze the sensitivity of the expected number of true-positive results to changes in the prior probability (π) that tested compounds are effective. Finally, we calculate the False Positive Report Probability and the expected numbers of false-positive and true-positive results for different thresholds of statistical significance, powers of studies, and ratios of effective to non-effective compounds. We also review the mechanistic evidence for the efficacy of two exemplary repurposing candidates (hydroxychloroquine and ACE2 inhibitors) and assess its quality in order to choose plausible values of the prior probability (π) that tested compounds are effective against COVID-19.
Results: Our analysis shows that, due to the excessive number of statistical tests in the field of drug repurposing for COVID-19 and the low prior probability (π) of the efficacy of tested compounds, positive results are far more likely to result from type I error than to reflect the effects of pharmaceutical interventions.
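The quantities described above follow from elementary probability. A minimal sketch with the usual definitions (α = significance threshold, π = prior probability that a tested compound is effective); the function names are mine:

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives if no tested compound works."""
    return n_tests * alpha

def prob_at_least_one_fp(n_tests, alpha=0.05):
    """Probability of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def false_positive_report_probability(alpha, power, pi):
    """FPRP: probability that a 'significant' result is a false positive,
    given prior probability pi that the tested compound is effective."""
    return alpha * (1 - pi) / (alpha * (1 - pi) + power * pi)

# With ~4000 studies at alpha = .05, about 200 false positives are expected
# even if nothing works; and for a low-prior compound (pi = .01, power = .8),
# a significant result is a false positive with probability ~0.86.
```

These formulas make the abstract's conclusion concrete: with thousands of tests and small π, most significant repurposing results are expected to be type I errors.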


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 1129
Author(s):  
Christopher R. Madan

Statistical analyses are often conducted with α=.05. When multiple statistical tests are conducted, this threshold needs to be adjusted to compensate for the otherwise inflated Type I error. In tabletop gaming, it is sometimes desirable to roll a 20-sided die (or `d20`) twice and take the greater outcome. Here I draw from probability theory and the case of a d20, where the probability of obtaining any specific outcome is 1/20, to determine the probability of obtaining a specific outcome (a Type I error) at least once across repeated, independent statistical tests.
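The calculation generalizes directly: the chance of seeing a given outcome at least once in n independent trials is 1 − (1 − p)ⁿ. A short sketch (the function name is my own):

```python
def prob_at_least_once(p, n):
    """Probability that an outcome with per-trial probability p
    occurs at least once in n independent trials."""
    return 1 - (1 - p) ** n

# A d20 rolled twice, taking the greater: chance of at least one natural 20
print(prob_at_least_once(1 / 20, 2))   # ≈ 0.0975

# The analogous Type I inflation across three independent tests at alpha = .05
print(prob_at_least_once(0.05, 3))     # ≈ 0.1426
```

The same formula underlies the familiar advice that k uncorrected tests at α inflate the family-wise error rate to 1 − (1 − α)ᵏ.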


2021 ◽  
Vol 17 (12) ◽  
pp. e1009036
Author(s):  
Jack Kuipers ◽  
Ariane L. Moore ◽  
Katharina Jahn ◽  
Peter Schraml ◽  
Feng Wang ◽  
...  

Tumour progression is an evolutionary process in which different clones evolve over time, leading to intra-tumour heterogeneity. Interactions between clones can affect tumour evolution and hence disease progression and treatment outcome. Intra-tumoural pairs of mutations that are overrepresented in a co-occurring or clonally exclusive fashion over a cohort of patient samples may be suggestive of a synergistic effect between the different clones carrying these mutations. We therefore developed a novel statistical testing framework, called GeneAccord, to identify such gene pairs that are altered in distinct subclones of the same tumour. We analysed our framework for calibration and power. By comparing its performance to baseline methods, we demonstrate that to control type I errors, it is essential to account for the evolutionary dependencies among clones. In applying GeneAccord to the single-cell sequencing of a cohort of 123 acute myeloid leukaemia patients, we find 1 clonally co-occurring and 8 clonally exclusive gene pairs. The clonally exclusive pairs mostly involve genes of the key signalling pathways.

