Cavalier Use of Inferential Statistics Is a Major Source of False and Irreproducible Scientific Findings

Leonid Hanin

doi:10.3390/math9060603

Cavalier Use of Inferential Statistics Is a Major Source of False and Irreproducible Scientific Findings

Mathematics ◽

10.3390/math9060603 ◽

2021 ◽

Vol 9 (6) ◽

pp. 603

Author(s):

Leonid Hanin

Keyword(s):

Sample Size ◽

Gaussian Approximation ◽

Statistical Significance ◽

Statistical Analyses ◽

Random Sample Size ◽

P Values ◽

The Central Limit Theorem ◽

Fixed Sample ◽

Large Numbers ◽

Significance Levels

I uncover previously underappreciated systematic sources of false and irreproducible results in natural, biomedical and social sciences that are rooted in statistical methodology. They include the inevitably occurring deviations from basic assumptions behind statistical analyses and the use of various approximations. I show through a number of examples that (a) arbitrarily small deviations from distributional homogeneity can lead to arbitrarily large deviations in the outcomes of statistical analyses; (b) samples of random size may violate the Law of Large Numbers and thus are generally unsuitable for conventional statistical inference; (c) the same is true, in particular, when random sample size and observations are stochastically dependent; and (d) the use of the Gaussian approximation based on the Central Limit Theorem has dramatic implications for p-values and statistical significance, essentially making the pursuit of small significance levels and p-values for a fixed sample size meaningless. The latter is proven rigorously for the one-sided Z test. This article could serve as cautionary guidance to scientists and practitioners employing statistical methods in their work.
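
Point (d) can be made concrete with a small numerical sketch. It is an illustration under assumed settings (a fair-coin null hypothesis and n = 100, neither taken from the article), not the author's proof: it compares the exact one-sided tail probability of a binomial count with the p-value that a one-sided Z test reports through the Gaussian approximation.

```python
import math
from statistics import NormalDist

def exact_upper_p(k, n, p0=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(math.comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(k, n + 1))

def z_upper_p(k, n, p0=0.5):
    """One-sided Z-test p-value from the Gaussian (CLT) approximation,
    without continuity correction."""
    z = (k - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return 1.0 - NormalDist().cdf(z)

# Hypothetical setting: n = 100 tosses, null hypothesis p0 = 0.5.
n = 100
for k in (55, 60, 65, 70, 75):
    exact, approx = exact_upper_p(k, n), z_upper_p(k, n)
    print(f"k={k}: exact={exact:.3e}  gaussian={approx:.3e}  ratio={exact / approx:.2f}")
```

The printed ratio shows how far the approximate p-value can sit from the exact one at a fixed sample size; when the p-value itself is small, even a modest absolute approximation error is large in relative terms.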

Bayesian interpretation of p values in clinical trials

BMJ evidence-based medicine ◽

10.1136/bmjebm-2020-111603 ◽

2021 ◽

pp. bmjebm-2020-111603

Author(s):

John Ferguson

Keyword(s):

Clinical Trial ◽

Clinical Trials ◽

Sample Size ◽

Confidence Intervals ◽

Statistical Significance ◽

Large Sample Size ◽

P Values ◽

Clinical Trial Results ◽

Sound Treatment ◽

Counterintuitive Result

Commonly accepted statistical advice dictates that large, highly powered clinical trials generate more reliable evidence than trials with smaller sample sizes. This advice is generally sound: treatment effect estimates from larger trials tend to be more accurate, as witnessed by tighter confidence intervals in addition to reduced publication biases. Consider then two clinical trials testing the same treatment which result in the same p values, the trials being identical apart from differences in sample size. Assuming statistical significance, one might at first suspect that the larger trial offers stronger evidence that the treatment in question is truly effective. Yet, often precisely the opposite will be true. Here, we illustrate and explain this somewhat counterintuitive result and suggest some ramifications regarding interpretation and analysis of clinical trial results.
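
One way to see the effect is through a Bayes factor in a simple normal model. The sketch below is an assumption-laden illustration and not necessarily the model used in the paper: the estimate has variance sigma^2 / n, the null hypothesis fixes the effect at zero, and the alternative places a N(0, tau^2) prior on the effect. The function name `bf01` and the defaults tau = sigma = 1 are hypothetical.

```python
import math

def bf01(z, n, tau=1.0, sigma=1.0):
    """Bayes factor in favour of H0 (effect = 0) over H1 (effect ~ N(0, tau^2)),
    given a z statistic from a sample of size n with known SD sigma.

    Under H0 the sample mean is N(0, sigma^2 / n); under H1 it is
    N(0, tau^2 + sigma^2 / n). The ratio of these densities simplifies to
    the expression below, with r = n * tau^2 / sigma^2.
    """
    r = n * tau**2 / sigma**2
    return math.sqrt(1 + r) * math.exp(-0.5 * z**2 * r / (1 + r))

# Same z (hence the same two-sided p-value of about 0.05), different sample sizes.
for n in (50, 500, 5000):
    print(f"n={n}: BF01={bf01(1.96, n):.2f}")
```

Under these assumptions, holding z fixed while increasing n raises the Bayes factor in favour of the null for the sample sizes shown, so the larger trial with the same p value provides weaker evidence against the null.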

Visualization Strategies for Regression Estimates with Randomization Inference

10.31235/osf.io/bsd7g ◽

2019 ◽

Author(s):

Marshall A. Taylor

Keyword(s):

Confidence Interval ◽

Confidence Intervals ◽

Regression Models ◽

Statistical Significance ◽

Permutation Tests ◽

P Value ◽

P Values ◽

Alpha Level ◽

Significance Levels ◽

Nonprobability Sample

Coefficient plots are a popular tool for visualizing regression estimates. The appeal of these plots is that they visualize confidence intervals around the estimates and generally center the plot around zero, meaning that any estimate that crosses zero is statistically non-significant at least at the alpha-level around which the confidence intervals are constructed. For models with statistical significance levels determined via randomization models of inference and for which there is no standard error or confidence intervals for the estimate itself, these plots appear less useful. In this paper, I illustrate a variant of the coefficient plot for regression models with p-values constructed using permutation tests. These visualizations plot each estimate's p-value and its associated confidence interval in relation to a specified alpha-level. These plots can help the analyst interpret and report both the statistical and substantive significance of their models. Illustrations are provided using a nonprobability sample of activists and participants at a 1962 anti-Communism school.

New exact Bayesian prediction of the range for the exponential lifetime based on fixed and random sample sizes

International Journal of Algebra and Statistics ◽

10.20454/ijas.2017.1369 ◽

2017 ◽

Vol 6 (1-2) ◽

pp. 169

Author(s):

A. H. Abd Ellah

Keyword(s):

Sample Size ◽

Random Sample ◽

Real Data ◽

Predictive Distribution ◽

Data Sets ◽

Life Testing ◽

Random Sample Size ◽

Fixed Sample Size ◽

Fixed Sample ◽

Special Case

We consider the problem of constructing a predictive interval for the range of future observations from an exponential distribution. Two cases are considered: (1) fixed sample size (FSS) and (2) random sample size (RSS). Further, we derive the predictive function for both FSS and RSS in closed form. Random sample sizes appear in many applications of life testing, and fixed sample size is a special case of random sample size. Illustrative examples are given. Factors of the predictive distribution are given. A comparison in savings is made with the above method. To show the applications of our results, we present some simulation experiments. Finally, we apply our results to some real data sets in life testing.

Visualization strategies for regression estimates with randomization inference

The Stata Journal: Promoting communications on statistics and Stata ◽

10.1177/1536867x20930999 ◽

2020 ◽

Vol 20 (2) ◽

pp. 309-335

Author(s):

Marshall A. Taylor

Keyword(s):

Confidence Interval ◽

Confidence Intervals ◽

Regression Models ◽

Statistical Significance ◽

Permutation Tests ◽

P Value ◽

P Values ◽

Alpha Level ◽

Significance Levels ◽

Nonprobability Sample

Coefficient plots are a popular tool for visualizing regression estimates. The appeal of these plots is that they visualize confidence intervals around the estimates and generally center the plot around zero, meaning that any estimate that crosses zero is statistically nonsignificant at least at the alpha level around which the confidence intervals are constructed. For models with statistical significance levels determined via randomization models of inference and for which there is no standard error or confidence intervals for the estimate itself, these plots appear less useful. In this article, I illustrate a variant of the coefficient plot for regression models with p-values constructed using permutation tests. These visualizations plot each estimate’s p-value and its associated confidence interval in relation to a specified alpha level. These plots can help the analyst interpret and report the statistical and substantive significances of their models. I illustrate using a nonprobability sample of activists and participants at a 1962 anticommunism school.
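
For readers unfamiliar with permutation-based p-values, here is a minimal sketch of the idea. It uses a two-group difference in means rather than a regression coefficient, and it enumerates every relabelling exactly instead of sampling permutations, so it is a simplified stand-in for the article's setting. The function name, the toy data, and the alpha level are hypothetical.

```python
from itertools import combinations

def permutation_p_value(a, b):
    """Exact two-sided permutation p-value for the difference in group means.

    Every way of assigning len(a) of the pooled observations to the first
    group is enumerated; the p-value is the share of assignments whose
    absolute mean difference is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(a))
    count = extreme = 0
    for idx in combinations(range(len(pooled)), n_a):
        count += 1
        # Small tolerance guards against floating-point ties.
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / count

alpha = 0.05  # hypothetical alpha level to compare against
p = permutation_p_value([1, 2, 3, 4], [5, 6, 7, 8])
print(f"p = {p:.4f}, below alpha: {p < alpha}")
```

With many observations, full enumeration becomes infeasible and the p-value is usually estimated from a random subset of permutations, which is why such an estimate carries its own confidence interval.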

The weak convergence of the empirical process with random sample size

Mathematical Proceedings of the Cambridge Philosophical Society ◽

10.1017/s0305004100042663 ◽

1968 ◽

Vol 64 (1) ◽

pp. 155-160 ◽

Cited By ~ 21

Author(s):

Ronald Pyke

Keyword(s):

Sample Size ◽

Random Sample ◽

Independent Random Variables ◽

Fixed Cost ◽

Operating Characteristics ◽

Probability Models ◽

Random Sample Size ◽

Statistical Inferences ◽

Fixed Sample ◽

Common Distribution Function

In many applied probability models, one is concerned with a sequence {Xn: n ≥ 1} of independent random variables (r.v.'s) with a common distribution function (d.f.), F say. When making statistical inferences within such a model, one frequently must do so on the basis of observations X1, X2, …, XN where the sample size N is a r.v. For example, N might be the number of observations that it was possible to take within a given period of time or within a fixed cost of experimentation. In cases such as these it is not uncommon for statisticians to use fixed-sample-size techniques, even though the random sample size, N, is not independent of the sample. It is therefore important to investigate the operating characteristics of these techniques under random sample sizes. Much work has been done since 1952 on this problem for techniques based on the sum, X1 + … + XN (see, for example, the references in (3)). Also, for techniques based on max(X1, X2, …, XN), results have been obtained independently by Barndorff-Nielsen(2) and Lamperti(9).
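
A small exact calculation illustrates why dependence between N and the sample matters. The stopping rule below is a hypothetical example, not one taken from the paper: Bernoulli(p) observations are collected until the first success or until a cap of max_n observations, and the ordinary sample mean is then used as if N were fixed.

```python
from fractions import Fraction

def expected_sample_mean(p, max_n):
    """Exact E[(X_1 + ... + X_N) / N] for Bernoulli(p) observations when
    N is the index of the first success, capped at max_n."""
    q = 1 - p
    total = Fraction(0)
    for k in range(1, max_n + 1):
        # First success at observation k: probability q^(k-1) * p,
        # and the sample then contains one success in k observations.
        total += q ** (k - 1) * p * Fraction(1, k)
    # If no success occurs within max_n observations the sample mean is 0,
    # so that event adds nothing to the expectation.
    return total

p = Fraction(1, 2)
for cap in (1, 3, 50):
    print(cap, float(expected_sample_mean(p, cap)))
```

With cap = 1 the sample size is fixed and the sample mean is unbiased for p = 1/2. As the cap grows, the expectation approaches ln 2 ≈ 0.693, so the fixed-sample-size estimator is biased under this data-dependent sample size.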

Insights into Criteria for Statistical Significance from Signal Detection Analysis

Meta-Psychology ◽

10.15626/mp.2018.871 ◽

2019 ◽

Vol 3 ◽

Cited By ~ 2

Author(s):

Jessica K. Witt

Keyword(s):

Signal Detection ◽

Sample Size ◽

Effect Size ◽

Statistical Significance ◽

Bayes Factors ◽

False Alarms ◽

P Values ◽

Questionable Research Practices ◽

Detection Analysis ◽

Signal Detection Analysis

What is the best criterion for determining statistical significance? In psychology, the criterion has been p < .05. This criterion has been criticized since its inception, and the criticisms have been rejuvenated with recent failures to replicate studies published in top psychology journals. Several replacement criteria have been suggested including reducing the alpha level to .005 or switching to other types of criteria such as Bayes factors or effect sizes. Here, various decision criteria for statistical significance were evaluated using signal detection analysis on the outcomes of simulated data. The signal detection measure of area under the curve (AUC) is a measure of discriminability with a value of 1 indicating perfect discriminability and 0.5 indicating chance performance. Applied to criteria for statistical significance, it provides an estimate of the decision criterion’s performance in discriminating real effects from null effects. AUCs were high (M = .96, median = .97) for p values, suggesting merit in using p values to discriminate significant effects. AUCs can be used to assess methodological questions such as how much improvement will be gained with increased sample size, how much discriminability will be lost with questionable research practices, and whether it is better to run a single high-powered study or a study plus a replication at lower powers. AUCs were also used to compare performance across p values, Bayes factors, and effect size (Cohen’s d). AUCs were equivalent for p values and Bayes factors and were slightly higher for effect size. Signal detection analysis provides separate measures of discriminability and bias. With respect to bias, the specific thresholds that produced optimal utility depended on sample size, although this dependency was particularly notable for p values and less so for Bayes factors. The application of signal detection theory to the issue of statistical significance highlights the need to focus on both false alarms and misses, rather than false alarms alone.
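
The AUC idea can be sketched with a small simulation. The design below is an assumption made for illustration and is not the article's simulation: two groups of 50 observations, a true effect of 0.5 standard deviations in the "real effect" studies, 300 studies of each kind, and a large-sample z approximation for the p-value.

```python
import random
from statistics import NormalDist, mean, stdev

def simulate_p_value(rng, effect, n):
    """Two-sided p-value (z approximation) for one simulated two-group study."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(effect, 1.0) for _ in range(n)]
    se = ((stdev(a) ** 2 + stdev(b) ** 2) / n) ** 0.5
    z = (mean(b) - mean(a)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def auc(null_p, real_p):
    """Probability that a real-effect study yields a smaller p-value than a
    null study, counting ties as one half."""
    wins = sum((r < q) + 0.5 * (r == q) for r in real_p for q in null_p)
    return wins / (len(real_p) * len(null_p))

rng = random.Random(0)  # fixed seed so the run is repeatable
null_p = [simulate_p_value(rng, 0.0, 50) for _ in range(300)]
real_p = [simulate_p_value(rng, 0.5, 50) for _ in range(300)]
print(f"AUC = {auc(null_p, real_p):.3f}")
```

An AUC near 1 means p values separate real effects from null effects well under the assumed design. Rerunning with a smaller n or a smaller effect shows how discriminability degrades.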

Statistical Significance

Research Methods in the Social Sciences: An A-Z of key concepts ◽

10.1093/hepl/9780198850298.003.0063 ◽

2021 ◽

pp. 269-272

Author(s):

Jean-Frédéric Morin ◽

Christian Olsson ◽

Ece Özlem Atikcan

Keyword(s):

Quantitative Analysis ◽

Sample Size ◽

Null Hypothesis ◽

Statistical Significance ◽

Statistical Test ◽

Positive Answer ◽

Single Sample ◽

P Values ◽

Sampling Process

This chapter highlights statistical significance. The key question in quantitative analysis is whether a pattern observed in a sample also holds for the population from which the sample was drawn. A positive answer to this question implies that the result is ‘statistically significant’, i.e. it was not produced by a random variation from sample to sample, but, instead, reflects the pattern that exists in the population. The null hypothesis statistical test (NHST) has been a widely used approach for testing whether inference from a sample to the population is valid. Seeking to test whether valid inferences about the population could be made based on the results from a single sample, a researcher should consider a wide variety of approaches and take into account not only p-values, but also the sampling process, sample size, the quality of measurement, and other factors that may influence the reliability of estimates.

Testing for baseline differences in clinical trials

International Journal of Clinical Trials ◽

10.18203/2349-3259.ijct20201720 ◽

2020 ◽

Vol 7 (2) ◽

pp. 150

Author(s):

Henian Chen ◽

Yuanyuan Lu ◽

Nicole Slye

Keyword(s):

Clinical Trials ◽

Sample Size ◽

Sofa Score ◽

Statistical Tests ◽

Statistical Significance ◽

P Value ◽

Large Trial ◽

P Values ◽

Failure Assessment ◽

The Relationship

Reporting statistical tests for baseline measures of clinical trials does not make sense since the statistical significance is dependent on sample size, as a large trial can find significance in the same difference that a small trial did not find to be statistically significant. We use 3 published trials using the same baseline measures to provide the relationship between trial sample size and p value. For trial 1 sequential organ failure assessment (SOFA) score, p=0.01, 10.4±3.4 vs. 9.6±3.2, difference=0.8; p=0.007 for vasopressors, 83.0% vs. 72.6%. Trial 2 has SOFA score 11±3 vs. 12±3, difference=1, p=0.42. Trial 3 has vasopressors 73% vs. 83%, p=0.21. Based on trial 2, supine group has a mean of 12 and an SD of 3 for SOFA score, while prone group has a mean of 11 and an SD of 3 for SOFA score. The p values are 0.29850, 0.09877, 0.01940, 0.00094, 0.00005, and <0.00001 when n (per arm) is 20, 50, 100, 200, 300 and 400, respectively. Based on trial 3 information, the vasopressors percentages are 73.0% in the supine group vs. 83.0% in the prone group. The p values are 0.4452, 0.2274, 0.0878, 0.0158, 0.0031, and 0.0006 when n (per arm) is 20, 50, 100, 200, 300 and 400, respectively. Small trials provide larger p values than big trials for the same baseline differences. We cannot define the imbalance in baseline measures only based on these p values. There is no statistical basis for advocating the baseline difference tests.
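
The SOFA-score series can be reproduced approximately with a short sketch. It assumes the p values come from a two-sided, equal-variance two-sample t test on the summary statistics (means 12 and 11, SD 3 in both arms), which the abstract does not state explicitly. To stay within the standard library, the t-distribution tail is obtained by numerical integration, and the helper names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_t_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|],
    with the integral evaluated by Simpson's rule (steps must be even)."""
    t = abs(t)
    if t == 0.0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def baseline_p(mean1, mean2, sd, n):
    """p-value for a difference in means with a common SD and n per arm."""
    se = sd * math.sqrt(2.0 / n)
    return two_sided_t_p((mean1 - mean2) / se, 2 * n - 2)

for n in (20, 50, 100, 200, 300, 400):
    print(n, round(baseline_p(12, 11, 3, n), 5))
```

The difference between arms is held at one SOFA point throughout; only n changes, which is the abstract's point that the p value reflects sample size rather than the size of the baseline imbalance.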

Identification of best indicators of peptide-spectrum match using a permutation resampling approach

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720014400010 ◽

2014 ◽

Vol 12 (05) ◽

pp. 1440001 ◽

Cited By ~ 3

Author(s):

Malik N. Akhtar ◽

Bruce R. Southey ◽

Per E. Andrén ◽

Jonathan V. Sweedler ◽

Sandra L. Rodriguez-Zas

Keyword(s):

Mass Spectra ◽

Statistical Significance ◽

Permutation Tests ◽

Database Search ◽

Theoretical Spectrum ◽

P Values ◽

Tandem Mass Spectra ◽

Wide Range ◽

Significance Levels ◽

Peptide Match

Various indicators of observed-theoretical spectrum matches were compared and the resulting statistical significance was characterized using permutation resampling. Novel decoy databases built by resampling the terminal positions of peptide sequences were evaluated to identify the conditions for accurate computation of peptide match significance levels. The methodology was tested on real and manually curated tandem mass spectra from peptides across a wide range of sizes. Spectra match indicators from complementary database search programs were profiled and optimal indicators were identified. The combination of the optimal indicator and permuted decoy databases improved the calculation of the peptide match significance compared to the approaches currently implemented in the database search programs that rely on distributional assumptions. Permutation tests using p-values obtained from software-dependent matching scores and E-values outperformed permutation tests using all other indicators. The higher overlap in matches between the database search programs when using end permutation compared to existing approaches confirmed the superiority of the end permutation method to identify peptides. The combination of effective match indicators and the end permutation method is recommended for accurate detection of peptides.

A Note on the Accuracy of Normal Approximation of Random Quantities

Calcutta Statistical Association Bulletin ◽

10.1177/00080683211013510 ◽

2021 ◽

Vol 73 (1) ◽

pp. 62-67

Author(s):

Ibrahim A. Ahmad ◽

A. R. Mugdadi

Keyword(s):

Sample Size ◽

Order Statistic ◽

Normal Approximation ◽

Order Approximation ◽

Random Variables ◽

Random Variable ◽

Limiting Distribution ◽

Exact Order ◽

Fixed Sample ◽

Independent Identically Distributed

For a sequence of independent, identically distributed random variables (iid rv's) [Formula: see text] and a sequence of integer-valued random variables [Formula: see text], define the random quantiles as [Formula: see text], where [Formula: see text] denotes the largest integer less than or equal to [Formula: see text], and [Formula: see text] the [Formula: see text]th order statistic in a sample [Formula: see text] and [Formula: see text]. In this note, the limiting distribution and its exact order approximation are obtained for [Formula: see text]. The limiting distribution result we obtain extends the work of several authors, including Wretman[Formula: see text]. The exact order of normal approximation generalizes the fixed sample size results of Reiss[Formula: see text]. AMS 2000 subject classification: 60F12; 60F05; 62G30.
