Group Sequential Designs: A Tutorial

2021 ◽  
Author(s):  
Daniel Lakens ◽  
Friedrich Pahlke ◽  
Gernot Wassmer

This tutorial illustrates how to design, analyze, and report group sequential designs. In these designs, groups of observations are collected and repeatedly analyzed while controlling error rates. Compared to a fixed sample size design, where the data are analyzed only once, group sequential designs offer the possibility to stop the study at interim looks at the data, either for efficacy or for futility. They therefore provide greater flexibility and are more efficient: because the study can stop early, the expected sample size is smaller than in a design with no interim looks. In this tutorial we illustrate how to use the R package 'rpact' and the associated Shiny app to design studies that control the Type I error rate when repeatedly analyzing data, even when neither the number of looks at the data nor their exact timing is specified in advance. Specifically for *t*-tests, we illustrate how to perform an a priori power analysis for group sequential designs, and we explain how to stop data collection for futility by rejecting the presence of an effect of interest based on a beta-spending function. Finally, we discuss how to report adjusted effect size estimates and confidence intervals. The recent availability of accessible software such as 'rpact' makes it possible for psychologists to benefit from the efficiency gains provided by group sequential designs.
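
As a minimal sketch of the workflow this abstract describes (assuming 'rpact' version 3 or later; the number of looks, spending functions, and effect size below are illustrative choices, not recommendations from the tutorial):

# Group sequential design with O'Brien-Fleming-type alpha- and beta-spending
library(rpact)

design <- getDesignGroupSequential(
  kMax = 3,                  # maximum number of looks
  alpha = 0.025, beta = 0.2, # one-sided overall error rates
  sided = 1,
  typeOfDesign = "asOF",     # alpha-spending, O'Brien-Fleming-like
  typeBetaSpending = "bsOF"  # beta-spending, enables stopping for futility
)

# A priori power analysis for a two-sample t-test with Cohen's d = 0.5
sampleSize <- getSampleSizeMeans(design, alternative = 0.5, stDev = 1)
summary(sampleSize)          # per-look sample sizes and expected N under H0/H1

Because alpha is spent rather than fixed per look, the same design object can be recalculated if the actual timing of looks deviates from the plan.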

2018 ◽  
Vol 28 (8) ◽  
pp. 2385-2403 ◽  
Author(s):  
Tobias Mütze ◽  
Ekkehard Glimm ◽  
Heinz Schmidli ◽  
Tim Friede

Robust semiparametric models for recurrent events have received increasing attention in the analysis of clinical trials in a variety of diseases, including chronic heart failure. Compared with parametric recurrent event models, robust semiparametric models are more flexible in that neither the baseline event rate nor the process inducing between-patient heterogeneity needs to be specified by a parametric statistical model. However, implementing group sequential designs in the robust semiparametric model is complicated by the fact that the sequence of Wald statistics does not asymptotically follow the canonical joint distribution. In this manuscript, we propose two types of group sequential procedures for a robust semiparametric analysis of recurrent events. The first is based on the asymptotic covariance of the sequence of Wald statistics and guarantees asymptotic control of the type I error rate. The second is based on the canonical joint distribution; it does not guarantee asymptotic type I error rate control, but it is easy to implement and corresponds to the well-known standard approach for group sequential designs. Moreover, we describe how to determine the maximum information when planning a clinical trial with a group sequential design and a robust semiparametric analysis of recurrent events. We contrast the operating characteristics of the proposed procedures in a simulation study motivated by the ongoing phase 3 PARAGON-HF trial (ClinicalTrials.gov identifier: NCT01920711) in more than 4600 patients with chronic heart failure and preserved ejection fraction. We found that both procedures have similar operating characteristics, and that for some practically relevant scenarios the procedure based on the canonical joint distribution has advantages with respect to type I error rate control. The proposed method for calculating the maximum information results in appropriately powered trials for both procedures.
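
A rough sketch of the second ("standard approach") procedure: spend alpha at the observed information fractions and compare robust Wald statistics against the resulting boundaries. The robust semiparametric fit below is an Andersen-Gill-type model with a robust sandwich variance via survival::coxph; the data frame 'd' with columns (id, start, stop, status, trt) is hypothetical, and this is our reading of the standard approach, not the authors' code.

library(rpact)
library(survival)

infoFractions <- c(0.5, 1)   # interim look at half the maximum information
design <- getDesignGroupSequential(
  informationRates = infoFractions,
  alpha = 0.025, sided = 1, typeOfDesign = "asOF"
)
design$criticalValues        # rejection boundaries on the z-scale

# Robust semiparametric recurrent-event fit; cluster(id) triggers robust SEs
fit <- coxph(Surv(start, stop, status) ~ trt + cluster(id), data = d)
z <- coef(fit) / sqrt(diag(vcov(fit)))   # robust Wald statistic at this look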


1990 ◽  
Vol 9 (12) ◽  
pp. 1439-1445 ◽  
Author(s):  
Irving K. Hwang ◽  
Weichung J. Shih ◽  
John S. De Cani

2020 ◽  
Author(s):  
Pauline Manchon ◽  
Drifa Belhadi ◽  
France Mentré ◽  
Cédric Laouénan

Abstract

Background: Viral haemorrhagic fevers are characterized by irregular outbreaks with high mortality rates. This makes therapeutic trials difficult to implement: outbreak duration is hard to predict and can be short compared with the delays of trial launch and the recruitment of the number of subjects needed (NSN). Our objective was to compare, using clinical trial simulation, different trial designs for evaluating experimental treatments in various outbreak scenarios.

Methods: Four types of designs were compared: fixed or group-sequential, each being single- or two-arm. The primary outcome was the 14-day survival rate. For single-arm designs, results were compared to a pre-trial historical survival rate pH. Treatment efficacy was evaluated by one-sided tests of proportion (fixed designs) and Whitehead triangular tests (group-sequential designs) with a type I error of 0.025. Both the control-arm survival rate pC and the survival rate difference Δ (including 0) were varied. Three specific cases were considered: "standard" (fixed pC; the NSN is reached for fixed designs and the maximum sample size NMax for group-sequential designs); "changing with time" (pC increases over time); and "stopping of recruitment" (the epidemic ends). We calculated the proportion of simulated trials showing treatment efficacy, with K = 93,639 simulated trials per scenario so that the type I error had a 95% prediction interval of [0.024; 0.026].

Results: Under H0 (Δ = 0), the type I error was maintained in the "standard" case regardless of trial design. In the "changing with time" case, the type I error was inflated when pC > pH and deflated when pC < pH. Wrong conclusions were more often observed for single-arm designs because of the increase of Δ over time. Under H1 (Δ = +0.2), power was similar between single- and two-arm designs in the "standard" case when pC = pH. In the "stopping of recruitment" case, single-arm designs performed better than two-arm designs, and fixed designs showed higher power than group-sequential designs. A web R-Shiny application was developed.

Conclusions: At the beginning of an outbreak, group-sequential two-arm trials should be preferred, as the number of infected cases increases, allowing a well-powered randomized controlled trial to be conducted. Group-sequential designs also allow early termination in case of a harmful experimental treatment. After the epidemic peak, a fixed single-arm design should be preferred, as the number of cases decreases, but this assumes a high level of confidence in the pre-trial historical survival rate.
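
A stripped-down Monte Carlo sketch of the fixed-design comparison under H0 (the Whitehead triangular group-sequential tests are omitted for brevity; the sample size and rates below are illustrative, not the simulation settings of the study):

set.seed(1)
K <- 10000; n <- 100; pC <- 0.5   # H0: no treatment effect
pH <- 0.5                         # pre-trial historical survival rate

# Single-arm fixed design: compare observed survival to pH
oneArm <- replicate(K, {
  x <- rbinom(1, n, pC)
  binom.test(x, n, p = pH, alternative = "greater")$p.value < 0.025
})

# Two-arm fixed design: one-sided test of proportions
twoArm <- replicate(K, {
  xT <- rbinom(1, n, pC); xC <- rbinom(1, n, pC)
  prop.test(c(xT, xC), c(n, n), alternative = "greater")$p.value < 0.025
})

mean(oneArm); mean(twoArm)   # empirical type I error rates

Letting pC drift away from pH inside the replicate reproduces the "changing with time" distortion for the single-arm design, since only that design anchors its comparison to the historical rate.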


2020 ◽  
Author(s):  
Keith Lohse ◽  
Kristin Sainani ◽  
J. Andrew Taylor ◽  
Michael Lloyd Butson ◽  
Emma Knight ◽  
...  

Magnitude-based inference (MBI) is a controversial statistical method that has been used in hundreds of papers in sports science despite criticism from statisticians. To better understand how this method has been applied in practice, we systematically reviewed 232 papers that used MBI. We extracted data on study design, sample size, and choice of MBI settings and parameters. Median sample size was 10 per group (interquartile range, IQR: 8–15) for multi-group studies and 14 (IQR: 10–24) for single-group studies; few studies reported a priori sample size calculations (15%). Authors predominantly applied MBI's default settings and chose "mechanistic/non-clinical" rather than "clinical" MBI even when testing clinical interventions (only 14 studies out of 232 used clinical MBI). Using these data, we can estimate the Type I error rates for the typical MBI study. Authors frequently made dichotomous claims about effects based on the MBI criterion of a "likely" effect and sometimes based on the MBI criterion of a "possible" effect. When the sample size is n = 8 to 15 per group, these inferences have Type I error rates of 12%–22% and 22%–45%, respectively. High Type I error rates were compounded by multiple testing: authors reported results from a median of 30 tests related to outcomes, and few studies specified a primary outcome (14%). We conclude that MBI has promoted small studies, promulgated a "black box" approach to statistics, and led to numerous papers where the conclusions are not supported by the data. Amidst debates over the role of p-values and significance testing in science, MBI also provides an important natural experiment: we find no evidence that moving researchers away from p-values or null hypothesis significance testing makes them less prone to dichotomization or over-interpretation of findings.
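
To make the inflation mechanism concrete, here is a simplified simulation of MBI-style claims under the null. The decision rule (smallest worthwhile change of 0.2 SD; claim an effect when the chance it exceeds the threshold is above 75% and the chance of an opposite effect is below 5%) is our normal-approximation reading of non-clinical MBI, not the authors' analysis code:

set.seed(1)
K <- 10000; n <- 10; swc <- 0.2   # n per group; smallest worthwhile change

claims <- replicate(K, {
  x <- rnorm(n); y <- rnorm(n)               # true effect is exactly zero
  d  <- mean(x) - mean(y)
  se <- sqrt(var(x)/n + var(y)/n)
  pBen  <- 1 - pnorm((swc - d)/se)           # "chance" effect > +swc
  pHarm <- pnorm((-swc - d)/se)              # "chance" effect < -swc
  (pBen > 0.75 & pHarm < 0.05) | (pHarm > 0.75 & pBen < 0.05)
})
mean(claims)   # claim rate under the null; compare the 12%-22% reported above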


2019 ◽  
Vol 14 (2) ◽  
pp. 399-425 ◽  
Author(s):  
Haolun Shi ◽  
Guosheng Yin

2019 ◽  
Author(s):  
Rob Cribbie ◽  
Nataly Beribisky ◽  
Udi Alter

Many bodies recommend that a sample-planning procedure, such as a traditional NHST a priori power analysis, be conducted during the planning stages of a study. Power analysis allows the researcher to estimate how many participants are required to detect a minimally meaningful effect size at a specified level of power and Type I error rate. However, several drawbacks render the procedure "a mess." Specifically, identifying the minimally meaningful effect size is often difficult yet unavoidable if the procedure is to be conducted properly, the procedure is not precision oriented, and it does not encourage the researcher to collect as many participants as is feasible. In this study, we explore how these three theoretical issues are reflected in applied psychological research, in order to better understand whether they are concerns in practice. To investigate how power analysis is currently used, we reviewed the reporting of 443 power analyses in high impact psychology journals in 2016 and 2017. We found that researchers rarely use the minimally meaningful effect size as a rationale for the effect chosen in a power analysis. Further, precision-based approaches and collecting the maximum feasible sample size are almost never used in tandem with power analyses. In light of these findings, we suggest that researchers focus on tools beyond traditional power analysis when planning samples, such as collecting the maximum sample size feasible.
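
For contrast with the critique above, the traditional procedure takes one line in base R; the minimally meaningful effect of d = 0.5 is an assumed value of exactly the kind the abstract says is hard to justify:

# Participants per group to detect d = 0.5 with 80% power at alpha = .05
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
# returns n of about 64 per group for a two-sample, two-sided t-test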


1992 ◽  
Vol 71 (1) ◽  
pp. 3-14 ◽  
Author(s):  
John E. Overall ◽  
Robert S. Atlas

A statistical model for combining p values from multiple tests of significance is used to define rejection and acceptance regions for two-stage and three-stage sampling plans. Type I error rates, power, frequencies of early termination decisions, and expected sample sizes are compared. Both the two-stage and three-stage procedures provide appropriate protection against Type I errors. The two-stage sampling plan with its single interim analysis entails minimal loss in power and provides substantial reduction in expected sample size as compared with a conventional single end-of-study test of significance for which power is in the adequate range. The three-stage sampling plan with its two interim analyses introduces somewhat greater reduction in power, but it compensates with greater reduction in expected sample size. Either interim-analysis strategy is more efficient than a single end-of-study analysis in terms of power per unit of sample size.
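
A schematic two-stage plan in the spirit of the procedure described above; Fisher's method stands in for the combination model of the paper, and the stage-1 boundaries (0.01 for early rejection, 0.50 for early acceptance) are arbitrary illustrative choices, not the published ones:

set.seed(1)
# Fisher's combination of two independent stage-wise p values
fisherReject <- function(p1, p2, alpha = 0.05)
  -2 * (log(p1) + log(p2)) > qchisq(1 - alpha, df = 4)

K <- 20000
reject <- replicate(K, {
  p1 <- runif(1)                    # stage-wise p value under H0
  if (p1 < 0.01) TRUE               # early rejection at the interim
  else if (p1 > 0.50) FALSE         # early acceptance (futility)
  else fisherReject(p1, runif(1))   # continue; combine with stage-2 p value
})
mean(reject)   # empirical overall type I error of the two-stage plan

Tracking how often the interim stops the trial, and at what sample size, gives the expected-sample-size comparison the abstract reports.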


2018 ◽  
Vol 28 (7) ◽  
pp. 2179-2195 ◽  
Author(s):  
Chieh Chiang ◽  
Chin-Fu Hsiao

Multiregional clinical trials have been accepted in recent years as a useful means of accelerating the development of new drugs and abridging their approval time. The statistical properties of multiregional clinical trials are being widely discussed. In practice, the variance of a continuous response may differ from region to region, which makes assessment of the efficacy response a Behrens–Fisher problem: there is no exact test or interval estimator for the mean difference under unequal variances. As a solution, this study applies interval estimation of the efficacy response based on Howe's, Cochran–Cox's, and Satterthwaite's approximations, which have been shown to have well-controlled type I error rates. However, traditional sample size determination cannot be applied to these interval estimators, so a sample size determination that achieves a desired power based on them is presented. Moreover, the consistency criteria suggested by the Japanese Ministry of Health, Labour and Welfare guidance, which are used to decide whether the overall results of a multiregional clinical trial can be applied to a specific region, were also evaluated under the proposed interval estimation. A real example is used to illustrate the proposed method. The results of simulation studies indicate that the proposed method can correctly determine the required sample size and evaluate the assurance probability of the consistency criteria.
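
A minimal sketch of the Satterthwaite-type interval for a mean difference under unequal variances (Howe's and Cochran–Cox's approximations would replace only the degrees-of-freedom/quantile step; the data below are simulated stand-ins for two regions):

set.seed(1)
x <- rnorm(30, mean = 1.0, sd = 1)   # illustrative region-1 responses
y <- rnorm(50, mean = 0.6, sd = 2)   # illustrative region-2 responses

d  <- mean(x) - mean(y)
se <- sqrt(var(x)/length(x) + var(y)/length(y))
df <- se^4 / ((var(x)/length(x))^2 / (length(x) - 1) +
              (var(y)/length(y))^2 / (length(y) - 1))  # Satterthwaite df
d + c(-1, 1) * qt(0.975, df) * se                      # 95% interval

t.test(x, y)$conf.int   # the same interval via Welch's t-test in base R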


2019 ◽  
Vol 3 (Supplement_1) ◽  
Author(s):  
Keisuke Ejima ◽  
Andrew Brown ◽  
Daniel Smith ◽  
Ufuk Beyaztas ◽  
David Allison

Abstract

Objectives: Awareness of rigor, reproducibility and transparency (RRT) has expanded over the last decade. Although RRT can be improved in many ways, we focused on the type I error rates and power of commonly used statistical analyses for testing mean differences between two groups with small (n ≤ 5) to moderate sample sizes.

Methods: We compared data from five distinct, homozygous, monogenic, murine models of obesity with non-mutant controls of both sexes. Baseline weight (7–11 weeks old) was the outcome. To examine whether the type I error rate could be affected by the choice of statistical test, we adjusted the empirical distributions of weights to enforce the null hypothesis (i.e., no mean difference) in two ways: Case 1) centering both weight distributions on the same mean weight; Case 2) combining the data from the control and mutant groups into one distribution. From these cases, 3 to 20 mice were resampled to create a 'plasmode' dataset. We performed five common tests (Student's t-test, Welch's t-test, the Wilcoxon test, the permutation test, and the bootstrap test) on the plasmodes and computed type I error rates. Power was assessed using plasmodes in which the distribution of the control group was shifted by adding a constant value, as in Case 1, but so as to realize nominal effect sizes.

Results: Type I error rates were substantially higher than the nominal significance level (type I error rate inflation) for Student's t-test, Welch's t-test, and the permutation test in Case 1, especially when the sample size was small, whereas in Case 2 inflation was observed only for the permutation test. Deflation was noted for the bootstrap test with small samples. Increasing the sample size mitigated both inflation and deflation, except for the Wilcoxon test in Case 1, because the heterogeneity of the weight distributions between groups violated its assumptions for testing mean differences. For power, a departure from the reference value was observed with small samples. Compared with the other tests, the bootstrap was underpowered with small samples, as a trade-off for maintaining type I error rates.

Conclusions: With small samples (n ≤ 5), the bootstrap avoided type I error rate inflation, but often at the cost of lower power. To avoid type I error rate inflation for the other tests, the sample size should be increased. The Wilcoxon test should be avoided because of the heterogeneity of weight distributions between mutant and control mice.

Funding Sources: This study was supported in part by NIH and a Japan Society for the Promotion of Science (JSPS) KAKENHI grant.
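
A schematic version of the Case 2 plasmode approach: pool the two groups into one null distribution, resample small groups, and estimate each test's type I error. The 'weights' vector is a simulated stand-in for the real mouse data, which we do not have:

set.seed(1)
weights <- rlnorm(200, meanlog = 3, sdlog = 0.2)  # stand-in pooled weights
n <- 5; K <- 5000                                 # group size; replications

rates <- sapply(c(student = TRUE, welch = FALSE), function(eqVar) {
  mean(replicate(K, {
    g1 <- sample(weights, n); g2 <- sample(weights, n)  # null resampling
    t.test(g1, g2, var.equal = eqVar)$p.value < 0.05
  }))
})
rates   # empirical type I error of Student's vs Welch's t-test at n = 5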

