Is Differential Noneffortful Responding Associated with Type I Error in Measurement Invariance Testing?

2021
pp. 001316442199042
Author(s):
Joseph A. Rios
Low test-taking effort is a common validity threat when examinees perceive an assessment context to have minimal personal value. Prior research has shown that in such contexts, subgroups may differ in their effort, which raises two concerns when making subgroup mean comparisons. First, it is unclear how differential effort could influence evaluations of scale property equivalence. Second, even when full scalar invariance is attained, the degree to which differential effort can bias subgroup mean comparisons is unknown. To address these issues, a simulation study examined the influence of differential noneffortful responding (NER) on evaluations of measurement invariance and latent mean comparisons. Results showed that as differential rates of NER grew, inflated Type I error rates in invariance testing were observed only at the metric invariance level, whereas no negative effects were apparent for configural or scalar invariance. When full scalar invariance was correctly attained, differential NER biased mean score comparisons by as much as 0.18 standard deviations at a differential NER rate of 7%. These findings suggest that test users should evaluate and document potential differential NER before conducting measurement quality analyses and before reporting disaggregated subgroup mean performance.

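To make the mechanism behind these findings concrete, the sketch below simulates how random (noneffortful) responding concentrated in one subgroup can distort an observed mean comparison even when both groups share the same true ability distribution. It is a minimal illustration under assumed values (30 dichotomous items, 1,000 examinees per group, a 7% differential NER rate, chance-level random responding), not a reproduction of the author's simulation design or invariance analyses.

```python
import numpy as np

rng = np.random.default_rng(42)

n_items, n_per_group = 30, 1000
difficulty = rng.normal(0.0, 1.0, n_items)      # assumed item difficulties

def simulate_group(n, ner_rate):
    """Simulate number-correct scores; a fraction `ner_rate` of examinees
    respond randomly (noneffortful responding) instead of by ability."""
    theta = rng.normal(0.0, 1.0, n)              # identical true ability in both groups
    p_correct = 1.0 / (1.0 + np.exp(-(theta[:, None] - difficulty[None, :])))
    responses = rng.random((n, n_items)) < p_correct
    # Noneffortful examinees answer each item correctly only at chance (0.25)
    ner_mask = rng.random(n) < ner_rate
    random_resp = rng.random((n, n_items)) < 0.25
    responses[ner_mask] = random_resp[ner_mask]
    return responses.sum(axis=1)

focal = simulate_group(n_per_group, ner_rate=0.07)   # 7% differential NER
reference = simulate_group(n_per_group, ner_rate=0.00)

pooled_sd = np.sqrt((focal.var(ddof=1) + reference.var(ddof=1)) / 2)
print("Standardized observed mean difference:",
      (focal.mean() - reference.mean()) / pooled_sd)
```

Running the sketch typically produces a spurious negative standardized difference for the group containing noneffortful examinees, even though the two groups' true abilities are identical; the study quantifies the analogous bias at the latent level under invariance testing.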

2011
Vol 72 (3)
pp. 469-492
Author(s):
Eun Sook Kim
Myeongsun Yoon
Taehun Lee

Multiple-indicators multiple-causes (MIMIC) modeling is often used to test a latent group mean difference while assuming the equivalence of factor loadings and intercepts over groups. However, this study demonstrated that MIMIC was insensitive to the presence of factor loading noninvariance, which implies that factor loading invariance should be tested through other measurement invariance testing techniques. MIMIC modeling is also used for measurement invariance testing by allowing a direct path from a grouping covariate to each observed variable. This simulation study with both continuous and categorical variables investigated the performance of MIMIC in detecting noninvariant variables under various study conditions and showed that the likelihood ratio test of MIMIC with Oort adjustment not only controlled Type I error rates below the nominal level but also maintained high power across study conditions.

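For context, a generic MIMIC specification with direct effects is sketched below in standard SEM notation; this is the textbook form of the model rather than the exact parameterization used in the study.

```latex
y_j = \nu_j + \lambda_j \eta + \kappa_j x + \varepsilon_j, \qquad
\eta = \gamma x + \zeta
```

Here x is the grouping covariate, γ carries the latent mean difference, and a nonzero direct effect κ_j signals noninvariance of indicator y_j; the likelihood ratio test compares nested models with κ_j free versus fixed at zero, with the Oort adjustment modifying the critical value of that test.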

2016
Vol 32 (4)
pp. 265-272
Author(s):
Mohsen Joshanloo
Ali Bakhshi

Abstract. This study investigated the factor structure and measurement invariance of Mroczek and Kolarz's scales of positive and negative affect in Iran (N = 2,391) and the USA (N = 2,154), and across gender groups. The two-factor model of affect was supported across the groups. Measurement invariance testing confirmed full metric and partial scalar invariance of the scales across cultural groups, and full metric and full scalar invariance across gender groups. Latent mean analysis revealed that Iranians scored lower on positive affect and higher on negative affect than Americans. The analyses also showed that American men scored significantly lower than American women on negative affect. The significance and implications of the results are discussed.


1988
Vol 13 (3)
pp. 215-226
Author(s):
H. J. Keselman
Joanne C. Keselman

Two Tukey multiple comparison procedures (MCPs) as well as a Bonferroni and a multivariate approach were compared for their rates of Type I error and any-pairs power when multisample sphericity was not satisfied and the design was unbalanced. Pairwise comparisons of unweighted and weighted repeated measures means were computed. Results indicated that heterogeneous covariance matrices in combination with unequal group sizes resulted in substantially inflated rates of Type I error for all MCPs involving comparisons of unweighted means. For tests of weighted means, both the Bonferroni and a multivariate critical value limited the number of Type I errors; however, the Bonferroni procedure provided a more powerful test, particularly when the number of repeated measures treatment levels was large.

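As a concrete illustration of the Bonferroni approach to pairwise comparisons of repeated measures means, the sketch below runs all pairwise paired t tests and multiplies each p value by the number of comparisons. It is a generic illustration under an assumed data layout (subjects by treatment levels) and simulated data, not the weighted- and unweighted-means procedures evaluated in the study.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 4))          # assumed layout: 20 subjects x 4 repeated measures
n_levels = data.shape[1]
pairs = list(combinations(range(n_levels), 2))
n_comparisons = len(pairs)

for i, j in pairs:
    t_stat, p_raw = ttest_rel(data[:, i], data[:, j])   # paired t test on the two levels
    p_bonf = min(1.0, p_raw * n_comparisons)            # Bonferroni adjustment
    print(f"levels {i} vs {j}: t = {t_stat:.2f}, adjusted p = {p_bonf:.3f}")
```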

Author(s):  
C. Y. Fu
J. R. Tsay

Because the land surface changes continually, through both natural processes and human activity, DEMs must be updated regularly so that applications can rely on current data. However, the cost of wide-area DEM production is high. DEMs that cover the same area but differ in quality, grid size, generation time, or production method are called multi-source DEMs, and fusing them offers a low-cost solution for DEM updating. The DEM coverage must first be classified by slope and visibility, because the precision of DEM grid points differs across areas with different slopes and visibilities. Next, a difference DEM (dDEM) is computed by subtracting the two DEMs. The dDEM is assumed to contain only random error and to follow a normal distribution, so a Student's t-test is applied for blunder detection, yielding three kinds of rejected grid points. The first kind consists of blunders, which must be eliminated. The second kind lies in change areas, where the most recent data are taken as the fusion result. The third kind comprises Type I errors, which are actually correct data and must be retained for fusion. The experimental results show that terrain classification improves blunder detection, and that a proper choice of significance level (α) detects real blunders without producing too many Type I errors. Weighted averaging is chosen as the DEM fusion algorithm, with the a priori precisions given by our national DEM production guideline used to define the weights. Fisher's test confirms that the a priori precisions correspond to the RMSEs of the blunder-detection results.

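The core of the pipeline (difference DEM, blunder screening, weighted averaging) can be sketched as below. This is a simplified, hypothetical illustration that uses a normal-approximation threshold in place of the full Student test and terrain classification described by the authors; the grids and a priori standard deviations are invented values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two DEMs of the same area on a common grid (assumed already co-registered).
dem_old = rng.normal(500.0, 20.0, size=(100, 100))
dem_new = dem_old + rng.normal(0.0, 0.5, size=(100, 100))   # mostly random error

sigma_old, sigma_new = 0.4, 0.3          # hypothetical a priori precisions (m)
sigma_d = np.hypot(sigma_old, sigma_new) # expected std. dev. of the difference DEM

# Difference DEM and blunder screening: flag cells whose standardized
# difference exceeds a two-sided critical value (alpha ~ 0.05).
ddem = dem_new - dem_old
z = (ddem - ddem.mean()) / sigma_d
flagged = np.abs(z) > 1.96

# Fusion by inverse-variance weighted averaging on the accepted cells;
# flagged cells here fall back to the newer DEM (as the paper does for
# change areas; true blunders would instead be eliminated).
w_old, w_new = 1.0 / sigma_old**2, 1.0 / sigma_new**2
fused = (w_old * dem_old + w_new * dem_new) / (w_old + w_new)
fused[flagged] = dem_new[flagged]

print("flagged cells:", int(flagged.sum()), "of", ddem.size)
```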

Methodology
2016
Vol 12 (2)
pp. 44-51
Author(s):
José Manuel Caperos
Ricardo Olmos
Antonio Pardo

Abstract. Correlation analysis is one of the most widely used methods to test hypotheses in the social and health sciences; however, its use is not error free. We explored the frequency of inconsistencies between reported p-values and the associated test statistics in 186 papers published in four Spanish journals of psychology (1,950 correlation tests); we also collected information about the use of one- versus two-tailed tests in the presence of directional hypotheses, and about the use of adjustments to control Type I errors due to simultaneous inference. Of the reported correlation tests, 83.8% are incomplete and 92.5% include an inexact p-value. Gross inconsistencies, which are liable to alter the statistical conclusions, appear in 4% of the reviewed tests, and 26.9% of the inconsistencies found were large enough to bias the results of a meta-analysis. The use of one-tailed tests and of adjustments to control the Type I error rate is negligible. We therefore urge authors, reviewers, and editorial boards to pay particular attention to this issue in order to prevent inconsistencies in statistical reports.

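The consistency check at the heart of such reviews can be reproduced from the reported statistics alone: given a Pearson r and the sample size n, the test statistic and p value follow directly, so a recomputed p can be compared with the reported one. The sketch below shows this check for a two-tailed test; the example values are invented.

```python
from math import sqrt

from scipy.stats import t as t_dist

def recomputed_p(r, n):
    """Two-tailed p value implied by a Pearson r and sample size n."""
    df = n - 2
    t_stat = r * sqrt(df) / sqrt(1.0 - r**2)
    return 2.0 * t_dist.sf(abs(t_stat), df)

# Invented example: a paper reports r = .25, n = 80, p < .01.
r_reported, n_reported, p_reported = 0.25, 80, 0.01
p_check = recomputed_p(r_reported, n_reported)
print(f"recomputed p = {p_check:.4f}")          # about .025, so "p < .01" would be inconsistent
print("consistent with report:", p_check < p_reported)
```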

2020
Vol 43 (3)
pp. 605-616
Author(s):
Marc J. Lanovaz
Stéphanie Turgeon

Abstract. Design quality guidelines typically recommend that multiple baseline designs include at least three demonstrations of effects. Despite its widespread adoption, this recommendation does not appear grounded in empirical evidence. The main purpose of our study was to address this issue by assessing Type I error rate and power in multiple baseline designs. First, we generated 10,000 multiple baseline graphs, applied the dual-criteria method to each tier, and computed Type I error rate and power for different numbers of tiers showing a clear change. Second, two raters categorized the tiers for 300 multiple baseline graphs to replicate our analyses using visual inspection. When multiple baseline designs had at least three tiers and two or more of these tiers showed a clear change, the Type I error rate remained adequate (< .05) while power also reached acceptable levels (> .80). In contrast, requiring all tiers to show a clear change resulted in overly stringent conclusions (i.e., unacceptably low power). Therefore, our results suggest that researchers and practitioners should carefully consider limitations in power when requiring all tiers of a multiple baseline design to show a clear change in their analyses.

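A rough sketch of the dual-criteria logic applied to a single tier is given below: the baseline mean line and ordinary least squares trend line are projected into the treatment phase, and the number of treatment points falling beyond both lines is counted. The data, the expected direction of change, and any decision cutoff are invented for illustration; published criterion tables (or simulation, as in the study) determine how many such points constitute a "clear change".

```python
import numpy as np

def dual_criteria_count(baseline, treatment, expect_increase=True):
    """Count treatment-phase points beyond both the baseline mean line
    and the baseline OLS trend line (dual-criteria logic, single tier)."""
    x_base = np.arange(len(baseline))
    slope, intercept = np.polyfit(x_base, baseline, 1)      # baseline trend line
    x_treat = np.arange(len(baseline), len(baseline) + len(treatment))
    trend_line = slope * x_treat + intercept
    mean_line = np.full_like(trend_line, np.mean(baseline))
    if expect_increase:
        beyond = (treatment > trend_line) & (treatment > mean_line)
    else:
        beyond = (treatment < trend_line) & (treatment < mean_line)
    return int(beyond.sum())

baseline = np.array([2.0, 3.0, 2.5, 3.5, 3.0])
treatment = np.array([5.0, 6.0, 5.5, 7.0, 6.5, 7.5])
count = dual_criteria_count(baseline, treatment)
print(count, "of", len(treatment), "treatment points exceed both criterion lines")
```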

1994
Vol 19 (2)
pp. 119-126
Author(s):
Ru San Chen
William P. Dunlap

Lecoutre (1991) has pointed out an error in the Huynh and Feldt (1976) formula for ɛ̃ used to adjust the degrees of freedom for an approximate test in repeated measures designs with two or more independent groups. The present simulation study confirms that Lecoutre's corrected ɛ̃ yields less biased estimation of the population ɛ and reduces Type I error rates compared to Huynh and Feldt's (1976) ɛ̃. The gain in Type I error accuracy for group × treatment interactions may become substantial when sample sizes are close to the number of treatment levels.

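For reference, the two estimators are sketched below as they are commonly presented in the literature (my transcription, not a quotation from either paper), with N the total sample size, g the number of groups, k the number of repeated measures levels, and ε̂ the Greenhouse-Geisser estimate; Lecoutre's correction replaces N with N − g + 1 in the numerator.

```latex
\tilde{\varepsilon}_{\mathrm{HF}} =
  \frac{N(k-1)\hat{\varepsilon} - 2}
       {(k-1)\bigl[N - g - (k-1)\hat{\varepsilon}\bigr]},
\qquad
\tilde{\varepsilon}_{\mathrm{corrected}} =
  \frac{(N - g + 1)(k-1)\hat{\varepsilon} - 2}
       {(k-1)\bigl[N - g - (k-1)\hat{\varepsilon}\bigr]}.
```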
