Investigating rater severity/leniency in interpreter performance testing

Interpreting, 2015, Vol. 17 (2), pp. 255-283. Author(s): Chao Han

Rater-mediated performance assessment (RMPA) is a critical component of interpreter certification testing systems worldwide. Given the acknowledged rater variability in RMPA and the high-stakes nature of certification testing, it is crucial to ensure rater reliability in interpreter certification performance testing (ICPT). However, a review of current ICPT practice indicates that rigorous research on rater reliability is lacking. Against this background, the present study reports on the use of multifaceted Rasch measurement (MFRM) to identify the degree of severity/leniency in different raters’ assessments of simultaneous interpretations (SIs) by 32 interpreters in an experimental setting. Nine raters specifically trained for the purpose were asked to evaluate four English-to-Chinese SIs by each of the interpreters, using three 8-point rating scales (information content, fluency, expression). The source texts differed in speed and in the speaker’s accent (native vs non-native). Rater-generated scores were then subjected to MFRM analysis, using the FACETS program. The following general trends emerged: 1) homogeneity statistics showed that not all raters were equally severe overall; and 2) bias analyses showed that a relatively large proportion of the raters had significantly biased interactions with the interpreters and the assessment criteria. Implications for practical rating arrangements in ICPT, and for rater training, are discussed.
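For readers unfamiliar with the model, a minimal sketch of one common MFRM parameterization for this design (interpreter ability θ_n, rater severity α_i, criterion difficulty β_j, rating-scale threshold τ_k; not necessarily the exact specification estimated in FACETS for this study) expresses the log-odds of interpreter n receiving category k rather than k−1 from rater i on criterion j as:

\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \alpha_i - \beta_j - \tau_k

A rater with a larger severity estimate α_i systematically depresses scores across all interpreters; the homogeneity and bias statistics mentioned above test, respectively, whether the α_i are equal and whether particular rater-by-interpreter or rater-by-criterion combinations depart from this additive model.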

Interpreting, 2016, Vol. 18 (2), pp. 225-252. Author(s): Chao Han, Helen Slatyer

Over the past decade, interpreter certification performance testing has gained momentum. Certification tests often involve high stakes, since they can play an important role in regulating access to professional practice and serve to provide a measure of professional competence for end users. The decision to award certification is based on inferences from candidates’ test scores about their knowledge, skills and abilities, as well as their interpreting performance in a given target domain. To justify the appropriateness of score-based inferences and actions, test developers need to provide evidence that the test is valid and reliable through a process of test validation. However, there is little evidence that test qualities are systematically evaluated in interpreter certification testing. In an attempt to address this problem, this paper proposes a theoretical argument-based validation framework for interpreter certification performance tests so as to guide testers in carrying out systematic validation research. Before presenting the framework, validity theory is reviewed, and an examination of the argument-based approach to validation is provided. A validity argument for interpreter tests is then proposed, with hypothesized validity evidence. Examples of evidence are drawn from relevant empirical work, where available. Gaps in the available evidence are highlighted and suggestions for research are made.


2017, Vol. 78 (3), pp. 430-459. Author(s): Iasonas Lamprianou

It is common practice for assessment programs to organize qualifying sessions during which the raters (often known as “markers” or “judges”) demonstrate their consistency before operational rating commences. Because of the high-stakes nature of many rating activities, the research community continuously explores new methods of analyzing rating data. We used simulated and empirical data from two high-stakes language assessments to propose a new approach, based on social network analysis and exponential random graph models, for evaluating the readiness of a group of raters for operational rating. The results of this innovative approach are compared with the results of a Rasch analysis, a well-established approach for the analysis of such data. We also demonstrate how the new approach can be used in practice to investigate important research questions, such as whether rater severity is stable across rating tasks. The merits of the new approach and its consequences for practice are discussed.
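The abstract does not reproduce the paper's network specification; as an illustrative sketch only (the scores, the eight raters and the 0.2 cutoff below are hypothetical), raters can be treated as nodes with edges weighted by pairwise exact agreement, producing the kind of whole-network statistics an exponential random graph model would then fit. ERGM estimation itself is typically done with R's ergm package; this sketch only builds the network.

import itertools
import numpy as np
import networkx as nx

# Hypothetical scores: rows = scripts, columns = raters (ordinal 1-6 scale).
rng = np.random.default_rng(0)
scores = rng.integers(1, 7, size=(50, 8))
raters = [f"R{i}" for i in range(scores.shape[1])]

# Agreement network: connect two raters when their exact-agreement rate
# across the scripts they both rated exceeds an (arbitrary) cutoff.
G = nx.Graph()
G.add_nodes_from(raters)
for (i, a), (j, b) in itertools.combinations(enumerate(raters), 2):
    agreement = float(np.mean(scores[:, i] == scores[:, j]))
    if agreement > 0.2:
        G.add_edge(a, b, weight=agreement)

# Whole-network statistics of the kind an ERGM would then model.
print(nx.density(G), nx.average_clustering(G))

A Rasch analysis of the same data would instead estimate a severity parameter per rater; comparing the two perspectives is the paper's contribution.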


2006, Vol. 28 (3), pp. 212-217. Author(s): Flávia de Lima Osório, José Alexandre de Souza Crippa, Sonia Regina Loureiro

OBJECTIVE: To present the translation and validation of the Brief Social Phobia Scale for use in Brazilian Portuguese, to develop a structured interview guide to systematize its use, and to perform a preliminary study of inter-rater reliability. METHOD: The instrument was translated and adapted to Portuguese by specialists in anxiety disorders and rating scales. A structured interview guide was created to cover all items of the instrument, grouping them into six categories. Specialists in mental health evaluated the guide. These professionals also watched videotaped interviews of patients with and without social anxiety disorder and, based on the interview guide, rated the scale so that its reliability could be evaluated. RESULTS: No semantic or linguistic adjustments were needed. For the complete scale, the overall evaluation showed a percentage of agreement of 0.84 and an intraclass correlation coefficient of 0.91. The mean inter-rater correlation was 0.84. CONCLUSIONS: The Portuguese-language version of the Brief Social Phobia Scale is available for use in the Brazilian population, with acceptable indicators of inter-rater reliability. The interview guide was useful in obtaining these values. Further studies are needed to improve reliability and to examine other psychometric properties of the instrument.
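The two headline statistics, percentage of exact agreement and mean inter-rater correlation, are straightforward to compute; the sketch below uses made-up ratings rather than the study's data:

import itertools
import numpy as np

# Hypothetical ratings: rows = videotaped interviews, columns = raters.
ratings = np.array([
    [3, 3, 4],
    [1, 1, 1],
    [4, 5, 4],
    [2, 2, 3],
    [0, 0, 0],
])

pairs = list(itertools.combinations(range(ratings.shape[1]), 2))

# Percentage of exact agreement, averaged over all rater pairs.
pct_agreement = np.mean([np.mean(ratings[:, i] == ratings[:, j]) for i, j in pairs])

# Mean inter-rater (Pearson) correlation over all rater pairs.
mean_r = np.mean([np.corrcoef(ratings[:, i], ratings[:, j])[0, 1] for i, j in pairs])

print(f"agreement = {pct_agreement:.2f}, mean inter-rater r = {mean_r:.2f}")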


2020, Vol. 13 (9), p. 94. Author(s): Xin Qu

The present study validated the ELT Certificate Lesson Observation and Report Task (ELTC-LORT), developed by China Language Assessment to certify China’s EFL teachers through performance-based testing. The ELT Certificate is high-stakes, given its impact on candidates’ recruitment, on ELT in China and on the quality of education, so its validation is crucial to guarantee fairness and justice. The validity of the task construct and rating rubric was examined using many-facet Rasch measurement supplemented with qualitative interviews. Participants (N = 40) were shown a video excerpt from a real EFL lesson and required to deliver a report on the teacher’s performance. Two raters graded the recordings of the candidates’ reports using rating scales developed to measure EFL teacher candidates’ oral English proficiency and their ability to analyze and evaluate teaching. Many-facet Rasch analysis yielded a successful estimation, with a noticeable spread among the participants on the measured traits, indicating that the task functioned well in measuring candidates’ performance and differentiating their ability levels. The raters showed good internal self-consistency but were not equally lenient. The rating scales worked well, with average measures advancing largely in line with Rasch expectations. Semi-structured and focus group interviews were conducted to shed light on the raters’ performance and the functioning of the rating scale items. The findings carry implications for further research on, and practice of, the Certificate.
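The pattern of raters being internally self-consistent yet unequally lenient has a simple numerical signature: scores that correlate highly across raters while differing by a roughly constant offset. A small simulation with hypothetical data illustrates it:

import numpy as np

# Two hypothetical raters who rank candidates almost identically (high
# self-consistency) but differ in leniency by a constant offset.
rng = np.random.default_rng(1)
true_ability = rng.normal(0, 1, 40)
rater_a = true_ability + rng.normal(0, 0.1, 40)          # baseline severity
rater_b = true_ability + 0.5 + rng.normal(0, 0.1, 40)    # half a point more lenient

r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"consistency r = {r:.2f}, mean difference = {np.mean(rater_b - rater_a):.2f}")
# High correlation with a nonzero mean difference: internally consistent
# raters who are not equally lenient; a many-facet Rasch model absorbs
# the offset into a per-rater severity parameter.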


PEDIATRICS, 2000, Vol. 105 (Supplement_2), pp. 238-241. Author(s): Elizabeth H. Morrison, Janet Palmer Hafler

Resident physicians spend numerous hours every week teaching medical students and fellow residents, and only rarely are they taught how to teach. They can, however, be taught to teach more effectively. Teaching skills improvement initiatives for residents are taking a more prominent place in the educational literature. Limited evidence now suggests that better resident teachers mean better academic performance by learners. A small but important body of research supports selected interventions designed to improve residents' teaching skills, but not all studies have demonstrated significant educational benefits for learners. An increasing number of valid and reliable instruments are available to assess residents' clinical teaching, including objective structured teaching examinations and rating scales. In all specialties, rigorous research in evidence-based teacher training for residents will help prepare academic medical centers to meet the diverse and changing learning needs of today's physicians-in-training.
Keywords: resident physicians, medical students, fellow residents, teaching, graduate medical education.


2019, Vol. 76 (4), pp. 1088-1093. Author(s): Nada Gawad, Amanda Fowler, Richard Mimeault, Isabelle Raiche

2019, Vol. 5 (1), e000541. Author(s): John Ressman, Wilhelmus Johannes Andreas Grooten, Eva Rasmussen Barr

Single leg squat (SLS) is a common tool used in clinical examination to set and evaluate rehabilitation goals, but also to assess lower extremity function in active people.
Objectives: To conduct a review and meta-analysis of the inter-rater and intrarater reliability of the SLS, including the lateral step-down (LSD) and forward step-down (FSD) tests.
Design: Review with meta-analysis.
Data sources: CINAHL, Cochrane Library, Embase, Medline (OVID) and Web of Science were searched up until December 2018.
Eligibility criteria: Studies were eligible for inclusion if they were methodological studies which assessed the inter-rater and/or intrarater reliability of the SLS, FSD and LSD through observation of movement quality.
Results: Thirty-one studies were included. Reliability varied largely between studies (inter-rater: kappa/intraclass correlation coefficients (ICC) = 0.00–0.95; intrarater: kappa/ICC = 0.13–1.00), but most studies reached ‘moderate’ measures of agreement. The pooled ICC/kappa results showed ‘moderate’ agreement for inter-rater reliability, 0.58 (95% CI 0.50 to 0.65), and ‘substantial’ agreement for intrarater reliability, 0.68 (95% CI 0.60 to 0.74). Subgroup analyses showed higher pooled agreement for inter-rater reliability with ≤3-point rating scales, while no difference was found for different numbers of segmental assessments.
Conclusion: Our findings indicate that the SLS test, including the FSD and LSD tests, can be suitable for clinical use regardless of the number of observed segments, particularly with a ≤3-point rating scale. Since most of the included studies were affected by some form of methodological bias, our findings must be interpreted with caution.
PROSPERO registration number: CRD42018077822.
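One common way to pool correlation-like reliability indices of the kind quoted above is to transform each study's coefficient to Fisher's z and apply a DerSimonian-Laird random-effects model. The sketch below makes illustrative assumptions throughout: the coefficients, sample sizes and the 1/(n − 3) variance approximation are not taken from the review.

import numpy as np

# Illustrative per-study reliability coefficients and sample sizes
# (not the review's data), pooled on Fisher's z scale.
coefs = np.array([0.45, 0.62, 0.58, 0.70, 0.51])
n = np.array([30, 55, 42, 25, 60])

z = np.arctanh(coefs)          # Fisher z transform
v = 1.0 / (n - 3)              # approximate within-study variances
w = 1.0 / v

# Between-study variance tau^2 (DerSimonian-Laird).
z_fixed = np.sum(w * z) / np.sum(w)
Q = np.sum(w * (z - z_fixed) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(z) - 1)) / c)

# Random-effects pooled estimate, back-transformed with a 95% CI.
w_re = 1.0 / (v + tau2)
z_pooled = np.sum(w_re * z) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))
lo, hi = np.tanh(z_pooled - 1.96 * se), np.tanh(z_pooled + 1.96 * se)
print(f"pooled = {np.tanh(z_pooled):.2f} (95% CI {lo:.2f} to {hi:.2f})")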


2020, Article No. 001316442093486. Author(s): Niklas Schulte, Heinz Holling, Paul-Christian Bürkner

Forced-choice questionnaires can prevent faking and other response biases typically associated with rating scales. However, the derived trait scores are often unreliable and ipsative, making interindividual comparisons in high-stakes situations impossible. Several studies suggest that these problems vanish if the number of measured traits is high. To determine the necessary number of traits under varying sample sizes, factor loadings, and intertrait correlations, simulations were performed for the two most widely used scoring methods, namely the classical (ipsative) approach and Thurstonian item response theory (IRT) models. Results demonstrate that while Thurstonian IRT models in particular perform well under ideal conditions, both methods yield insufficient reliabilities in most conditions resembling applied contexts. Moreover, not only the classical estimates but also the Thurstonian IRT estimates for questionnaires with equally keyed items remain (partially) ipsative, even when the number of traits is very high (i.e., 30). This result not only questions earlier assumptions regarding the use of classical scores in high-dimensional questionnaires, but also raises doubts about many validation studies on Thurstonian IRT models, because correlations of (partially) ipsative scores with external criteria cannot be interpreted in the usual way.
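Ipsativity under classical scoring is easy to demonstrate: because each forced-choice block distributes a fixed number of points among the traits it measures, every respondent's trait scores sum to the same constant, so the total carries no interindividual information. A minimal sketch with a hypothetical block structure (not the paper's simulation design):

import numpy as np

# Classical scoring of a fully forced-choice questionnaire: each block
# asks the respondent to rank-order one item per trait, and the ranks
# are awarded as points to the matching traits.
rng = np.random.default_rng(2)
n_persons, n_traits, n_blocks = 5, 4, 12

scores = np.zeros((n_persons, n_traits))
for _ in range(n_blocks):
    ranks = np.array([rng.permutation(n_traits) + 1 for _ in range(n_persons)])
    scores += ranks

# Every row sums to n_blocks * (1 + 2 + ... + n_traits) = 120: the total
# is identical for all respondents, so the trait scores are ipsative.
print(scores.sum(axis=1))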


Author(s): Alexandra de Raadt, Matthijs J. Warrens, Roel J. Bosker, Henk A. L. Kiers

Kappa coefficients are commonly used for quantifying reliability on a categorical scale, whereas correlation coefficients are commonly applied to assess reliability on an interval scale. Both types of coefficients can be used to assess the reliability of ordinal rating scales. In this study, we compare seven reliability coefficients for ordinal rating scales: the kappa coefficients included are Cohen’s kappa, linearly weighted kappa, and quadratically weighted kappa; the correlation coefficients included are intraclass correlation ICC(3,1), Pearson’s correlation, Spearman’s rho, and Kendall’s tau-b. The primary goal is to provide a thorough understanding of these coefficients so that applied researchers can make a sensible choice for ordinal rating scales. A second aim is to find out whether the choice of coefficient matters. We studied to what extent the same conclusions about inter-rater reliability are reached with different coefficients, and to what extent the coefficients measure agreement in a similar way, using analytic methods as well as simulated and empirical data. Analytically, it is shown that differences between quadratic kappa and the Pearson and intraclass correlations increase as agreement becomes larger; differences between the three coefficients are generally small if differences between rater means and variances are small. Furthermore, the simulated and empirical data show that differences between all reliability coefficients tend to increase as agreement between the raters increases. Moreover, for the data in this study, the same conclusion about inter-rater reliability was reached in virtually all cases with the four correlation coefficients, and quadratically weighted kappa led to a similar conclusion as the correlation coefficients in a great number of cases. Hence, for the data in this study, it does not really matter which of these five coefficients is used. The four correlation coefficients and quadratically weighted kappa also tend to measure agreement in a similar way: their values are very highly correlated for the data in this study.
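Six of the seven coefficients compared here are available in standard Python libraries (ICC(3,1) can be obtained from, e.g., pingouin's intraclass_corr); the ratings below are made up for illustration:

import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Made-up ordinal ratings (1-5) from two raters on twelve subjects.
r1 = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 2, 4])
r2 = np.array([1, 2, 3, 3, 2, 3, 4, 5, 5, 4, 2, 4])

print("Cohen's kappa:  ", cohen_kappa_score(r1, r2))
print("linear kappa:   ", cohen_kappa_score(r1, r2, weights="linear"))
print("quadratic kappa:", cohen_kappa_score(r1, r2, weights="quadratic"))
print("Pearson's r:    ", pearsonr(r1, r2)[0])
print("Spearman's rho: ", spearmanr(r1, r2)[0])
print("Kendall's tau-b:", kendalltau(r1, r2)[0])  # scipy's default variant is tau-b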

