Investigating rater severity/leniency in interpreter performance testing

Interpreting, 2015, Vol. 17 (2), pp. 255-283. Author(s): Chao Han

Rater-mediated performance assessment (RMPA) is a critical component of interpreter certification testing systems worldwide. Given the acknowledged rater variability in RMPA and the high-stakes nature of certification testing, it is crucial to ensure rater reliability in interpreter certification performance testing (ICPT). However, a review of current ICPT practice indicates that rigorous research on rater reliability is lacking. Against this background, the present study reports on the use of multifaceted Rasch measurement (MFRM) to identify the degree of severity/leniency in different raters’ assessments of simultaneous interpretations (SIs) by 32 interpreters in an experimental setting. Nine raters specifically trained for the purpose were asked to evaluate four English-to-Chinese SIs by each of the interpreters, using three 8-point rating scales (information content, fluency, expression). The source texts differed in speed and in the speaker’s accent (native vs non-native). Rater-generated scores were then subjected to MFRM analysis, using the FACETS program. The following general trends emerged: 1) homogeneity statistics showed that not all raters were equally severe overall; and 2) bias analyses showed that a relatively large proportion of the raters had significantly biased interactions with the interpreters and the assessment criteria. Implications for practical rating arrangements in ICPT, and for rater training, are discussed.
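For readers unfamiliar with the model, a minimal sketch of one common MFRM parameterization for this design (interpreter ability θ_n, rater severity α_i, criterion difficulty β_j, rating-scale threshold τ_k; not necessarily the exact specification estimated in FACETS for this study) expresses the log-odds of interpreter n receiving category k rather than k−1 from rater i on criterion j as:

\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \alpha_i - \beta_j - \tau_k

A rater with a larger severity estimate α_i systematically depresses scores across all interpreters; the homogeneity and bias statistics mentioned above test, respectively, whether the α_i are equal and whether particular rater-by-interpreter or rater-by-criterion combinations depart from this additive model.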

Interpreting, 2016, Vol. 18 (2), pp. 225-252. Author(s): Chao Han, Helen Slatyer

Over the past decade, interpreter certification performance testing has gained momentum. Certification tests often involve high stakes, since they can play an important role in regulating access to professional practice and serve to provide a measure of professional competence for end users. The decision to award certification is based on inferences from candidates’ test scores about their knowledge, skills and abilities, as well as their interpreting performance in a given target domain. To justify the appropriateness of score-based inferences and actions, test developers need to provide evidence that the test is valid and reliable through a process of test validation. However, there is little evidence that test qualities are systematically evaluated in interpreter certification testing. In an attempt to address this problem, this paper proposes a theoretical argument-based validation framework for interpreter certification performance tests so as to guide testers in carrying out systematic validation research. Before presenting the framework, validity theory is reviewed, and an examination of the argument-based approach to validation is provided. A validity argument for interpreter tests is then proposed, with hypothesized validity evidence. Examples of evidence are drawn from relevant empirical work, where available. Gaps in the available evidence are highlighted and suggestions for research are made.


2017, Vol. 78 (3), pp. 430-459. Author(s): Iasonas Lamprianou

It is common practice for assessment programs to organize qualifying sessions during which the raters (often known as “markers” or “judges”) demonstrate their consistency before operational rating commences. Because of the high-stakes nature of many rating activities, the research community continuously explores new methods of analyzing rating data. We used simulated and empirical data from two high-stakes language assessments to propose a new approach, based on social network analysis and exponential random graph models, for evaluating the readiness of a group of raters for operational rating. The results of this innovative approach are compared with the results of a Rasch analysis, a well-established approach for the analysis of such data. We also demonstrate how the new approach can be used in practice to investigate important research questions, such as whether rater severity is stable across rating tasks. The merits of the new approach and its consequences for practice are discussed.
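The abstract does not reproduce the paper's network specification; as an illustrative sketch only (the scores, the eight raters and the 0.2 cutoff below are hypothetical), raters can be treated as nodes with edges weighted by pairwise exact agreement, producing the kind of whole-network statistics an exponential random graph model would then fit. ERGM estimation itself is typically done with R's ergm package; this sketch only builds the network.

import itertools
import numpy as np
import networkx as nx

# Hypothetical scores: rows = scripts, columns = raters (ordinal 1-6 scale).
rng = np.random.default_rng(0)
scores = rng.integers(1, 7, size=(50, 8))
raters = [f"R{i}" for i in range(scores.shape[1])]

# Agreement network: connect two raters when their exact-agreement rate
# across the scripts they both rated exceeds an (arbitrary) cutoff.
G = nx.Graph()
G.add_nodes_from(raters)
for (i, a), (j, b) in itertools.combinations(enumerate(raters), 2):
    agreement = float(np.mean(scores[:, i] == scores[:, j]))
    if agreement > 0.2:
        G.add_edge(a, b, weight=agreement)

# Whole-network statistics of the kind an ERGM would then model.
print(nx.density(G), nx.average_clustering(G))

A Rasch analysis of the same data would instead estimate a severity parameter per rater; comparing the two perspectives is the paper's contribution.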


2006, Vol. 28 (3), pp. 212-217. Author(s): Flávia de Lima Osório, José Alexandre de Souza Crippa, Sonia Regina Loureiro

OBJECTIVE: To present the translation and validation of the Brief Social Phobia Scale for use in Brazilian Portuguese, to develop a structured interview guide to systematize its use, and to perform a preliminary study of inter-rater reliability. METHOD: The instrument was translated and adapted to Portuguese by specialists in anxiety disorders and rating scales. A structured interview guide was created to cover all items of the instrument, grouping them into six categories. Specialists in mental health evaluated the guide. These professionals also watched videotaped interviews of patients with and without social anxiety disorder and, based on the interview guide, rated the scale so that its reliability could be evaluated. RESULTS: No semantic or linguistic adjustments were needed. For the complete scale, the overall evaluation showed a percentage of agreement of 0.84 and an intraclass correlation coefficient of 0.91. The mean inter-rater correlation was 0.84. CONCLUSIONS: The Portuguese-language version of the Brief Social Phobia Scale is available for use in the Brazilian population, with acceptable indicators of inter-rater reliability. The interview guide was useful in obtaining these values. Further studies are needed to improve reliability and to examine other psychometric properties of the instrument.
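The two headline statistics, percentage of exact agreement and mean inter-rater correlation, are straightforward to compute; the sketch below uses made-up ratings rather than the study's data:

import itertools
import numpy as np

# Hypothetical ratings: rows = videotaped interviews, columns = raters.
ratings = np.array([
    [3, 3, 4],
    [1, 1, 1],
    [4, 5, 4],
    [2, 2, 3],
    [0, 0, 0],
])

pairs = list(itertools.combinations(range(ratings.shape[1]), 2))

# Percentage of exact agreement, averaged over all rater pairs.
pct_agreement = np.mean([np.mean(ratings[:, i] == ratings[:, j]) for i, j in pairs])

# Mean inter-rater (Pearson) correlation over all rater pairs.
mean_r = np.mean([np.corrcoef(ratings[:, i], ratings[:, j])[0, 1] for i, j in pairs])

print(f"agreement = {pct_agreement:.2f}, mean inter-rater r = {mean_r:.2f}")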


2020, Vol. 13 (9), p. 94. Author(s): Xin Qu

The present study validated the ELT Certificate Lesson Observation and Report Task (ELTC-LORT), developed by China Language Assessment to certify China’s EFL teachers through performance-based testing. The ELT Certificate is high-stakes, given its impact on candidates’ recruitment, on ELT in China and on the quality of education, so its validation is crucial to guarantee fairness and justice. The validity of the task construct and rating rubric was examined using many-facet Rasch measurement supplemented with qualitative interviews. Participants (N = 40) were shown a video excerpt from a real EFL lesson and required to deliver a report on the teacher’s performance. Two raters graded the recordings of the candidates’ reports using rating scales developed to measure EFL teacher candidates’ oral English proficiency and their ability to analyze and evaluate teaching. Many-facet Rasch analysis yielded a successful estimation, with a noticeable spread among the participants on the measured traits, indicating that the task functioned well in measuring candidates’ performance and differentiating their ability levels. The raters showed good internal self-consistency but were not equally lenient. The rating scales worked well, with average measures advancing largely in line with Rasch expectations. Semi-structured and focus group interviews were conducted to shed light on the raters’ performance and the functioning of the rating scale items. The findings carry implications for further research on, and practice of, the Certificate.
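The pattern of raters being internally self-consistent yet unequally lenient has a simple numerical signature: scores that correlate highly across raters while differing by a roughly constant offset. A small simulation with hypothetical data illustrates it:

import numpy as np

# Two hypothetical raters who rank candidates almost identically (high
# self-consistency) but differ in leniency by a constant offset.
rng = np.random.default_rng(1)
true_ability = rng.normal(0, 1, 40)
rater_a = true_ability + rng.normal(0, 0.1, 40)          # baseline severity
rater_b = true_ability + 0.5 + rng.normal(0, 0.1, 40)    # half a point more lenient

r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"consistency r = {r:.2f}, mean difference = {np.mean(rater_b - rater_a):.2f}")
# High correlation with a nonzero mean difference: internally consistent
# raters who are not equally lenient; a many-facet Rasch model absorbs
# the offset into a per-rater severity parameter.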


PEDIATRICS, 2000, Vol. 105 (Supplement_2), pp. 238-241. Author(s): Elizabeth H. Morrison, Janet Palmer Hafler

Resident physicians spend numerous hours every week teaching medical students and fellow residents, and only rarely are they taught how to teach. They can, however, be taught to teach more effectively. Teaching skills improvement initiatives for residents are taking a more prominent place in the educational literature. Limited evidence now suggests that better resident teachers mean better academic performance by learners. A small but important body of research supports selected interventions designed to improve residents' teaching skills, but not all studies have demonstrated significant educational benefits for learners. An increasing number of valid and reliable instruments are available to assess residents' clinical teaching, including objective structured teaching examinations and rating scales. In all specialties, rigorous research in evidence-based teacher training for residents will help prepare academic medical centers to meet the diverse and changing learning needs of today's physicians-in-training.
Keywords: resident physicians, medical students, fellow residents, teaching, graduate medical education.


2019, Vol. 76 (4), pp. 1088-1093. Author(s): Nada Gawad, Amanda Fowler, Richard Mimeault, Isabelle Raiche

2019, Vol. 5 (1), e000541. Author(s): John Ressman, Wilhelmus Johannes Andreas Grooten, Eva Rasmussen Barr

Single leg squat (SLS) is a common tool used in clinical examination to set and evaluate rehabilitation goals, but also to assess lower extremity function in active people.
Objectives: To conduct a review and meta-analysis of the inter-rater and intrarater reliability of the SLS, including the lateral step-down (LSD) and forward step-down (FSD) tests.
Design: Review with meta-analysis.
Data sources: CINAHL, Cochrane Library, Embase, Medline (OVID) and Web of Science were searched up until December 2018.
Eligibility criteria: Studies were eligible for inclusion if they were methodological studies which assessed the inter-rater and/or intrarater reliability of the SLS, FSD and LSD through observation of movement quality.
Results: Thirty-one studies were included. Reliability varied largely between studies (inter-rater: kappa/intraclass correlation coefficients (ICC) = 0.00–0.95; intrarater: kappa/ICC = 0.13–1.00), but most studies reached ‘moderate’ measures of agreement. The pooled ICC/kappa results showed ‘moderate’ agreement for inter-rater reliability, 0.58 (95% CI 0.50 to 0.65), and ‘substantial’ agreement for intrarater reliability, 0.68 (95% CI 0.60 to 0.74). Subgroup analyses showed higher pooled agreement for inter-rater reliability with ≤3-point rating scales, while no difference was found for different numbers of segmental assessments.
Conclusion: Our findings indicate that the SLS test, including the FSD and LSD tests, can be suitable for clinical use regardless of the number of observed segments, particularly with a ≤3-point rating scale. Since most of the included studies were affected by some form of methodological bias, our findings must be interpreted with caution.
PROSPERO registration number: CRD42018077822.
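One common way to pool correlation-like reliability indices of the kind quoted above is to transform each study's coefficient to Fisher's z and apply a DerSimonian-Laird random-effects model. The sketch below makes illustrative assumptions throughout: the coefficients, sample sizes and the 1/(n − 3) variance approximation are not taken from the review.

import numpy as np

# Illustrative per-study reliability coefficients and sample sizes
# (not the review's data), pooled on Fisher's z scale.
coefs = np.array([0.45, 0.62, 0.58, 0.70, 0.51])
n = np.array([30, 55, 42, 25, 60])

z = np.arctanh(coefs)          # Fisher z transform
v = 1.0 / (n - 3)              # approximate within-study variances
w = 1.0 / v

# Between-study variance tau^2 (DerSimonian-Laird).
z_fixed = np.sum(w * z) / np.sum(w)
Q = np.sum(w * (z - z_fixed) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(z) - 1)) / c)

# Random-effects pooled estimate, back-transformed with a 95% CI.
w_re = 1.0 / (v + tau2)
z_pooled = np.sum(w_re * z) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))
lo, hi = np.tanh(z_pooled - 1.96 * se), np.tanh(z_pooled + 1.96 * se)
print(f"pooled = {np.tanh(z_pooled):.2f} (95% CI {lo:.2f} to {hi:.2f})")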


2020, Article No. 001316442093486. Author(s): Niklas Schulte, Heinz Holling, Paul-Christian Bürkner

Forced-choice questionnaires can prevent faking and other response biases typically associated with rating scales. However, the derived trait scores are often unreliable and ipsative, making interindividual comparisons in high-stakes situations impossible. Several studies suggest that these problems vanish if the number of measured traits is high. To determine the necessary number of traits under varying sample sizes, factor loadings, and intertrait correlations, simulations were performed for the two most widely used scoring methods, namely the classical (ipsative) approach and Thurstonian item response theory (IRT) models. Results demonstrate that while Thurstonian IRT models in particular perform well under ideal conditions, both methods yield insufficient reliabilities in most conditions resembling applied contexts. Moreover, not only the classical estimates but also the Thurstonian IRT estimates for questionnaires with equally keyed items remain (partially) ipsative, even when the number of traits is very high (i.e., 30). This result not only questions earlier assumptions regarding the use of classical scores in high-dimensional questionnaires, but also raises doubts about many validation studies on Thurstonian IRT models, because correlations of (partially) ipsative scores with external criteria cannot be interpreted in the usual way.
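Ipsativity under classical scoring is easy to demonstrate: because each forced-choice block distributes a fixed number of points among the traits it measures, every respondent's trait scores sum to the same constant, so the total carries no interindividual information. A minimal sketch with a hypothetical block structure (not the paper's simulation design):

import numpy as np

# Classical scoring of a fully forced-choice questionnaire: each block
# asks the respondent to rank-order one item per trait, and the ranks
# are awarded as points to the matching traits.
rng = np.random.default_rng(2)
n_persons, n_traits, n_blocks = 5, 4, 12

scores = np.zeros((n_persons, n_traits))
for _ in range(n_blocks):
    ranks = np.array([rng.permutation(n_traits) + 1 for _ in range(n_persons)])
    scores += ranks

# Every row sums to n_blocks * (1 + 2 + ... + n_traits) = 120: the total
# is identical for all respondents, so the trait scores are ipsative.
print(scores.sum(axis=1))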


Author(s): Alexandra de Raadt, Matthijs J. Warrens, Roel J. Bosker, Henk A. L. Kiers

Kappa coefficients are commonly used for quantifying reliability on a categorical scale, whereas correlation coefficients are commonly applied to assess reliability on an interval scale. Both types of coefficients can be used to assess the reliability of ordinal rating scales. In this study, we compare seven reliability coefficients for ordinal rating scales: the kappa coefficients included are Cohen’s kappa, linearly weighted kappa, and quadratically weighted kappa; the correlation coefficients included are intraclass correlation ICC(3,1), Pearson’s correlation, Spearman’s rho, and Kendall’s tau-b. The primary goal is to provide a thorough understanding of these coefficients so that applied researchers can make a sensible choice for ordinal rating scales. A second aim is to find out whether the choice of coefficient matters. We studied to what extent the same conclusions about inter-rater reliability are reached with different coefficients, and to what extent the coefficients measure agreement in a similar way, using analytic methods as well as simulated and empirical data. Analytically, it is shown that differences between quadratic kappa and the Pearson and intraclass correlations increase as agreement becomes larger; differences between the three coefficients are generally small if differences between rater means and variances are small. Furthermore, the simulated and empirical data show that differences between all reliability coefficients tend to increase as agreement between the raters increases. Moreover, for the data in this study, the same conclusion about inter-rater reliability was reached in virtually all cases with the four correlation coefficients, and quadratically weighted kappa led to a similar conclusion as the correlation coefficients in a great number of cases. Hence, for the data in this study, it does not really matter which of these five coefficients is used. The four correlation coefficients and quadratically weighted kappa also tend to measure agreement in a similar way: their values are very highly correlated for the data in this study.
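Six of the seven coefficients compared here are available in standard Python libraries (ICC(3,1) can be obtained from, e.g., pingouin's intraclass_corr); the ratings below are made up for illustration:

import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Made-up ordinal ratings (1-5) from two raters on twelve subjects.
r1 = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 2, 4])
r2 = np.array([1, 2, 3, 3, 2, 3, 4, 5, 5, 4, 2, 4])

print("Cohen's kappa:  ", cohen_kappa_score(r1, r2))
print("linear kappa:   ", cohen_kappa_score(r1, r2, weights="linear"))
print("quadratic kappa:", cohen_kappa_score(r1, r2, weights="quadratic"))
print("Pearson's r:    ", pearsonr(r1, r2)[0])
print("Spearman's rho: ", spearmanr(r1, r2)[0])
print("Kendall's tau-b:", kendalltau(r1, r2)[0])  # scipy's default variant is tau-b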

