Setting a standard for low reading proficiency: A comparison of the bookmark procedure and constrained mixture Rasch model

Tabea Feseker; Timo Gnambs; Cordula Artelt

doi:10.1371/journal.pone.0257871

PLoS ONE ◽

10.1371/journal.pone.0257871 ◽

2021 ◽

Vol 16 (11) ◽

pp. e0257871

Author(s):

Tabea Feseker ◽

Timo Gnambs ◽

Cordula Artelt

Keyword(s):

Rasch Model ◽

Panel Study ◽

External Validation ◽

Reading Proficiency ◽

Standard Setting ◽

Cut Scores ◽

Model Based ◽

Constrained Mixture ◽

Cut Score ◽

Efficient Alternative

In order to draw pertinent conclusions about persons with low reading skills, it is essential to use validated standard-setting procedures by which they can be assigned to their appropriate level of proficiency. Since there is no standard-setting procedure without weaknesses, external validity studies are essential. Traditionally, studies have assessed validity by comparing different judgement-based standard-setting procedures. Only a few studies have used model-based approaches for validating judgement-based procedures. The present study addressed this shortcoming and compared agreement of the cut score placement between a judgement-based approach (i.e., Bookmark procedure) and a model-based one (i.e., constrained mixture Rasch model). This was performed by differentiating between individuals with low reading proficiency and those with a functional level of reading proficiency in three independent samples of the German National Educational Panel Study that included students from the ninth grade (N = 13,897) as well as adults (Ns = 5,335 and 3,145). The analyses showed quite similar mean cut scores for the two standard-setting procedures in two of the samples, whereas the third sample showed more pronounced differences. Importantly, these findings demonstrate that model-based approaches provide a valid and resource-efficient alternative for external validation, although they can be sensitive to the ability distribution within a sample.

Standardsetting av læringsstøttende prøver i engelsk for Vg1

Acta Didactica Norge ◽

10.5617/adno.6281 ◽

2018 ◽

Vol 12 (4) ◽

pp. 15

Author(s):

Eli Moe ◽

Hildegunn Lahlum Helness ◽

Craig Grocott ◽

Norman Verhelst

Keyword(s):

Standard Setting ◽

Good Knowledge ◽

Cut Scores ◽

Listening Tests ◽

Cut Score ◽

Margin Of Error ◽

The Common ◽

11Th Grade ◽

Different Levels ◽

Level Of Agreement

English title: Standard setting for English tests for 11th grade students in Norway

Abstract: This article presents the process used to determine the cut scores between three levels of the Common European Framework of Reference for Languages (A2, B1 and B2) for two English listening tests taken by Norwegian pupils in the 11th grade (Vg1). The aim was to establish whether agreement can be reached on cut scores and whether the standard setters received enough preparation before the event. Another aim was to examine the potential consequences the cut scores would have for the distribution of pupils across the different levels. The standard setting took place using pilot data from 3199 pupils, the Cito method and 16 panel members with a good knowledge of the framework levels. Some panel members were or had been 10th or 11th grade English teachers. The Cito method worked well for establishing cut scores with which the panel members mostly agreed. The results indicated a small margin of error. The results showed a higher level of agreement for the cut score between B1 and B2 than between A2 and B1, possibly because more preparation time was dedicated to B1 and B2. Teachers on the panel with good knowledge of the pupil group believe that the consequences these cut scores have for the distribution of pupils across the framework levels agree with their own assessment of the pupils' listening skills.

Keywords: standard setting, test-centered method, the Cito method, standard, cut score, borderline person / minimally competent user

Cut-Score Operating Function Extensions: Penalty-Based Errors and Uncertainty in Standard Settings

Applied Psychological Measurement ◽

10.1177/01466216211046896 ◽

2021 ◽

pp. 014662162110468

Author(s):

Irina Grabovsky ◽

Jesse Pace ◽

Christopher Runyon

Keyword(s):

Standard Setting ◽

Optimal Choice ◽

Combined Effects ◽

Cut Scores ◽

Online Application ◽

Cut Score ◽

Classification Errors

We model pass/fail examinations aiming to provide a systematic tool to minimize classification errors. We use the method of cut-score operating functions to generate specific cut-scores on the basis of minimizing several important misclassification measures. The goal of this research is to examine the combined effects of a known distribution of examinee abilities and uncertainty in the standard setting on the optimal choice of the cut-score. In addition, we describe an online application that allows others to utilize the cut-score operating function for their own standard settings.
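The trade-off described in this abstract can be illustrated with a minimal sketch. It is not the authors' cut-score operating function or their online application. It assumes standard-normal abilities, normally distributed measurement error with a hypothetical standard error `sem`, and a simple weighted sum of false-positive and false-negative rates as the loss. All function names and penalty weights are illustrative.

```python
import math

def norm_pdf(x, mu=0.0, sd=1.0):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def norm_cdf(x, mu=0.0, sd=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def error_rates(cut, standard, sem, grid=601, lo=-6.0, hi=6.0):
    """Approximate (false_positive, false_negative) rates for an observed-score cut.

    Assumption: true ability is N(0, 1) and the observed score is the true
    ability plus N(0, sem) error. The integral over ability is a plain
    Riemann sum on an evenly spaced grid.
    """
    step = (hi - lo) / (grid - 1)
    false_pos = false_neg = 0.0
    for i in range(grid):
        theta = lo + i * step
        weight = norm_pdf(theta) * step
        p_pass = 1.0 - norm_cdf(cut, mu=theta, sd=sem)
        if theta < standard:
            false_pos += weight * p_pass          # below standard but passes
        else:
            false_neg += weight * (1.0 - p_pass)  # meets standard but fails
    return false_pos, false_neg

def best_cut(standard, sem, fp_penalty=1.0, fn_penalty=1.0):
    """Grid-search the cut that minimizes the penalty-weighted error sum."""
    candidates = [c / 50.0 for c in range(-150, 151)]

    def loss(cut):
        fp, fn = error_rates(cut, standard, sem)
        return fp_penalty * fp + fn_penalty * fn

    return min(candidates, key=loss)
```

With equal penalties and a standard at the population mean, the selected cut stays close to the standard. Raising `fp_penalty` pushes the cut upward, because passing an examinee who is below the standard then costs more than failing one who meets it.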

Similarity of the cut score in test sets with different item amounts using the modified Angoff, modified Ebel, and Hofstee standard-setting methods for the Korean Medical Licensing Examination

Journal of Educational Evaluation for Health Professions ◽

10.3352/jeehp.2020.17.28 ◽

2020 ◽

Vol 17 ◽

pp. 28

Author(s):

Janghee Park ◽

Mi Kyoung Yim ◽

Na Jin Kim ◽

Duck Sun Ahn ◽

Young-Min Kim

Keyword(s):

Standard Setting ◽

Cut Scores ◽

The Past ◽

Score Difference ◽

Cut Score ◽

Medical Licensing ◽

Test Sets ◽

Licensing Examination ◽

Item Content ◽

Difficulty Index

Purpose: The Korea Medical Licensing Exam (KMLE) typically contains a large number of items. The purpose of this study was to investigate whether there is a difference in the cut score between evaluating all items of the exam and evaluating only some items when conducting standard-setting. Methods: We divided the item sets that appeared on the 3 most recent KMLEs (over the past 3 years) into 4 subsets of 25% each per year, based on their item content categories, discrimination index, and difficulty index. The entire panel of 15 members assessed all the items (360 items, 100%) of the year 2017. In split-half set 1, each item set contained 184 (51%) items of the year 2018, and each set from split-half set 2 contained 182 (51%) items of the year 2019, using the same method. We used the modified Angoff, modified Ebel, and Hofstee methods in the standard-setting process. Results: Less than a 1% cut score difference was observed when the same method was used to stratify item subsets containing 25%, 51%, or 100% of the entire set. When rating fewer items, higher rater reliability was observed. Conclusion: When the entire item set was divided into equivalent subsets, assessing the exam using a portion of the item set (90 out of 360 items) yielded similar cut scores to those derived using the entire item set. There was a higher correlation between panelists’ individual assessments and the overall assessments.

Comparison of validity of Bookmark and Angoff Standard Setting Methods in Medical performance tests

10.21203/rs.2.24421/v1 ◽

2020 ◽

Author(s):

Majid Yousefi Afrashteh

Keyword(s):

External Validity ◽

Performance Test ◽

Internal Validity ◽

Standard Setting ◽

Medical Laboratory ◽

Cut Scores ◽

Master's Students ◽

Criterion Score ◽

Cut Score ◽

Validity Indices

Abstract Introduction One of the main processes in evaluating students’ performance is standard setting to determine the passing score for the test. The purpose of this study was to compare the validity of the Angoff and bookmark methods in standard setting. Method Participants included 190 master’s students who had graduated in laboratory sciences within the past year. A performance test with 32 items, designed by a group of experts, was used to assess the laboratory skills of graduates of medical laboratory sciences. Moreover, two groups of experts voluntarily participated in this study to set the cut score. To assess process validity, a 5-item questionnaire was administered to the two groups of panelists. To investigate internal validity, the variance of the cut scores determined by the members of the two panels was compared using the F ratio. External validity was assessed using four indices of correlation with the criterion score. Results Comparison of the Angoff and bookmark methods showed that the mean of the process validity indices was higher for the bookmark method. To assess internal validity, the homogeneity of results and the agreement among judges' scores were considered. Conclusion In evaluating external validity (concordance of the cut score with the criterion score), all five external validity indices supported the bookmark method.

Standard-setting methodology: Establishing performance standards and setting cut-scores to assist score interpretation

Applied Physiology Nutrition and Metabolism ◽

10.1139/apnm-2015-0522 ◽

2016 ◽

Vol 41 (6 (Suppl. 2)) ◽

pp. S74-S82 ◽

Cited By ~ 12

Author(s):

Bruno D. Zumbo

Keyword(s):

Best Practices ◽

Test Score ◽

Test Validity ◽

Performance Standards ◽

Standard Setting ◽

Cut Scores ◽

Ordered Categories ◽

Fitness For Duty ◽

Score Interpretation ◽

Cut Score

A critical step in the development and use of tests of physical fitness for employment purposes (e.g., fitness for duty) is to establish 1 or more cut points, dividing the test score range into 2 or more ordered categories reflecting, for example, fail/pass decisions. Over the last 3 decades elaborated theories and methods have evolved focusing on the process of establishing 1 or more cut-scores on a test. This elaborated process is widely referred to as “standard-setting”. As such, the validity of the test score interpretation hinges on the standard-setting, which embodies the purpose and rules according to which the test results are interpreted. The purpose of this paper is to provide an overview of standard-setting methodology. The essential features, key definitions and concepts, and various novel methods of informing standard-setting will be described. The focus is on foundational issues with an eye toward informing best practices with new methodology. Throughout, a case is made that in terms of best practices, establishing a test standard involves, in good part, setting a cut-score and can be conceptualized as evidence/data-based policy making that is essentially tied to test validity and an evidential trail.

Less Subjectivity in Setting Cut Scores: A Novel Approach

Journal of Education and Vocational Research ◽

10.22610/jevr.v4i4.108 ◽

2013 ◽

Vol 4 (4) ◽

pp. 109-118

Author(s):

Jean Pierre Atanas

Keyword(s):

Standard Setting ◽

Cut Scores ◽

Policy Makers ◽

Certification Tests ◽

Novel Approach ◽

Cut Score ◽

Wide Range ◽

Assessment Techniques ◽

Passing Scores ◽

Educational Tests

Recently, standard-setting cut scores and assessment techniques have become major concerns for many institutions worldwide. A cut score separates one performance level from another. It differentiates between those who pass and those who fail. Cut scores may vary according to the recommendations of policy makers and stakeholders. Passing scores have been proposed by many methods for numerous types of tests, including certification tests and educational tests. Most of these standard-setting methods rely on panelists' subjectivity in ordering items by level of difficulty. This paper presents a simple approach to assessment that considerably reduces panelists' subjectivity. Items are classified into levels of difficulty rather than placed in increasing order of difficulty, as in most standard methods. This novel approach responds to three main criteria: practicality, wide range of applicability, and maximum agreement with the empirical data. Provisional and operational cut scores were derived and discussed.

Comparison of the validity of bookmark and Angoff standard setting methods in medical performance tests

BMC Medical Education ◽

10.1186/s12909-020-02436-3 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Majid Yousefi Afrashteh

Keyword(s):

External Validity ◽

Internal Validity ◽

Standard Setting ◽

Pass Rates ◽

Cut Scores ◽

Predictive Values ◽

Laboratory Science ◽

Criterion Score ◽

Cut Score ◽

Validity Indices

Abstract Background One of the main processes of determining the ability level at which a student should pass an assessment is standard setting. The current study aimed to compare the validity of the Angoff and bookmark methods in standard setting. Method 190 individuals with an M.Sc. degree in laboratory science participated in the study. A test with 32 items, designed by a group of experts, was used to assess the laboratory skills of the participants. Moreover, two groups, each containing 12 content specialists in laboratory sciences, voluntarily participated in the application of the Angoff and bookmark methods. To assess process validity, a 5-item questionnaire was administered to the two groups of panelists. To investigate internal validity, the classification agreement was calculated using the kappa and Fleiss’s kappa coefficients. External validity was assessed using five indices (correlation with the criterion score, specificity, sensitivity, and positive and negative predictive values relative to the criterion score). Results The obtained cut scores were 17.67 for Angoff and 18.8 for bookmark. The average total of the items related to the quality of the execution process was 4.25 for the Angoff group and 4.79 for the bookmark group. Pass rates for the Angoff and bookmark groups were 55.78% and 41.36%, respectively. Correlations of passing/failing between employer ratings and test scores were 0.69 and 0.88 for the Angoff and bookmark methods, respectively. Conclusion Based on the results, it can be concluded that the process and internal validities of the bookmark method were higher than those of the Angoff method. For evaluation of the external validity (concordance of the cut score with the criterion score), all five external validity indices supported the bookmark method.
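The classification-based validity indices named in this abstract (sensitivity, specificity, positive and negative predictive values) can be computed from pass/fail decisions and a criterion classification. The following is a minimal sketch, not the study's analysis code. The function names, the convention that a score at or above the cut passes, and the example data are all hypothetical.

```python
def classify(scores, cut):
    """Pass/fail decisions: a score at or above the cut counts as a pass (assumed convention)."""
    return [s >= cut for s in scores]

def validity_indices(test_pass, criterion_pass):
    """Compare cut-score decisions with a criterion classification (e.g. an external rating)."""
    pairs = list(zip(test_pass, criterion_pass))
    tp = sum(1 for t, c in pairs if t and c)          # passed and competent on criterion
    fp = sum(1 for t, c in pairs if t and not c)      # passed but not competent
    tn = sum(1 for t, c in pairs if not t and not c)  # failed and not competent
    fn = sum(1 for t, c in pairs if not t and c)      # failed but competent

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "pass_rate": ratio(tp + fp, len(pairs)),
    }

# Hypothetical data for illustration only; not taken from the study.
scores = [12, 15, 17, 18, 19, 21, 24, 25]
criterion = [False, False, True, False, True, True, True, True]
indices = validity_indices(classify(scores, cut=18), criterion)
```

Running the same function with two candidate cut scores on the same examinees gives a direct side-by-side comparison of how each cut agrees with the criterion.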

The Similarity of Bookmark Cut Scores With Different Response Probability Values

Educational and Psychological Measurement ◽

10.1177/0013164410395577 ◽

2011 ◽

Vol 71 (6) ◽

pp. 963-985 ◽

Cited By ~ 6

Author(s):

Adam E. Wyse

Keyword(s):

Large Scale ◽

Item Difficulty ◽

Response Probability ◽

Standard Setting ◽

State Testing ◽

Cut Scores ◽

Testing Program ◽

Cut Score ◽

Analytical Formulas ◽

Different Response

Standard setting is a method used to set cut scores on large-scale assessments. One of the most popular standard setting methods is the Bookmark method. In the Bookmark method, panelists are asked to envision a response probability (RP) criterion and move through a booklet of ordered items based on that RP criterion. This study investigates whether it is possible to end up with the same cut scores if one were to apply the Bookmark method with two different RP values. Analytical formulas and two hypothetical examples from a large-scale state testing program indicate that it is rarely possible to obtain the same cut score estimates with two different RP values because of the item difficulty gaps present when applying the procedure in practice. Results indicate that if the same group of panelists applied the Bookmark procedure as it is traditionally explained, then cut scores should be lower with the second chosen RP value than they were with the first RP value. This result holds whether the second RP value is higher or lower than the first RP value. The examples also reveal that differences in cut score estimates with different RP values can lead to changes in the percentage of examinees at or above the cut scores that may have important practical impacts.
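How an RP value enters a Bookmark cut score can be sketched as follows. This is a minimal illustration under the assumption of a Rasch model, where the ability at which an item with difficulty b is answered correctly with probability RP is b + ln(RP / (1 - RP)). It does not reproduce the analytical formulas of the study. The function names are hypothetical, and taking the RP location of the item at `bookmark_index` as the cut is only one of several conventions used in practice.

```python
import math

def rp_location(b, rp):
    """Ability at which a Rasch item of difficulty b is answered correctly with probability rp."""
    return b + math.log(rp / (1.0 - rp))

def bookmark_cut(difficulties, bookmark_index, rp=0.67):
    """Cut score on the ability scale for a bookmark placed in the ordered item booklet.

    Assumption: items are ordered by their RP location and the cut is the
    location of the item at bookmark_index (0-based).
    """
    ordered = sorted(rp_location(b, rp) for b in difficulties)
    return ordered[bookmark_index]

# Hypothetical item difficulties for illustration only.
difficulties = [-1.0, 0.5, 0.0, 1.2]
cut_rp50 = bookmark_cut(difficulties, bookmark_index=2, rp=0.50)
cut_rp67 = bookmark_cut(difficulties, bookmark_index=2, rp=0.67)
```

Under the Rasch assumption, changing the RP value shifts every item location by the same constant, so the item order is unchanged and an identical bookmark placement yields a different cut. Whether panelists can compensate by moving the bookmark depends on the gaps between adjacent item locations, which is the issue the abstract raises.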

Using the Angoff method to set a standard on mock exams for the Korean Nursing Licensing Examination

Journal of Educational Evaluation for Health Professions ◽

10.3352/jeehp.2020.17.14 ◽

2020 ◽

Vol 17 ◽

pp. 14

Author(s):

Mi Kyoung Yim ◽

Sujin Shin

Keyword(s):

Standard Setting ◽

Panel Survey ◽

Cut Scores ◽

Angoff Method ◽

Cut Score ◽

Competent Person ◽

Survey Results ◽

Licensing Examination ◽

High Level ◽

Subject Performance

Purpose: This study explored the possibility of using the Angoff method, in which panel experts determine the cut score of an exam, for the Korean Nursing Licensing Examination (KNLE). Two mock exams for the KNLE were analyzed. The Angoff standard setting procedure was conducted and the results were analyzed. We also aimed to examine the procedural validity of applying the Angoff method in this context. Methods: For both mock exams, we set a pass-fail cut score using the Angoff method. The standard setting panel consisted of 16 nursing professors. After the Angoff procedure, the procedural validity of establishing the standard was evaluated by investigating the responses of the standard setters. Results: The descriptions of the minimally competent person for the KNLE were presented at the levels of general and subject performance. The cut scores of first and second mock exams were 74.4 and 76.8, respectively. These were higher than the traditional cut score (60% of the total score of the KNLE). The panel survey showed very positive responses, with scores higher than 4 out of 5 points on a Likert scale. Conclusion: The scores calculated for both mock tests were similar, and were much higher than the existing cut scores. In the second simulation, the standard deviation of the Angoff rating was lower than in the first simulation. According to the survey results, procedural validity was acceptable, as shown by a high level of confidence. The results show that determining cut scores by an expert panel is an applicable method.
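The basic Angoff computation can be sketched in a few lines. This is an illustration of the common textbook form of the method, not the procedure used in the study: each panelist estimates, for every item, the probability that a minimally competent examinee answers correctly. A panelist's cut is the sum of those estimates, and the panel cut is the mean across panelists. The function name and the ratings are hypothetical.

```python
from statistics import mean, stdev

def angoff_cut(ratings):
    """Panel cut score from Angoff ratings.

    ratings[p][i] is panelist p's estimated probability that a minimally
    competent examinee answers item i correctly.
    """
    panelist_cuts = [sum(r) for r in ratings]  # expected raw score per panelist
    n_items = len(ratings[0])
    cut = mean(panelist_cuts)
    return {
        "cut_raw": cut,
        "cut_percent": 100.0 * cut / n_items,
        "panelist_sd": stdev(panelist_cuts) if len(panelist_cuts) > 1 else 0.0,
    }

# Hypothetical ratings from three panelists on four items.
ratings = [
    [0.8, 0.6, 0.7, 0.9],
    [0.7, 0.7, 0.6, 0.8],
    [0.9, 0.5, 0.7, 0.9],
]
result = angoff_cut(ratings)
```

The standard deviation across panelists gives a rough indication of how closely the panel agrees, which is the kind of quantity the abstract compares between the first and second mock exams.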

Comparing the cut score for the borderline group method and borderline regression method with norm-referenced standard setting in an objective structured clinical examination in medical school in Korea

Journal of Educational Evaluation for Health Professions ◽

10.3352/jeehp.2021.18.25 ◽

2021 ◽

Vol 18 ◽

pp. 25

Author(s):

Song Yi Park ◽

Sang-Hwa Lee ◽

Min-Jeong Kim ◽

Ki-Hwan Ji ◽

Ji Ho Ryu

Keyword(s):

Clinical Examination ◽

Standard Setting ◽

Regression Method ◽

Objective Structured Clinical Examination ◽

Group Method ◽

Global Rating ◽

Cut Scores ◽

Cut Score ◽

Significant Difference ◽

Fourth Year Medical Students

Purpose: Setting standards is critical in health professions. However, appropriate standard setting methods are not always applied to set the cut score in performance assessment. The aim of this study was to compare the cut score when the standard setting is changed from the norm-referenced method to the borderline group method (BGM) and borderline regression method (BRM) in an objective structured clinical examination (OSCE) in medical school. Methods: This was an explorative study to model the BGM and BRM. A total of 107 fourth-year medical students attended the OSCE on 15 July 2021, at seven stations involving encounters with standardized patients (SPs) and one station involving skills performed on a manikin. Thirty-two physician examiners evaluated the performance by completing a checklist and global rating scales. Results: The cut score of the norm-referenced method was lower than that of the BGM (p<0.01) and BRM (p<0.02). There was no significant difference in the cut score between the BGM and BRM (p=0.40). The station with the highest standard deviation and the highest proportion of the borderline group showed the largest cut score difference in standard setting methods. Conclusion: Prefixed cut scores by the norm-referenced method without considering station contents or examinee performance can vary due to station difficulty and content, affecting the appropriateness of standard setting decisions. If there is an adequate consensus on the criteria for the borderline group, standard setting with the BRM could be applied as a practical and defensible method to determine the cut score for OSCE.
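The two examinee-centered methods compared in this abstract can be sketched for a single station. This is a minimal illustration of their usual definitions, not the study's analysis: the BGM takes the mean checklist score of examinees whose global rating is "borderline", and the BRM regresses checklist scores on global ratings and reads off the predicted score at the borderline rating. The function names, the 1 to 5 global rating scale with 2 as borderline, and the data are hypothetical.

```python
from statistics import mean

def borderline_group_cut(checklist, global_rating, borderline=2):
    """BGM: mean checklist score of examinees rated borderline."""
    group = [s for s, g in zip(checklist, global_rating) if g == borderline]
    if not group:
        raise ValueError("no examinee received the borderline rating")
    return mean(group)

def borderline_regression_cut(checklist, global_rating, borderline=2):
    """BRM: ordinary least squares of checklist score on global rating,
    evaluated at the borderline rating."""
    x_bar, y_bar = mean(global_rating), mean(checklist)
    sxx = sum((x - x_bar) ** 2 for x in global_rating)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(global_rating, checklist))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept + slope * borderline

# Hypothetical station data for illustration only.
global_rating = [1, 2, 2, 3, 4, 5]
checklist = [40, 48, 52, 60, 70, 80]
bgm = borderline_group_cut(checklist, global_rating)
brm = borderline_regression_cut(checklist, global_rating)
```

The BRM uses every examinee's data rather than only the borderline group, so it still produces a cut when few examinees are rated borderline, whereas the BGM in this sketch raises an error when that group is empty.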
