A great deal of research (for example, Pill & Smart, 2020; Attali, 2016; Kuiken & Vedder, 2014) examines language test rater behavior and consistency, while complementary research (for example, Davis, 2016; Sato, 2014) focuses on rater perceptions of their training activities, their operational rating processes, and how the two interrelate. However, most of this research focuses on raters working in a single language, often English. Far less research investigates rater perceptions of rater training and operational rating processes, including both challenges and successes, for raters working with a shared scale and a similar test across languages. Borger (2019) and Harsch and Malone (2020) also suggest that raters and other stakeholders can provide critical information not only about how scales are applied but also about the difficulties in applying them. Thus, this study asks: do raters using the same scale have similar or different experiences with rater training and rating across languages? How can such information support training, operational testing, and even changes to rating scales?
The current study examines research conducted with raters of constructed-response tasks from a multi-language, large-scale assessment administered to over 100,000 learners annually in 11 languages, including Chinese (Mandarin), English, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish. The study first analyzes the results of a short questionnaire sent to speaking and writing raters (N=65) to examine their perceptions of rater training as well as what they attend to in operational rating, including task level and task function. The study also analyzes the outcomes of short interviews with a subset of raters (N=25) to shed light on the challenges of rating, both in applying the scale generally and in applying the training and the scale to specific languages.
The presentation will specifically focus on how rater perceptions can inform and improve rater training approaches, exercises, and activities. It will also show connections between rater training and operational rating. Additionally, the presentation will identify ways that rater perceptions and recommendations across languages can strengthen rating approaches both globally and within each language.
Beyond applications to rating and rater training, the study and its results will provide an opportunity to reflect on how raters interpret the scale and how it can be improved for clarity and accessibility across languages.
Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99-115.
Borger, L. (2019). Assessing interactional skills in a paired speaking test: Raters' interpretation of the construct. Apples – Journal of Applied Language Studies, 13(1), 151-174.
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117-135.
Harsch, C., & Malone, M. E. (2020). Language proficiency frameworks and scales. In P. Winke & T. Brunfaut (Eds.), The Routledge handbook of second language acquisition and language testing (pp. 33-44). Routledge.
Kuiken, F., & Vedder, I. (2014). Raters' decisions, rating procedures and rating scales. Language Testing, 31(3), 279-284.
Pill, J., & Smart, C. (2020). Raters: Behavior and training. In P. Winke & T. Brunfaut (Eds.), The Routledge handbook of second language acquisition and language testing (pp. 135-144). Routledge.
Sato, M. (2014). Exploring the construct of interactional oral fluency: Second language acquisition and language testing approaches. System, 45, 79-88.