Tracing L2 development through single samples collected longitudinally, which are often rated on quantitative complexity, accuracy, and fluency (CAF) measures, is a classic approach in Complex Dynamic Systems Theory (CDST) research. The reliability of such single samples, however, has been questioned by L2 assessment research using Generalizability Theory (GT) (e.g., Schoonen, 2005). Wu et al. (2022) therefore used GT to test the reliability of carefully restricted single-task assessments rated on five CAF measures, and found that reliability differed substantially across the measures.
This finding motivated the current study, which assesses the reliability of CAF measures commonly used in assessing L2 (English) speaking. To this end, we searched Web of Science for L2 studies on English oral production published between 2016 and 2021, from which we selected 57 quantitative CAF measures that were used in more than two articles with non-overlapping authors. The 57 measures were examined in a GT analysis of 275 recordings collected from 55 Chinese learners of English, each of whom individually performed five oral tasks on different topics back to back.
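To illustrate the kind of GT analysis involved, the following is a minimal sketch, assuming a one-facet crossed design (persons by tasks) with one score per cell and ANOVA-based variance component estimation. The design and estimation method are assumptions for illustration, as are the function names `g_study` and `g_coefficient`; the toy scores are invented and are not data from the study.

```python
from statistics import mean


def g_study(scores):
    """Estimate variance components for a persons x tasks design.

    `scores` is a list of rows: one row per person, one column per task,
    holding the values of a single CAF measure (hypothetical layout).
    """
    n_p, n_t = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    task_means = [mean(row[j] for row in scores) for j in range(n_t)]

    # Sums of squares for a two-way layout without replication.
    ss_p = n_t * sum((m - grand) ** 2 for m in person_means)
    ss_t = n_p * sum((m - grand) ** 2 for m in task_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_t

    ms_p = ss_p / (n_p - 1)
    ms_t = ss_t / (n_t - 1)
    ms_res = max(ss_res / ((n_p - 1) * (n_t - 1)), 0.0)

    # Expected-mean-square solutions; negative estimates are set to zero.
    return {
        "person": max((ms_p - ms_res) / n_t, 0.0),
        "task": max((ms_t - ms_res) / n_p, 0.0),
        "residual": ms_res,  # person-by-task interaction confounded with error
    }


def g_coefficient(components, n_tasks=1):
    """Relative G coefficient for a score averaged over `n_tasks` tasks."""
    relative_error = components["residual"] / n_tasks
    return components["person"] / (components["person"] + relative_error)


# Invented toy data: 4 persons x 3 tasks for one measure.
toy_scores = [
    [2.1, 2.4, 1.9],
    [3.0, 3.3, 2.6],
    [1.5, 2.2, 1.8],
    [2.8, 2.7, 3.1],
]
toy_components = g_study(toy_scores)
print(toy_components, g_coefficient(toy_components, n_tasks=1))
```

With `n_tasks=1`, the coefficient describes how dependable a single sample is for rank-ordering learners on that measure, which is the situation of a longitudinal study that collects one sample per data point.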
Results from the GT analysis show the impact of task topic on L2 oral performance (see also Benton et al., 1995; Yang et al., 2015) and shed light on the reliability of quantitative CAF measures (Wu et al., 2022). For CDST studies relying on single samples collected longitudinally, the results indicate which CAF measures have high reliability, i.e., are stable at a given moment in time, and can therefore be used to distinguish L2 development from other kinds of variability. When tracing development on low-reliability CAF measures (e.g., mean number of modifiers per noun phrase), on the other hand, it would be necessary to collect multiple samples at each data point and to compare the variability within and between data points.
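The question of how many samples a low-reliability measure would need at each data point can be sketched as a simple decision-study projection. The sketch below assumes that a person variance component and a residual variance component are already available from a GT analysis; the function names, the example variance values, and the 0.8 target are illustrative assumptions, not values reported in the study.

```python
def projected_g(var_person, var_residual, n_samples):
    """Relative G coefficient when scores are averaged over `n_samples` samples."""
    return var_person / (var_person + var_residual / n_samples)


def samples_needed(var_person, var_residual, target=0.8, max_samples=1000):
    """Smallest number of samples per data point whose projected G reaches `target`.

    `target` is a hypothetical threshold chosen for illustration.
    """
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    if var_person <= 0:
        raise ValueError("person variance must be positive")
    for n in range(1, max_samples + 1):
        if projected_g(var_person, var_residual, n) >= target:
            return n
    raise ValueError("target not reached within max_samples")


# Invented variance components for two hypothetical measures.
print(samples_needed(var_person=9.0, var_residual=1.0))  # mostly person variance
print(samples_needed(var_person=1.0, var_residual=3.0))  # mostly residual variance
```

A measure whose variance is mostly attributable to persons reaches the target with few samples, whereas a measure dominated by residual variance requires more samples per data point before changes over time can be separated from task-related variability.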
Keywords: Complex Dynamic Systems Theory; Generalizability Theory; complexity, accuracy, fluency; L2 English speaking; task topic
References
Benton, S. L., Sharp, J. M., Corkill, A. J., Downey, R. G., & Khramtsova, I. (1995). Knowledge, interest, and narrative writing. Journal of Educational Psychology, 87, 66–79.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22(1), 1–30. https://doi.org/10.1191/0265532205lt295oa
Wu, Y., Steinkrauss, R., & Lowie, W. (2022). The reliability of single task assessment in longitudinal L2 writing research [Manuscript submitted for publication]. Department of Applied Linguistics, University of Groningen.
Yang, W., Lu, X., & Weigle, S. C. (2015). Different topics, different discourse: Relationships among writing topic, measures of syntactic complexity, and judgments of writing quality. Journal of Second Language Writing, 28, 53–67. https://doi.org/10.1016/j.jslw.2015.02.002