RELIABILITY OF AI IN FOREIGN LANGUAGE SPEAKING ASSESSMENT: COMPARING AUTOMATED AND HUMAN SCORING AMONG UNDERGRADUATE IT STUDENTS IN KAZAKHSTAN
DOI: https://doi.org/10.59787/2413-5488----%25p

Abstract
The integration of Artificial Intelligence (AI) into language assessment, particularly the evaluation of speaking skills, has created opportunities for greater consistency, efficiency, and scalability in educational contexts. This paper examines the reliability of AI-assisted speaking assessment relative to human-mediated evaluation, focusing on inter-rater and intra-rater reliability in English as a Foreign Language (EFL) learning. It explores the strengths of AI in automated scoring, such as its capacity for standardization, alongside challenges related to validity, bias, and the interpretability of results, and it reviews discrepancies between human and AI scores arising from subjective judgment and limitations in model training. The findings underscore the need for standardized rubrics, rater training, and AI model calibration to improve reliability. The paper concludes by proposing a hybrid assessment framework in which AI complements human raters, supported by methodological and technical improvements in speech recognition and natural language processing. This approach aims to optimize speaking proficiency evaluation while maintaining fairness and educational integrity.
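To make the inter-rater reliability question concrete, the sketch below computes agreement between one human rater and one AI system using Cohen's kappa. This is a minimal, hypothetical illustration: the 1–9 band scale, the example scores, and the choice of a quadratically weighted kappa are assumptions made for demonstration, not the study's actual data or analysis.

```python
# Minimal sketch: quantifying AI-vs-human inter-rater agreement with
# Cohen's kappa. Assumes ordinal band scores (e.g., an IELTS-style 1-9
# scale); the scores below are illustrative, not data from the study.
from sklearn.metrics import cohen_kappa_score

human_scores = [6, 7, 5, 8, 6, 7, 6, 5, 7, 8]  # hypothetical human ratings
ai_scores    = [6, 6, 5, 8, 7, 7, 6, 6, 7, 7]  # hypothetical AI ratings

# Unweighted kappa treats every disagreement as equally severe.
kappa = cohen_kappa_score(human_scores, ai_scores)

# Quadratic weights penalize large band gaps more than near-misses,
# which usually suits ordinal speaking scales better.
weighted_kappa = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")

print(f"Cohen's kappa:                 {kappa:.3f}")
print(f"Quadratically weighted kappa:  {weighted_kappa:.3f}")
```

As a rough convention, kappa values above about 0.80 are read as strong agreement, though published cut-offs vary; weighted kappa is often preferred for speaking bands because a one-band disagreement matters less than a three-band one.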