Reliability of AI in Foreign Language Speaking Assessment: Comparing Automated and Human Scoring Among Undergraduate IT Students in Kazakhstan

Authors

  • Zhibek Tleshova, Astana IT University
  • Zhanar Tusselbayeva, Astana IT University
  • Aelita Ichshanova, Astana IT University
  • Aigerim Urazbekova, Astana IT University
  • Meruyert Zhenisbayeva, Astana IT University
  • Ali Orymbayev, Astana International University (AIU)

DOI:

https://doi.org/10.59787/2413-5488----%p

Abstract

The integration of Artificial Intelligence (AI) into language assessment, particularly the evaluation of speaking skills, has introduced opportunities for greater consistency, efficiency, and scalability in educational contexts. This paper examines the reliability of AI-assisted speaking assessment compared with human-mediated evaluation, focusing on inter-rater and intra-rater reliability in English as a Foreign Language (EFL) learning. It explores the strengths and limitations of AI in automated scoring, such as its capacity for standardization, alongside challenges related to validity, bias, and the interpretability of results, and it reviews discrepancies between human and AI scoring that arise from subjective judgment and training limitations. The study emphasizes the need for standardized rubrics, rater training, and AI model calibration to enhance reliability. The paper concludes by proposing a hybrid assessment framework in which AI complements human raters, supported by methodological and technical improvements in speech recognition and natural language processing. This approach aims to optimize speaking proficiency evaluations while maintaining fairness and educational integrity.
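To make the inter-rater reliability framing concrete, the sketch below computes quadratically weighted Cohen's kappa between paired human and AI band scores, the kind of chance-corrected agreement statistic this line of research relies on (see McHugh, 2012). This is a minimal illustration only: the scores, the 9-point scale, and the function name are hypothetical and are not data or code from the study.

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with quadratic weights for ordinal rating scales.

    Measures chance-corrected agreement between two raters (e.g., a
    human examiner and an AI scorer) over the same performances.
    """
    n = len(rater_a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # Observed joint proportion of each (rater_a, rater_b) score pair.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[idx[a]][idx[b]] += 1.0 / n

    # Marginal score frequencies give the agreement expected by chance.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)

    num = den = 0.0
    for i, ci in enumerate(categories):
        for j, cj in enumerate(categories):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * observed[i][j]
            den += w * (freq_a[ci] / n) * (freq_b[cj] / n)
    return 1.0 - num / den

# Hypothetical paired band scores on a 9-point scale (illustration only).
human_scores = [6, 7, 5, 8, 6, 7, 6, 5, 7, 8]
ai_scores    = [6, 7, 6, 8, 5, 7, 7, 5, 6, 8]
print(quadratic_weighted_kappa(human_scores, ai_scores, list(range(1, 10))))
```

Under common rules of thumb, weighted kappa values above roughly 0.8 are read as strong agreement, although interpretation thresholds vary across fields (McHugh, 2012).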

Author Biographies

  • Zhibek Tleshova, Astana IT University

    Candidate of Pedagogical Sciences, Associate Professor, Astana IT University, e-mail: zhibek.tleshova@astanait.edu.kz, ORCID 0000-0001-5095-5436 (corresponding author)

  • Zhanar Tusselbayeva, Astana IT University

    Candidate of Pedagogical Sciences, Associate Professor, Astana IT University, e-mail: zhanar.tusselbayeva@astanait.edu.kz, ORCID 0000-0002-0832-7898

  • Aelita Ichshanova, Astana IT University

    Master of Arts, Senior Lecturer, Astana IT University, e-mail: aelita.ichshanova@astanait.edu.kz, ORCID 0000-0003-4099-855X

  • Aigerim Urazbekova, Astana IT University

    MSc in TESOL, Senior Lecturer, Astana IT University, e-mail: aigerim.urazbekova@astanait.edu.kz, ORCID 0000-0002-5641-0303

  • Meruyert Zhenisbayeva, Astana IT University

    MA in Foreign Philology, Senior Lecturer, Astana IT University, e-mail: meruyert.zhenisbayeva@astanait.edu.kz, ORCID 0000-0002-4858-3394

  • Ali Orymbayev, Astana International University (AIU)

    Master's student in Computer Engineering and Software, Astana International University (AIU), e-mail: phigadamer@proton.me, ORCID 0009-0003-0166-5653

References

Acharya, A. S., Prakash, A., Saxena, P., & Nigam, A. (2013). Sampling: Why and how of it. Indian Journal of Medical Specialties, 4(2), 330–333. https://doi.org/10.7713/ijms.2013.0032

Bogorevich, V. (2018). Native and non-native raters of L2 speaking performance: Accent familiarity and cognitive processes (Publication No. 10821820) [Doctoral dissertation, Northern Arizona University]. ProQuest Dissertations & Theses.

Chen, J., Lai, P., Chan, A., Man, V., & Chan, C. H. (2022). AI-assisted enhancement of student presentation skills: Challenges and opportunities. Sustainability, 15(1), 196.

Dogan, C. D., & Uluman, M. (2017). A comparison of rubrics and graded category rating scales with various methods regarding raters’ reliability. Educational Sciences: Theory and Practice, 17(2), 631–651. https://doi.org/10.12738/estp.2017.2.0321

Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221. https://doi.org/10.1207/s15434311laq0203_2

Eckes, T. (2009). Many-facet Rasch measurement.

Ercikan, K., & McCaffrey, D. F. (2022). Optimizing implementation of artificial-intelligence-based automated scoring: An evidence-centered design approach for designing assessments for AI-based scoring. Journal of Educational Measurement, 59(3). https://doi.org/10.1111/jedm.12332

Hardré, P. L. (2014). Checked Your Bias Lately? Reasons and Strategies for Rural Teachers to Self-Assess for Grading Bias. Rural Educator, 35(2), n2.

He, H., Zou, B., & Du, Y. (2024, May 13). Bridging the Gap: Linking AI Technology Acceptance to Actual Improvements in EAP Learners' Speaking Skills. https://doi.org/10.31219/osf.io/syb62

International English Language Testing System. (2007). IELTS handbook 2007. https://www.ielts-writing.info/EXAM/docs/IELTS_Handbook_2007.pdf

Junaidi, J. (2020). Artificial intelligence in EFL context: Rising students' speaking performance with Lyra virtual assistance. International Journal of Advanced Science and Technology, 29(5), 6735–6741.

Limgomolvilas, S., & Sukserm, P. (2025). Examining rater reliability when using an analytical rubric for oral presentation assessments. LEARN Journal: Language Education and Acquisition Research Network, 18(1), 110–134. https://doi.org/10.70730/JQGY9980

Makhlouf, M. K. I. (2021). Effect of artificial intelligence-based application on Saudi preparatory-year students' EFL speaking skills at Albaha University. International Journal of English Language Education, 9(2), 1–25. https://doi.org/10.5296/ijele.v9i2.18782

McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276–282. https://doi.org/10.11613/BM.2012.031

Moser, A. (2020). Written corrective feedback: The role of learner engagement: A practical approach. Springer Cham. https://doi.org/10.1007/978-3-030-63994-5

Nichols, T. R., Wisner, P. M., Cripe, G., & Gulabchand, L. (2010). Putting the kappa statistic to use. The Quality Assurance Journal, 13(3–4), 57–61. https://doi.org/10.1002/qaj.481

Park, M. S. (2020). Rater effects on L2 oral assessment: Focusing on accent familiarity of L2 teachers. Language Assessment Quarterly, 17(3), 231–243. https://doi.org/10.1080/15434303.2020.1731752

Wang, J., & Luo, K. (2019). Evaluating rater judgments on ETIC Advanced writing tasks: An application of generalizability theory and many-facets Rasch model. Papers in Language Testing and Assessment, 8(2), 91–116.

Webb, S., Newton, J., & Chang, A. (2012). Incidental learning of collocation. Language Learning, 62(1), 91–120. https://doi.org/10.1111/j.1467-9922.2012.00729.x

Yesilyurt, Y. E. (2023). AI-enabled assessment and feedback mechanisms for language learning: Transforming pedagogy and learner experience. In G. Kartal (Ed.), Transforming the language teaching experience in the age of AI (pp. 25–43). IGI Global. https://doi.org/10.4018/978-1-6684-9893-4.ch002

Zapf, A., Castell, S., Morawietz, L., & Karch, A. (2016). Measuring inter-rater reliability for nominal data – which coefficients and confidence intervals are appropriate? BMC Medical Research Methodology, 16(1). https://doi.org/10.1186/s12874-016-0200-9

Zheng, C., Chen, X., Zhang, H., & Chai, C. S. (2024). Automated versus peer assessment: Effects on learners' English public speaking.

Zou, B., Liviero, S., Hao, M., & Wei, C. (2020). Artificial intelligence technology for EAP speaking skills: Student perceptions of opportunities and challenges. In M. R. Freiermuth & N. Zarrinabadi (Eds.), Technology and the psychology of second language learners and users (pp. 433–463). Palgrave Macmillan.

Published

2025-06-18

Issue

Forthcoming

Section

DIGITALIZATION OF HIGHER EDUCATION