Assessing Interpreting Performance Through Human and AI Evaluation: Validity, Reliability, and Pedagogical Implications


Published: May 12, 2026
Keywords: artificial intelligence (AI), AI-assisted assessment, interpreter assessment, student interpreting performance, healthcare interpreting, sight translation, consecutive interpreting, bidirectional dialogue interpreting, reliability, validity, practicability
Effrossyni Fragkou
Abstract

This paper examines the use of artificial intelligence (AI) in assessing student interpreting performance in examination settings. It draws on a corpus of 30 audio recordings produced by six students in a first-year healthcare interpreting course within an MA in Conference Interpreting in Canada. The tasks include sight translation (EN<>FR), consecutive interpreting (EN<>FR), and bidirectional medical dialogue interpreting in healthcare settings. Student renditions are compared with the original source texts, both audio and written, and evaluated against a pre-established assessment grid. The study compares human instructor assessment with AI-based assessment at two points: December 2024 to January 2025, during the mid-term examination period, and February 2026, which introduces a longitudinal dimension. Using a mixed-methods comparative design, it combines quantitative analysis of scoring patterns with qualitative analysis of convergences and divergences, focusing on accuracy, omissions, additions, distortions, and related assessment criteria. Findings suggest that human assessment better captures prosodic and interactional features, including pronunciation, intonation, rhythm, pausing, speaker attitude, pragmatic force, and hesitation. AI assessment appears relatively stronger in evaluating linguistic and textual dimensions, including content transfer, completeness, grammar, terminology, coherence, and cohesion. The paper also addresses anonymization, voice identifiability, AI use, validity, reliability, and bias.
