Assessing Interpreting Performance Through Human and AI Evaluation: Validity, Reliability, and Pedagogical Implications

Effrossyni Fragkou

doi:10.12681/ijltic.45426

Published: May 12, 2026

Keywords:

Artificial Intelligence (AI), AI-assisted assessment, interpreter assessment, student interpreting performance, healthcare interpreting, sight translation, consecutive interpreting, bidirectional dialogue interpreting, reliability, validity and practicability.

Effrossyni Fragkou

National and Kapodistrian University of Athens

Abstract

This paper examines the use of artificial intelligence (AI) in assessing student interpreting performance in examination settings. It draws on a corpus of 30 audio recordings produced by six students in a first-year healthcare interpreting course within an MA in Conference Interpreting in Canada. The tasks include sight translation (EN<>FR), consecutive interpreting (EN<>FR), and bidirectional medical dialogue in healthcare settings. Student renditions are compared with original source texts, both audio and written, and evaluated against a pre-established assessment grid. The study compares human instructor assessment with AI-based assessment at two points: December 2024-January 2025, during the mid-term examination period, and February 2026, introducing a longitudinal dimension. Using a mixed-methods comparative design, it combines quantitative analysis of scoring patterns with qualitative analysis of convergences and divergences, focusing on accuracy, omissions, additions, distortions, and related assessment criteria. Findings suggest that human assessment better captures prosodic and interactional features, including pronunciation, intonation, rhythm, pausing, speaker attitude, pragmatic force, and hesitation. AI assessment appears relatively stronger in evaluating linguistic and textual dimensions, including content transfer, completeness, grammar, terminology, coherence, and cohesion. The paper also addresses anonymization, voice identifiability, AI use, validity, reliability, and bias.

Article Details

How to Cite
Fragkou, E. (2026). Assessing Interpreting Performance Through Human and AI Evaluation: Validity, Reliability, and Pedagogical Implications. International Journal of Language, Translation and Intercultural Communication, 11, 114–153. https://doi.org/10.12681/ijltic.45426

Section
Articles

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
