Experimenting with Machine Interpreting in the PL-EN Language Pair: Are We (Getting) Close to “Human-Like” Quality?


Published: Jan 8, 2026
Keywords:
interpreting quality evaluation, machine interpreting, speech-to-speech AI interpreting
Tomasz Korybski
Wojciech Figiel
Małgorzata Tryuk
Michał Górnik
Abstract

Recent claims from technology companies suggesting that machine interpreting (MI) technology is approaching human-level quality remain largely unsubstantiated by ecologically valid empirical evidence. To address this gap, an experiment was conducted in June 2025 at the Institute of Applied Linguistics, University of Warsaw, comparing human simultaneous interpreting with two leading MI service providers in the Polish–English language pair. The experimental design simulated a real-life conference setting: a tandem of EU-accredited interpreters and two MA-level interpreting students worked alongside the two MI systems during a live event comprising an introductory speech in Polish, a 40-minute lecture in English, and a bi-directional Q&A session. Eleven student observers provided subjective perception data through an online survey (non-controlled), while recordings and transcripts served as the basis for a detailed error analysis. This paper focuses on the latter: to gain an understanding of interpretation accuracy across the four outputs, we used a simplified error-based approach adapted from Barik’s typology of errors and from later methods such as the NER and NTR models (Romero-Fresco and Pöchhacker 2017) and their adaptations for interpreting research (Davitti and Sandrelli 2020; Korybski and Davitti 2024).
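For orientation, the NTR model cited above expresses accuracy as the proportion of the target output left intact after deducting weighted translation (T) and recognition (R) error scores from the word count (N); the sketch below reproduces that general formula, not the simplified scoring actually applied in this study:

\[ \text{Accuracy}_{\mathrm{NTR}} = \frac{N - T - R}{N} \times 100\% \]

In the NTR model, individual errors are typically weighted by severity (minor, major, critical) before being summed into T and R.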


A count of error instances in the transcripts reveals a clear quality sequence: the accredited interpreters outperformed the student interpreters, who in turn outperformed the MI systems. As regards error weight, machine-generated errors were more frequently of a major or meaning-distorting nature. This paper presents examples of common MI error types, including misrecognitions propagated from automatic speech recognition, literal translations causing syntactic and stylistic distortions, redundant voicing of punctuation marks, random language switches, and gender bias. Some errors clearly stem from a lack of contextual memory. In short, the overall speech-to-speech performance of the two MI systems lacked the flexibility, contextual awareness, and reformulation strategies characteristic of human interpreters. The findings suggest that, as of mid-2025 and given this experimental setup, MI in the PL-EN language pair remains far from human-like performance despite clear technological progress.
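Purely as an illustration of how such a comparison can be operationalised, the short Python sketch below tallies severity-weighted error instances per output and converts them into an NTR-style accuracy rate. The severity labels, weights, word counts and error counts are hypothetical placeholders, not the categories or data used in this study.

# Illustrative sketch only: severity labels, weights and counts are
# hypothetical placeholders, not the scoring scheme or data of this study.
ERROR_WEIGHTS = {"minor": 0.25, "major": 0.5, "critical": 1.0}

def weighted_errors(counts: dict) -> float:
    """Sum error instances weighted by assumed severity."""
    return sum(ERROR_WEIGHTS[sev] * n for sev, n in counts.items())

def accuracy_rate(word_count: int, counts: dict) -> float:
    """NTR-style accuracy: (N - weighted errors) / N * 100."""
    return (word_count - weighted_errors(counts)) / word_count * 100

# Hypothetical per-output error tallies over transcripts of ~5,000 words.
outputs = {
    "accredited interpreters": {"minor": 10, "major": 3, "critical": 0},
    "student interpreters":    {"minor": 18, "major": 7, "critical": 1},
    "MI system A":             {"minor": 25, "major": 14, "critical": 6},
    "MI system B":             {"minor": 28, "major": 16, "critical": 7},
}
for name, counts in outputs.items():
    print(f"{name}: {accuracy_rate(5000, counts):.2f}%")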

Article Details
  • Section: Articles
References
Alonso-Bacigalupe, L. (2023). Joining forces for quality assessment in simultaneous interpreting: the NTR model. Sendebar: Revista de la Facultad de Traducción e Interpretación, 34, pp. 198–216.
Barik, H.C. (1971). A Description of Various Types of Omissions, Additions and Errors of Translation Encountered in Simultaneous Interpretation. Meta, 16(4), pp. 199–210. https://doi.org/10.7202/001972ar
Barik, H.C. (1975). Simultaneous Interpretation: Qualitative and Linguistic Data. Language and Speech, 18(3), pp. 272–297.
Collados Aís, Á. (2018) ‘Quality assessment and intonation in simultaneous interpreting’, MonTI: Monografías de Traducción e Interpretación, Special Issue 3.
Davitti, E. and Sandrelli, A. (2020). Embracing the Complexity: A Pilot Study on Interlingual Respeaking. Journal of Audiovisual Translation, 3(2), pp. 103–139. https://doi.org/10.47476/jat.v3i2.2020.135
Gieshoff, A.C. (2022) ‘Interpreting accuracy revisited: a refined approach to interpreting performance analysis’, Perspectives, 32(2), pp. 210–228.
Han, C. (2025). Quality assessment in multilingual, multimodal, and multiagent translation and interpreting (QAM3 T&I): Proposing a unifying framework for research. Interpreting and Society: An Interdisciplinary Journal, 5(1), 27-55. https://doi.org/10.1177/27523810251322645.
Kalina, S. (2005) ‘Quality assurance for interpreting processes’, Meta, 50(2), pp. 768–784.
Kopczyński, A. (1994). Quality in Conference Interpreting: Some Pragmatic Problems. In: Lambert, S. and Moser-Mercer, B. (eds) Bridging the Gap: Empirical Research on Simultaneous Interpretation. Amsterdam: John Benjamins, pp. 87–99.
Korybski, T. and Davitti, E. (2024). Human Agency in Live Subtitling through Respeaking: Towards a Taxonomy of Effective Editing. Journal of Audiovisual Translation, 7(2), pp. 1–22. https://doi.org/10.47476/jat.v7i2.2024.302
Korybski, T., Davitti, E., Orasan, C. and Braun, S. (2022). A Semi-Automated Live Interlingual Communication Workflow Featuring Intralingual Respeaking: Evaluation and Benchmarking. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), Marseille, France: European Language Resources Association, pp. 4405–4413.
Kurz, I. (1993). Conference interpretation: Expectations of Different User Groups. The Interpreters’ Newsletter 5, pp. 13-21.
Liang, L., & Lu, S. (2025). The evaluation and reception of the translation quality of three translation modalities in live-streaming contexts: computer-assisted simultaneous interpreting, machine translation (MT) with human revision and raw MT. The Translator, 1–19. https://doi.org/10.1080/13556509.2025.2494566Way, A. (2018) ‘Quality expectations of machine translation’, in J. Moorkens et al. (eds) Translation Quality Assessment. Cham: Springer International Publishing.
Pöchhacker, F. (2016) Introducing Interpreting Studies. 2nd edition. London: Routledge.
Romero-Fresco, P. and Pöchhacker, F. (2017) ‘Quality assessment in interlingual live subtitling: The NTR Model’, Linguistica Antverpiensia, New Series – Themes in Translation Studies, 15, pp. 149–167.
Way, A. (2018) ‘Quality expectations of machine translation’, in J. Moorkens et al. (eds) Translation Quality Assessment. Cham: Springer International Publishing.