Proceedings of the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind

Key takeaway

Researchers found that machine-translating English Theory of Mind (ToM) benchmarks into Japanese can distort how capable large language models appear at reasoning about others' mental states: translation errors, especially those that alter meaning, lowered evaluation scores, so building reliable non-English ToM benchmarks requires careful attention to translation quality.

Quick Explainer

The core idea of this work is to investigate how the quality of machine translation affects the performance of large language models (LLMs) on Theory of Mind (ToM) benchmarks. ToM benchmarks, which evaluate an AI's ability to reason about others' mental states, are primarily constructed in English. The researchers translated two ToM benchmarks from English to Japanese using machine translation, then evaluated LLMs on both the original English and the translated Japanese versions. By manually correcting translation errors in a subset of the Japanese texts, they could quantify the impact of translation quality on LLM performance. This provides crucial insights for creating reliable ToM benchmarks in non-English languages, where the effects of translation errors must be considered.

Deep Dive

Investigating the Effects of Translation Quality on LLM Performance in Machine-Translated Theory of Mind Benchmarks

Overview

  • Recent years have seen the development of various benchmarks to evaluate the Theory of Mind (ToM) capabilities of large language models (LLMs)
  • Most ToM benchmarks are constructed in English, leaving a shortage of such benchmarks for other languages
  • A straightforward approach to creating non-English ToM benchmarks is to machine-translate existing English benchmarks, but the impact of translation errors on ToM evaluation remains unclear

Problem & Context

  • It is unknown how translation quality affects the performance of LLMs on ToM benchmarks
  • Understanding the effects of translation quality is crucial for creating reliable ToM benchmarks in non-English languages

Methodology

  • Selected two ToM benchmarks, ToMBench and FANToM, with distinct characteristics (narrative texts vs. dialogues)
  • Machine-translated the benchmarks from English to Japanese
  • Evaluated multiple state-of-the-art LLMs on the original English and machine-translated Japanese versions
  • Manually post-edited a subset of the Japanese translations to correct errors
  • Compared LLM accuracy before and after post-editing to investigate the impact of translation quality
  • Analyzed the relationship between the number of translation errors in each question and the correctness of LLM answers using correlation analysis
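The final step above can be sketched in code. This is an illustrative sketch, not the authors' actual analysis harness: the error counts and correctness labels below are invented toy data, and the correlation is computed as a plain Pearson coefficient (which, with one binary variable, is the point-biserial correlation often used for this kind of error-vs-correctness analysis).

```python
# Sketch: correlate per-question translation-error counts with whether the
# LLM answered that question correctly. All data here is hypothetical.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: translation errors found in each question, and whether the
# model answered it correctly (1) or not (0).
error_counts = [0, 0, 1, 2, 3, 0, 4, 1]
correct      = [1, 1, 1, 0, 0, 1, 0, 1]

r = pearson_r(error_counts, correct)
# A negative r would indicate that questions with more translation
# errors are answered correctly less often.
```

With one binary variable, Pearson's r and the point-biserial coefficient coincide, so no separate formula is needed.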

Data & Experimental Setup

  • Benchmarks:
    • ToMBench (Chen et al. 2024): Uses narrative texts as context
    • FANToM (Kim et al. 2023): Uses dialogues as context
  • LLMs evaluated:
    • Multiple recent state-of-the-art models
  • Translation:
    • Machine-translated the benchmarks from English to Japanese
    • Manually post-edited a subset of the Japanese translations to correct errors
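The three-way comparison implied by this setup — English originals vs. raw machine translations vs. post-edited translations — can be sketched as follows. This is a minimal illustration under assumed data shapes; `toy_model` is a hypothetical stand-in for an actual LLM call, and the items are invented.

```python
# Sketch: compute a model's accuracy on each version of a benchmark.
# Benchmark items are (question, gold_answer) pairs; ask_model maps a
# question to the model's answer. Everything here is toy data.

def accuracy(items, ask_model):
    """Fraction of benchmark items the model answers correctly."""
    correct = sum(1 for q, gold in items if ask_model(q) == gold)
    return correct / len(items)

english     = [("Q1-en", "A"), ("Q2-en", "B")]   # original benchmark
mt_japanese = [("Q1-ja", "A"), ("Q2-ja", "B")]   # machine-translated
post_edited = [("Q1-ja*", "A"), ("Q2-ja*", "B")] # manually corrected subset

# A toy "model": perfect on English, one error introduced by raw MT,
# recovered after post-editing.
toy_model = {"Q1-en": "A", "Q2-en": "B",
             "Q1-ja": "A", "Q2-ja": "C",
             "Q1-ja*": "A", "Q2-ja*": "B"}.get

for name, items in [("EN", english), ("MT-JA", mt_japanese), ("PE-JA", post_edited)]:
    print(name, accuracy(items, toy_model))
```

Comparing the three accuracy figures isolates how much of the score drop is attributable to translation errors rather than to the language change itself.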

Results

  • Machine translation consistently lowered evaluation scores on ToMBench across all models
  • The impact of machine translation on FANToM was limited
  • Manually post-editing the translations to correct errors improved LLM performance on ToMBench
  • Accuracy-related translation errors (those that alter the meaning of the source text) substantially degraded evaluation scores

Interpretation

  • The impact of translation quality differs across ToM benchmarks, likely due to their distinct characteristics
  • Accuracy-related translation errors have a significant impact on ToM evaluation results
  • Creating reliable non-English ToM benchmarks requires careful consideration of translation quality

Limitations & Uncertainties

  • Only two ToM benchmarks were investigated
  • Manual post-editing was only performed on a subset of the translations
  • Relationship between specific types of translation errors and evaluation performance was not fully explored

What Comes Next

  • Expand the investigation to more ToM benchmarks and language pairs
  • Conduct a more comprehensive analysis of the impact of different types of translation errors
  • Explore methods to improve machine translation quality for ToM benchmarks