Proceedings of the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind

Key takeaway

Researchers found that machine-translating English Theory of Mind (ToM) benchmarks into Japanese can distort how capable large language models appear at reasoning about others' mental states: translation errors, especially those that alter meaning, lowered evaluation scores, so building reliable non-English ToM benchmarks requires careful attention to translation quality.

Quick Explainer

The core idea of this work is to investigate how the quality of machine translation affects the performance of large language models (LLMs) on Theory of Mind (ToM) benchmarks. ToM benchmarks, which evaluate an AI's ability to reason about others' mental states, are primarily constructed in English. The researchers translated two ToM benchmarks from English to Japanese using machine translation, then evaluated LLMs on both the original English and the translated Japanese versions. By manually correcting translation errors in a subset of the Japanese texts, they could quantify the impact of translation quality on LLM performance. This provides crucial insights for creating reliable ToM benchmarks in non-English languages, where the effects of translation errors must be considered.

Deep Dive

Investigating the Effects of Translation Quality on LLM Performance in Machine-Translated Theory of Mind Benchmarks

Overview

  • Recent years have seen the development of various benchmarks to evaluate the Theory of Mind (ToM) capabilities of large language models (LLMs)
  • Most ToM benchmarks are constructed in English, leaving a shortage of such benchmarks for other languages
  • A straightforward approach to creating non-English ToM benchmarks is to machine-translate existing English benchmarks, but the impact of translation errors on ToM evaluation remains unclear

Problem & Context

  • It is unknown how translation quality affects the performance of LLMs on ToM benchmarks
  • Understanding the effects of translation quality is crucial for creating reliable ToM benchmarks in non-English languages

Methodology

  • Selected two ToM benchmarks, ToMBench and FANToM, with distinct characteristics (narrative texts vs. dialogues)
  • Machine-translated the benchmarks from English to Japanese
  • Evaluated multiple state-of-the-art LLMs on the original English and machine-translated Japanese versions
  • Manually post-edited a subset of the Japanese translations to correct errors
  • Compared LLM accuracy before and after post-editing to investigate the impact of translation quality
  • Analyzed the relationship between the number of translation errors in each question and the correctness of LLM answers using correlation analysis
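The final step above can be sketched in code. This is an illustrative sketch, not the authors' actual analysis harness: the error counts and correctness labels below are invented toy data, and the correlation is computed as a plain Pearson coefficient (which, with one binary variable, is the point-biserial correlation often used for this kind of error-vs-correctness analysis).

```python
# Sketch: correlate per-question translation-error counts with whether the
# LLM answered that question correctly. All data here is hypothetical.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: translation errors found in each question, and whether the
# model answered it correctly (1) or not (0).
error_counts = [0, 0, 1, 2, 3, 0, 4, 1]
correct      = [1, 1, 1, 0, 0, 1, 0, 1]

r = pearson_r(error_counts, correct)
# A negative r would indicate that questions with more translation
# errors are answered correctly less often.
```

With one binary variable, Pearson's r and the point-biserial coefficient coincide, so no separate formula is needed.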

Data & Experimental Setup

  • Benchmarks:
    • ToMBench (Chen et al. 2024): Uses narrative texts as context
    • FANToM (Kim et al. 2023): Uses dialogues as context
  • LLMs evaluated:
    • Multiple recent state-of-the-art models
  • Translation:
    • Machine-translated the benchmarks from English to Japanese
    • Manually post-edited a subset of the Japanese translations to correct errors
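The three-way comparison implied by this setup — English originals vs. raw machine translations vs. post-edited translations — can be sketched as follows. This is a minimal illustration under assumed data shapes; `toy_model` is a hypothetical stand-in for an actual LLM call, and the items are invented.

```python
# Sketch: compute a model's accuracy on each version of a benchmark.
# Benchmark items are (question, gold_answer) pairs; ask_model maps a
# question to the model's answer. Everything here is toy data.

def accuracy(items, ask_model):
    """Fraction of benchmark items the model answers correctly."""
    correct = sum(1 for q, gold in items if ask_model(q) == gold)
    return correct / len(items)

english     = [("Q1-en", "A"), ("Q2-en", "B")]   # original benchmark
mt_japanese = [("Q1-ja", "A"), ("Q2-ja", "B")]   # machine-translated
post_edited = [("Q1-ja*", "A"), ("Q2-ja*", "B")] # manually corrected subset

# A toy "model": perfect on English, one error introduced by raw MT,
# recovered after post-editing.
toy_model = {"Q1-en": "A", "Q2-en": "B",
             "Q1-ja": "A", "Q2-ja": "C",
             "Q1-ja*": "A", "Q2-ja*": "B"}.get

for name, items in [("EN", english), ("MT-JA", mt_japanese), ("PE-JA", post_edited)]:
    print(name, accuracy(items, toy_model))
```

Comparing the three accuracy figures isolates how much of the score drop is attributable to translation errors rather than to the language change itself.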

Results

  • Machine translation consistently lowered evaluation scores on ToMBench across all models
  • The impact of machine translation on FANToM was limited
  • Manually post-editing the translations to correct errors improved LLM performance on ToMBench
  • Accuracy-related translation errors (those that alter the meaning of the source text) substantially degraded evaluation scores

Interpretation

  • The impact of translation quality differs across ToM benchmarks, likely due to their distinct characteristics
  • Accuracy-related translation errors have a significant impact on ToM evaluation results
  • Creating reliable non-English ToM benchmarks requires careful consideration of translation quality

Limitations & Uncertainties

  • Only two ToM benchmarks were investigated
  • Manual post-editing was only performed on a subset of the translations
  • Relationship between specific types of translation errors and evaluation performance was not fully explored

What Comes Next

  • Expand the investigation to more ToM benchmarks and language pairs
  • Conduct a more comprehensive analysis of the impact of different types of translation errors
  • Explore methods to improve machine translation quality for ToM benchmarks