Curious Now

PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

Artificial Intelligence · Computing

Key takeaway

Researchers created a massive Pashto language dataset to help improve AI language models for the 60 million people who speak this underrepresented language.

Quick Explainer

PashtoCorp is a large-scale corpus for Pashto, a low-resource language, assembled through a reproducible pipeline. The corpus was built by filtering a diverse set of sources, including web crawls, news, and digitized books, to retain only high-quality Pashto content. This expansive corpus enables substantial gains in language model performance, as demonstrated on Pashto natural language processing tasks such as named entity recognition and reading comprehension. The key novelty lies in the corpus's size and breadth (40 times larger than the previous largest Pashto dataset) and in the careful curation process that ensures high-quality data for language model pretraining.

Deep Dive

Technical Deep Dive: PashtoCorp

Overview

PashtoCorp is a 1.25-billion-word corpus for Pashto, a low-resource language spoken by 60 million people. It was assembled from 39 diverse sources, including web crawls, news websites, and digitized books and reports, through a reproducible pipeline. The corpus is 40x larger than the previous largest dedicated Pashto dataset, enabling substantial gains in language model performance.

Methodology

The corpus was built using a three-stage pipeline:

  1. Language Identification: Documents in which fewer than 70% of tokens were in Pashto script were discarded.
  2. Deduplication: Exact document duplicates were removed using SHA-256 hashing.
  3. Quality Filtering: Documents with fewer than 10 tokens were removed.

The resulting corpus contains 2.81 million documents across 39 sources, spanning categories like web crawls, news outlets, aggregators, and digitized books and reports.
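The three stages above can be sketched as a single filtering pass. This is a minimal illustration, not the authors' released pipeline: the 70% script-ratio and 10-token thresholds come from the article, while the whitespace tokenizer, the Unicode-range script check, and all helper names are illustrative assumptions.

```python
import hashlib

# Arabic script block (U+0600-U+06FF) covers the characters Pashto uses.
# Treating this block as "Pashto script" is a simplifying assumption.
PASHTO_RANGE = range(0x0600, 0x0700)

def script_ratio(doc: str) -> float:
    """Fraction of whitespace tokens that are mostly Pashto-script characters."""
    tokens = doc.split()
    if not tokens:
        return 0.0
    hits = sum(
        1 for t in tokens
        if sum(ord(c) in PASHTO_RANGE for c in t) > len(t) / 2
    )
    return hits / len(tokens)

def filter_corpus(docs):
    """Apply the three stages described above to an iterable of documents."""
    seen = set()
    kept = []
    for doc in docs:
        # 1. Language identification: require >= 70% Pashto-script tokens.
        if script_ratio(doc) < 0.70:
            continue
        # 2. Deduplication: drop exact duplicates via SHA-256 of the text.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # 3. Quality filtering: drop documents with fewer than 10 tokens.
        if len(doc.split()) < 10:
            continue
        kept.append(doc)
    return kept
```

Hashing the exact bytes of each document keeps deduplication cheap and order-independent, at the cost of missing near-duplicates; the article describes only exact-duplicate removal.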

Data & Experimental Setup

To evaluate the corpus, the researchers continued pretraining the XLM-R-base model on PashtoCorp and tested the resulting model on several Pashto benchmarks:

  • Intrinsic Evaluation: Measured held-out perplexity and vocabulary overlap with downstream tasks.
  • Extrinsic Evaluation:
    • NER: Evaluated on the WikiANN Pashto dataset, showing a 10% relative improvement in F1 score and a 7x reduction in training variance.
    • Reading Comprehension: Tested on the Belebele Pashto dataset, where the Gemma-3n model achieved 64.6% accuracy, the first published LLM result for this benchmark.
    • Offensive Language Detection: No improvement was observed on the POLD social media dataset, likely due to the domain mismatch between the corpus and task.
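Held-out perplexity, the intrinsic metric above, is the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming per-token log-probabilities are already available (in the paper they would come from scoring held-out text with XLM-R; the function name is illustrative):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log likelihood per held-out token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Example: a uniform model over a 6-word vocabulary scores perplexity ~6.
print(perplexity([math.log(1 / 6)] * 100))
```

Lower is better: the reported drop from 8.08 to 6.06 means the adapted model is, on average, less surprised by unseen Pashto text.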

The researchers also conducted ablation studies to quantify the contribution of different source domains to language model quality and downstream task performance.

Results

The key findings from the PashtoCorp evaluation include:

  • Continued pretraining of XLM-R-base on PashtoCorp reduced held-out perplexity by 25.1%, from 8.08 to 6.06.
  • On the WikiANN Pashto NER task, the pretrained model improved F1 score by 10% relative (19.0% to 21.0%) and reduced training variance by nearly 7x.
  • The largest gains on WikiANN NER were observed with just 50 training sentences, where the pretrained model improved F1 by 27% relative.
  • PashtoCorp covers 97.9% of the entity vocabulary in the WikiANN dataset, including 100% coverage of person names.
  • Ablation studies showed that small, diverse sources like Wikipedia and digitized books contributed disproportionately to vocabulary and downstream task performance, despite making up less than 1.5% of the corpus.
  • No improvements were observed on the POLD social media offensive language detection task, likely due to the domain mismatch between the corpus and task.
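The coverage figures above (97.9% of WikiANN entity vocabulary, versus weaker coverage of POLD's social media vocabulary) can be read as a simple set-overlap statistic. A hedged sketch, with illustrative names; the paper's exact tokenization and vocabulary construction are not specified here:

```python
def vocab_coverage(corpus_vocab, task_tokens):
    """Fraction of the task's unique tokens that also appear in the
    corpus vocabulary. Illustrative version of the coverage metric
    reported in the results above."""
    task_vocab = set(task_tokens)
    if not task_vocab:
        return 0.0
    return len(task_vocab & set(corpus_vocab)) / len(task_vocab)
```

Coverage near 100% (as with WikiANN person names) suggests the corpus saw the task's words during pretraining, while a gap (as with POLD) signals the kind of domain mismatch the authors cite for the missing gains on offensive language detection.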

Limitations & Uncertainties

The authors note several limitations of the current work:

  • The reported NER and perplexity gains are likely lower bounds, as the corpus was only partially utilized during pretraining due to computational constraints.
  • The WikiANN Pashto NER dataset has only 100 training sentences, leading to high instability in the absolute F1 scores.
  • PashtoCorp is predominantly skewed toward Afghanistan-based news and web sources, underrepresenting Pakistani Pashto dialects and more informal/colloquial registers.
  • The lower vocabulary coverage on the POLD social media dataset (91.6%, versus 95.9% on WikiANN) quantitatively explains the lack of performance improvement on that task.

What Comes Next

The authors outline several avenues for future work:

  • Training language models on the full 1.25-billion-word PashtoCorp corpus to maximize performance gains.
  • Developing Pashto-specific generative models and other downstream applications beyond the current benchmarks.
  • Extending corpus coverage to better represent Pakistani Pashto dialects and more informal/colloquial language registers.

All data, code, and trained models from this work have been released publicly, enabling further research and applications for this low-resource language.
