
Why LLMs need Wolof, Fang, and Lingala

GPT-4, Claude, Gemini — the most capable language models in the world — are built overwhelmingly on English text. When they handle other languages, they rely on translated or web-scraped data that rarely captures natural speech patterns, dialectal variation, or cultural context.

For the 4 billion people who speak languages underrepresented in AI training data, this means AI assistants that do not understand them, speech recognition that fails, and translation tools that produce awkward, sometimes offensive output.

The problem is not technical. The models can learn any language given enough quality data. The problem is data supply.

Web scraping does not work for languages like Wolof (10 million speakers, minimal web presence), Fang (1 million speakers, almost no digital text), or Lingala (25 million speakers, largely oral tradition). You cannot scrape what does not exist online.

OraData solves this by paying native speakers to produce structured audio corpora. Our protocol requires studio-grade recording (48 kHz, 24-bit), IPA transcription aligned at the word level, and validation by a second native speaker from a different region to catch dialectal bias.
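
To make those requirements concrete, here is a minimal sketch in Python of the kind of checks a submission could pass before acceptance. The function and field names are illustrative, not OraData's actual tooling, and it assumes recordings arrive as PCM WAV files with a transcript of timestamped IPA tokens.

```python
import wave

# Protocol thresholds from the post; everything else is illustrative.
REQUIRED_SAMPLE_RATE = 48_000   # 48 kHz
REQUIRED_SAMPLE_WIDTH = 3       # 24-bit PCM = 3 bytes per sample

def check_audio_format(path: str) -> list[str]:
    """Return protocol violations for one recording's audio format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != REQUIRED_SAMPLE_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 48000 Hz")
        if wav.getsampwidth() != REQUIRED_SAMPLE_WIDTH:
            problems.append(f"audio is {8 * wav.getsampwidth()}-bit, expected 24-bit")
    return problems

def check_alignment(transcript: list[dict]) -> list[str]:
    """Every IPA token needs word-level start/end timestamps (in seconds)."""
    return [
        f"token {i} ({tok.get('ipa', '?')}) is missing timestamps"
        for i, tok in enumerate(transcript)
        if "start" not in tok or "end" not in tok
    ]

def check_validation(speaker_region: str, validator_region: str) -> list[str]:
    """The second native speaker must come from a different region."""
    if speaker_region == validator_region:
        return ["validator shares the speaker's region, so the dialectal-bias check fails"]
    return []
```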

Each recording session captures metadata: speaker age, region of origin, dialect self-identification, education level, and recording environment. This lets AI teams build models that understand not just a language, but its internal diversity.
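
As a sketch of what one session record might look like (field names and example values are ours, not OraData's published schema), the metadata could be captured as a small structured object and exported alongside the audio:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class SessionMetadata:
    # Fields listed above; names, types, and example values are illustrative.
    language: str
    speaker_age: int
    region_of_origin: str
    dialect_self_identification: str
    education_level: str
    recording_environment: str

session = SessionMetadata(
    language="Wolof",
    speaker_age=34,
    region_of_origin="Dakar",
    dialect_self_identification="urban Wolof",
    education_level="secondary",
    recording_environment="studio",
)

# Stored next to the recording so datasets can be filtered or balanced by any field.
print(json.dumps(asdict(session), ensure_ascii=False, indent=2))
```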

We currently collect data in Wolof (Senegal), Fang (Gabon/Cameroon), Lingala (Congo), Hausa (Nigeria), Yoruba (Nigeria), Beti (Cameroon), and Somali (Horn of Africa). More languages are added as qualified native-speaker teams are onboarded.

The goal is not to digitize these languages for preservation — though that is a welcome side effect. The goal is to make AI work for everyone, not just English speakers. And that starts with real data from real speakers.