🗣 COMPANY PROFILE · AI-GENERATED EDITORIAL

22 Languages, One Model: How Sarvam is Building India’s AI Foundation

February 6, 2026 · 5 min read · bharath.ai

Sarvam AI · NLP · Indic Languages

When Vivek Raghavan and Pratyush Kumar founded Sarvam AI in 2023, the prevailing wisdom in Silicon Valley was that large language models would solve multilingual AI as a side effect of scale. Train a big enough model on enough internet data, and Hindi, Tamil, and Bengali would come for free. Sarvam’s founders — both veterans of Microsoft Research India and the AI4Bharat initiative — knew this was wrong.

The problem isn’t compute or scale. It’s data. The internet is overwhelmingly English. Hindi, the fourth most spoken language on Earth, accounts for less than 0.1% of the web’s content. For languages like Maithili, Dogri, or Santali — each spoken by millions — the figure is effectively zero. Training a general-purpose LLM on internet data and expecting it to serve India’s 22 scheduled languages is like training a doctor on English medical textbooks and expecting them to practice in rural Bihar.

Sarvam’s approach starts from a different premise. Rather than fine-tuning existing English-centric models, the company builds from the ground up with Indic languages as the primary training objective. Its Sarvam-2B model, a 2-billion-parameter model small enough to fit on a smartphone, was trained on a curated corpus of 4.2 trillion tokens across all 22 scheduled languages, sourced from government documents, literary works, news archives, and transcribed speech. The result: 81.6% accuracy on IndicBench, outperforming models ten times its size.

Their flagship product, Sarvam Vision, takes this further. At 84.3% accuracy on Indic OCR across 22 languages, it surpasses GPT-5.2 — the first time an Indian model has beaten Silicon Valley’s best on any major benchmark. The implications are practical: a government official in Meghalaya can now photograph a Khasi-language document and get an accurate digital transcription. A Tamil shopkeeper can point a camera at a handwritten ledger and have it automatically entered into their accounting software.

Sarvam Bulbul V3, their text-to-speech system, generates natural-sounding speech in all 22 languages — including tonal variations, dialectal markers, and code-switching patterns that are ubiquitous in Indian speech. A user in Hyderabad who naturally mixes Telugu, Hindi, and English in conversation can interact with Bulbul without adjusting their speech pattern.

Sarvam has raised $41 million in Series A funding ($53.8 million total since inception), which will go toward expanding compute infrastructure, scaling their enterprise API platform (250+ clients including SBI, HDFC Bank, and the Government of India), and investing in their next-generation multilingual reasoning model, codenamed Sarvam-3. As a leading Indian AI startup, Sarvam is demonstrating that the world’s most capable AI models can be built in India, for India’s unique problems.

Sarvam’s mission rests on a civilisational premise: India’s AI future cannot be an import. It must be built by Indians, in Indian languages, for Indian problems. The market’s enthusiasm for Indian-led AI efforts suggests the momentum is real.

Sources & References

Sarvam AI Blog · Inc42 · AI4Bharat · HuggingFace

This article was generated by AI, synthesising information from the sources cited above. All claims are grounded in publicly verifiable data. Editorial oversight applied.
