Cohere Transcribe Just Made Whisper Obsolete
While everyone's debating which LLM to use, Cohere quietly dropped a speech recognition model that tops the HuggingFace Open ASR Leaderboard — and it's completely open source.
Cohere Transcribe is a 2B-parameter Conformer-based model, trained from scratch, and by that leaderboard's average word error rate (WER) it's now the most accurate dedicated ASR system you can download. No API required. No paywall. Just Apache 2.0.
Why This Matters for Builders
Speech-to-text has been dominated by two camps: OpenAI's Whisper (open but aging) and commercial APIs (ElevenLabs, Google, Azure). If you're building anything that touches audio (meeting notes, podcast search, voice agents, accessibility tools), you've been stuck choosing between "free but mediocre" and "great but expensive."
Cohere Transcribe breaks that trade-off. On average word error rate, it beats Whisper Large v3, ElevenLabs Scribe v2, Qwen3-ASR, and even Zoom Scribe v1. And you can run it on your own hardware.
The Numbers
Here's how it stacks up on the HuggingFace Open ASR Leaderboard across 8 real-world benchmarks:
| Model | Avg WER (%) |
|---|---|
| Cohere Transcribe | 5.42 |
| Zoom Scribe v1 | 5.47 |
| IBM Granite 4.0 1B Speech | 5.52 |
| NVIDIA Canary Qwen 2.5B | 5.63 |
| Whisper Large v3 | Higher (figure not listed) |
The model handles accented speech (VoxPopuli), multi-speaker meetings (AMI), and financial jargon (Earnings22). These aren't clean lab recordings; they're the real-world conditions where most ASR systems fall apart.
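If WER is new to you, it's the word-level edit distance between what was said and what the model wrote, divided by the number of words actually spoken. Here's a minimal sketch of that calculation. It only lowercases and splits on whitespace, which is a simplifying assumption; leaderboards typically normalize punctuation and numerals more carefully before scoring, so don't expect this to reproduce the table above exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "the cat sat on the mat"
    transcribed = "the cat sat on a mat"
    print(f"WER: {100 * word_error_rate(spoken, transcribed):.2f}%")
```

Assuming the table's figures are percentages, an average of 5.42 works out to roughly five or six word errors per hundred words spoken.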
What You Can Build With It
Since it's a downloadable 2B model under Apache 2.0, which permits commercial use and modification, the range of things you can build is wide:
- Meeting transcription apps — run locally for privacy-conscious teams
- Podcast search engines — index audio content without cloud costs
- Voice agent backends — pair with an LLM for real-time voice AI
- Accessibility tools — real-time captioning without API dependencies
- Content moderation — scan audio for policy violations at scale
- Subtitle generation — automate captions for video content
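To make the subtitle idea concrete, here's a small sketch that turns timed transcript segments into SubRip (SRT) text. It assumes you already have `(start_seconds, end_seconds, text)` tuples from somewhere; the article doesn't say whether or how the model emits timestamps, so the segment format and both function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as the HH:MM:SS,mmm form that SRT files use."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 1.25, "Hello"), (1.25, 2.0, "world")]
    print(to_srt(demo))
```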
The Conformer Architecture
Under the hood, Cohere Transcribe uses a Conformer encoder (for acoustic feature extraction) paired with a lightweight Transformer decoder (for text generation). Audio goes in as a log-Mel spectrogram, and text comes out.
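For a feel of what "log-Mel spectrogram" means in practice, here's a toy, dependency-free sketch: slice the waveform into overlapping windows, take each window's power spectrum, pool the frequency bins into triangular Mel-spaced bands, then take the log. The frame size, hop, and band count are illustrative values chosen to keep it fast. The article doesn't state the model's real front-end settings, and production code would use an FFT library rather than this naive DFT.

```python
import cmath
import math


def hz_to_mel(freq_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int):
    """Triangular filters with centers evenly spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        low, center, high = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        rising = [(f - low) / (center - low) for f in bin_hz]
        falling = [(high - f) / (high - center) for f in bin_hz]
        bank.append([max(0.0, min(r, d)) for r, d in zip(rising, falling)])
    return bank


def power_spectrum(frame):
    """Naive O(n^2) DFT; fine for a toy, far too slow for real audio."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(total) ** 2)
    return spectrum


def log_mel(signal, sample_rate=16000, n_fft=64, hop=32, n_mels=8):
    """Return one list of n_mels log energies per analysis frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n_fft) for t in range(n_fft)]
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(n_fft)]
        spectrum = power_spectrum(frame)
        frames.append([
            math.log(sum(w * p for w, p in zip(band, spectrum)) + 1e-10)
            for band in bank
        ])
    return frames


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(256)]
    features = log_mel(tone)
    print(len(features), "frames x", len(features[0]), "Mel bands")
```

The output is a time-by-frequency grid, which is the kind of 2D input an encoder like this consumes in place of the raw waveform.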
Conformer models have been quietly outperforming pure Transformer approaches for ASR since 2020. They combine the local feature extraction of CNNs with the global context of attention — which is exactly what speech recognition needs.
The model was trained from scratch with cross-entropy loss. No distillation. No fine-tuning an existing model. This is a purpose-built system, and it shows in the benchmarks.
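"Cross-entropy loss" for a text decoder usually means this: at each output position the model produces a score for every token in the vocabulary, and the loss is the negative log-probability it assigned to the correct token, averaged over the sequence. The sketch below is the textbook version of that objective with a made-up four-token vocabulary. The article gives no training details beyond the loss name, so this is not Cohere's training code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target token at each decoder step."""
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one target token per decoder step")
    losses = [
        -math.log(softmax(logits)[target])
        for logits, target in zip(step_logits, target_ids)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two decoder steps over a hypothetical 4-token vocabulary.
    logits = [[4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
    print("confident and right:", sequence_cross_entropy(logits, [0, 2]))
    print("confident and wrong:", sequence_cross_entropy(logits, [1, 3]))
```

The loss falls toward zero as the model puts more probability on the right token, and grows when it is confidently wrong.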
14 Languages, Not Just English
Coverage includes English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Chinese (Mandarin), Japanese, Korean, Vietnamese, and Arabic. That's a serious chunk of the global population.
Most open-source ASR models focus heavily on English and treat everything else as an afterthought. Cohere built multilingual support into the training from day one.
Getting Started
The model is on HuggingFace right now:
pip install transformers torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
model = AutoModelForSpeechSeq2Seq.from_pretrained(
"CohereLabs/cohere-transcribe-03-2026"
)
processor = AutoProcessor.from_pretrained(
"CohereLabs/cohere-transcribe-03-2026"
)
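Loading the model is only half the job, so here's a hedged sketch of the rest: read a WAV file with the standard library, then hand it to the processor and model. The `processor(...)`, `model.generate(...)`, and `processor.batch_decode(...)` calls are the generic pattern that Whisper-style speech checkpoints follow in `transformers`. Whether this particular checkpoint works the same way is an assumption, as is the idea that it accepts whatever sample rate your file has, so check the model card before relying on it. The helper names `read_wav_mono` and `transcribe` are made up for this example.

```python
import struct
import wave

MODEL_ID = "CohereLabs/cohere-transcribe-03-2026"  # checkpoint name from the snippet above


def read_wav_mono(path):
    """Read a 16-bit PCM WAV into floats in [-1, 1], averaging channels to mono."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    mono = [
        sum(samples[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(samples), channels)
    ]
    return mono, rate


def transcribe(path):
    """Transcribe one WAV file. Downloads the model on first use."""
    # Imported lazily so read_wav_mono works without the heavy dependencies.
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_ID)
    audio, rate = read_wav_mono(path)

    # ASSUMPTION: the processor takes raw audio plus sampling_rate, and the
    # model supports generate(), as Whisper-style checkpoints do.
    inputs = processor(audio, sampling_rate=rate, return_tensors="pt")
    token_ids = model.generate(**inputs)
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0]
```

Calling `transcribe("meeting.wav")` would then return the transcript as a string, if those assumptions hold for this checkpoint.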
If you want managed inference without the ops headache, Cohere also offers it through Model Vault — their secure, fully managed platform. But the download is right there. No gated repos, no application forms.
The Bigger Picture
Speech is becoming a first-class modality in AI workflows. Not as a novelty — as a core input. Meeting transcription, real-time translation, voice-driven agents, audio search. These aren't demos anymore; they're products people pay for.
Cohere releasing this as open source with Apache 2.0 is a power move. They're betting that infrastructure wins by being ubiquitous, not by being locked down. It worked for Linux. It worked for React. It might work for ASR too.
If you've been putting off adding speech capabilities to your project because the tooling wasn't there — it's there now.