Cohere Transcribe: Diarize + Timestamps (English)
Built by syv.ai, a Danish AI company focused on shipping practical speech and language models. We release open-weights speech models so teams can build on top without leaving their own infrastructure.
This model is CohereLabs/cohere-transcribe-03-2026 fine-tuned to also emit speaker labels and word-aligned timestamps in a single decoder pass, while preserving the base model's transcription quality. It's a drop-in replacement when you need to know who said what and when on short-form audio (≤ 30 s), and pairs with our diarize_long_vllm helper for arbitrary-length recordings.
Recommended deployment: vLLM (see Serving with vLLM). We measured 44× real-time end-to-end on a 10-min clip with one RTX 3090 (decode 113× RTF, embed 16 seg/s), and 249× peak throughput under concurrent load. Transformers works too and is shown first for a minimal example, but the vLLM path is what we run in production.
WE ARE LOOKING FOR COMPUTE PARTNERS TO FURTHER IMPROVE OUR MODELS - REACH OUT IF YOU CAN HELP
| Name | cohere-transcribe-diarize |
|---|---|
| Base model | CohereLabs/cohere-transcribe-03-2026 (Apache 2.0, 2 B params) |
| Architecture | conformer-based encoder-decoder, full fine-tune (no LoRA) |
| Input | audio waveform (16 kHz mono, resampled automatically). Maximum supported clip length: 30 s; longer audio should be processed with sliding windows (see below) |
| Output | special-token stream interleaving speaker IDs, timestamps, and transcribed text, e.g. <|spltoken0|><|t:0.0|> Welcome back to the show.<|t:2.4|><|spltoken1|><|t:2.4|> Thanks for having me.<|t:3.8|> |
| Vocabulary extensions | 8 speaker tokens (<|spltoken0|>…<|spltoken7|>) + 300 timestamp tokens at 100 ms resolution (<|t:0.0|>…<|t:29.9|>) |
| Languages | Primary: English (the diarization + timestamp fine-tune was done exclusively on English supervision). Likely usable (untested by us): the other 13 languages the Cohere Transcribe base supports: Arabic, German, Greek, Spanish, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Vietnamese, Chinese (Mandarin). The base model's multilingual transcription weights are preserved, and the diarization head conditions on language-agnostic speaker acoustics, so segmentation and speaker IDs should transfer; word-level timestamp accuracy will be best on English. Pass the matching language code in the prompt (<|de|>, <|fr|>, …) to switch. |
| License | Apache 2.0 (inherited from base) |
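To make the vocabulary extension concrete, here is a minimal sketch that enumerates the 8 speaker tokens and the 300 timestamp tokens at 100 ms resolution listed in the table. The function names are hypothetical, and the rounding and clamping in `seconds_to_timestamp_token` are assumptions about how a time in seconds could be mapped onto the grid, not behavior confirmed by the model card.

```python
def speaker_tokens(n_speakers: int = 8) -> list[str]:
    """<|spltoken0|> ... <|spltoken7|> for the default of 8 slots."""
    return [f"<|spltoken{i}|>" for i in range(n_speakers)]


def timestamp_tokens(n_steps: int = 300, step_ms: int = 100) -> list[str]:
    """<|t:0.0|> ... <|t:29.9|> for 300 steps of 100 ms."""
    return [f"<|t:{i * step_ms / 1000:.1f}|>" for i in range(n_steps)]


def seconds_to_timestamp_token(t: float, n_steps: int = 300, step_ms: int = 100) -> str:
    """Assumption: round to the nearest grid step and clamp to the valid range."""
    index = min(max(round(t * 1000 / step_ms), 0), n_steps - 1)
    return f"<|t:{index * step_ms / 1000:.1f}|>"


new_tokens = speaker_tokens() + timestamp_tokens()
```

With the defaults, `new_tokens` holds 308 entries, which is the number of rows added on top of the base vocabulary.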
Quick start
pip install transformers==4.57.6 torch huggingface_hub soundfile librosa sentencepiece protobuf
import re
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from transformers.audio_utils import load_audio
MODEL_ID = "syvai/cohere-transcribe-diarize"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID, dtype=torch.bfloat16
).to("cuda").eval()
# Prompt that activates diarization + timestamps. The base Cohere model
# uses special control tokens to switch features on/off; we keep that contract.
# `<|en|><|en|>` is the canonical Cohere prompt: the two slots are
# audio-language + transcript-language; setting them to the same code means
# "transcribe" (different codes would be "translate"). To run on another
# Cohere language, swap BOTH tokens, e.g. `<|de|><|de|>`.
# Each `<|...|>` is a single special token in the tokenizer vocab. Resolve
# via convert_tokens_to_ids; running the prompt string through the tokenizer
# re-tokenizes each marker into 6-12 subword pieces, which weakens the
# control-token signal the model trained on.
PROMPT_TOKENS = [
    "<|startofcontext|>", "<|startoftranscript|>",
    "<|emo:undefined|>", "<|en|>", "<|en|>",
    "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
]
prompt_ids = torch.tensor(
    [[processor.tokenizer.convert_tokens_to_ids(t) for t in PROMPT_TOKENS]]
).to(model.device)
# Load any ≤ 30 s audio clip.
audio = load_audio("clip.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = {k: v.to(model.device, dtype=model.dtype if v.is_floating_point() else None)
          for k, v in inputs.items()}
with torch.inference_mode():
    out = model.generate(
        input_features=inputs["input_features"],
        attention_mask=torch.ones(inputs["input_features"].shape[:2], device=model.device),
        decoder_input_ids=prompt_ids,
        max_new_tokens=400,
        do_sample=False,
        repetition_penalty=1.2,  # baked into generation_config but explicit here
    )
raw = processor.tokenizer.decode(out[0], skip_special_tokens=False)
print(raw)
# → <|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>...
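Since switching language means swapping both language slots, a small helper can keep the two in sync. This is a minimal sketch: `build_prompt_tokens` is a hypothetical name, and the only validation is a basic shape check on the code, because the card does not list the exact token for every supported language.

```python
def build_prompt_tokens(language: str = "en") -> list[str]:
    """Return the control-token prompt with both language slots set to `language`."""
    if not (language.isalpha() and language.islower()):
        raise ValueError(f"expected a lowercase language code, got {language!r}")
    lang = f"<|{language}|>"
    return [
        "<|startofcontext|>", "<|startoftranscript|>",
        "<|emo:undefined|>", lang, lang,  # audio language, transcript language
        "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
    ]


# Joined form, e.g. for an HTTP form field that takes the prompt as one string.
prompt_string = "".join(build_prompt_tokens("en"))
```

For Transformers, pass each element of the returned list through `convert_tokens_to_ids` as in the snippet above rather than tokenizing the joined string.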
Parsing the output into structured segments
SEG_RE = re.compile(r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL)
# Drop the prompt prefix; the diarized text follows <|diarize|>
text = raw.split("<|diarize|>", 1)[-1].replace("<|endoftext|>", "")
segments = [
    {
        "speaker": int(m.group(1)),
        "start": float(m.group(2)),
        "end": float(m.group(4)),
        "text": re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
    }
    for m in SEG_RE.finditer(text)
]
for s in segments:
    print(f"[{s['start']:6.2f}–{s['end']:6.2f}] SPK{s['speaker']:02d} {s['text']}")
Output:
[  0.00–  1.50] SPK00 Welcome back.
[  1.50–  2.40] SPK01 Thanks for having me.
[  2.40–  3.80] SPK00 Let's get into it.
The model uses 8 reusable speaker slots per clip (<|spltoken0|>…<|spltoken7|>). IDs are local to the clip: there is no global identity across separately decoded clips. For long-form audio that's split into windows, re-link windows with the helper below.
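The parsing step can be tried without a GPU by feeding it a literal token stream. The sketch below reuses the segment regex from this section on the example string from the Output row of the table, then sums talk time per clip-local speaker ID; `parse_segments` and `talk_time_per_speaker` are hypothetical helper names.

```python
import re
from collections import defaultdict

SEG_RE = re.compile(
    r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL
)


def parse_segments(raw: str) -> list[dict]:
    """Turn the raw special-token stream into speaker/start/end/text dicts."""
    return [
        {
            "speaker": int(m.group(1)),
            "start": float(m.group(2)),
            "end": float(m.group(4)),
            "text": re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
        }
        for m in SEG_RE.finditer(raw)
    ]


def talk_time_per_speaker(segments: list[dict]) -> dict[int, float]:
    """Sum segment durations per clip-local speaker ID."""
    totals: dict[int, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


sample = (
    "<|spltoken0|><|t:0.0|> Welcome back to the show.<|t:2.4|>"
    "<|spltoken1|><|t:2.4|> Thanks for having me.<|t:3.8|>"
)
segments = parse_segments(sample)
totals = talk_time_per_speaker(segments)
```

Because the IDs are local to one clip, totals like these are only meaningful within a single decoded window unless the speakers have been re-linked first.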
Long-form audio (> 30 s)
Audio longer than 30 s exceeds the encoder's maximum window. Two helpers in this repo do the windowing + cross-chunk speaker matching for you:
- diarize_long_vllm.py: recommended. Calls a local vLLM server concurrently (continuous batching) and reuses one GPU for both decode and embedding. ~44× RTF on a 10-min clip on a single 3090.
- diarize_long.py: transformers-only fallback, no server needed. Slower (~7× RTF on the same clip) but minimal deps.
Both helpers:
- Slide 28 s windows with 2 s overlap over the full audio
- Decode each window with this model
- Embed each parsed segment with ReDimNet2 B6 (12 M params, 0.17 % EER, loaded automatically via torch.hub)
- Cluster embeddings globally with cosine-distance AHC so the same speaker keeps the same ID across windows
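The windowing step can be sketched as follows. This is a minimal illustration of 28 s windows with 2 s overlap (a 26 s hop), not the helpers' actual code; `sliding_windows` is a hypothetical name, and truncating the last window at the end of the audio is an assumption.

```python
def sliding_windows(duration_s: float, chunk_s: float = 28.0, overlap_s: float = 2.0):
    """Return (start, end) pairs in seconds covering the whole recording."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    if duration_s <= 0:
        return []
    hop = chunk_s - overlap_s
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)  # assumption: clip the final window
        windows.append((start, end))
        if end >= duration_s:
            break
        start += hop
    return windows


two_minute_clip = sliding_windows(120.0)
```

For a 120 s recording this yields five windows starting at 0, 26, 52, 78 and 104 s. Segment timestamps decoded inside a window are relative to that window, so they need the window start added back before merging.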
# Assumes vLLM is already serving (see next section)
python diarize_long_vllm.py podcast.wav \
--vllm http://127.0.0.1:8000 \
--model syvai/cohere-transcribe-diarize \
--language en \
--tau 0.45 \
--concurrency 32 \
--embed-batch 32
Or via the offline transformers helper (slower, no server):
from diarize_long import diarize_long_audio
segments = diarize_long_audio(
audio="podcast.wav",
diar_model_id="syvai/cohere-transcribe-diarize",
language="en",
chunk_s=28.0,
overlap_s=2.0,
cluster_threshold=0.45,
)
Additional dependencies for long-form inference: numpy, scipy, soundfile, torchaudio (required by ReDimNet2's feature extractor), plus aiohttp if using diarize_long_vllm.py.
Tuning the clustering threshold. cluster_threshold is the cosine-distance ceiling for AHC merges over ReDimNet2 embeddings. Around 0.45 is a good default for podcast / panel-style audio: a 2-min Bernie Sanders town-hall clip cleanly resolves Bernie as one consistent ID across all 5 sliding windows and the host as a second ID, while short audience interjections get their own IDs. Drop to 0.30–0.35 if the audio has many similar-sounding speakers; raise to 0.50–0.55 for noisier conditions where you'd rather collapse near-duplicate IDs.
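To show what the threshold controls, here is a dependency-free sketch of agglomerative clustering with cosine distance. It assumes average linkage, which the card does not specify, and uses toy 2-D vectors in place of real ReDimNet2 embeddings; `ahc_labels` is a hypothetical name, and the helpers in the repo may be implemented differently.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def ahc_labels(embeddings, threshold=0.45):
    """Merge the closest clusters until the closest pair exceeds `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]

    def average_distance(c1, c2):
        total = sum(cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]

    # Number clusters by their earliest member so labels are deterministic.
    labels = [0] * len(embeddings)
    for label, members in enumerate(sorted(clusters, key=min)):
        for index in members:
            labels[index] = label
    return labels


toy = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99)]
labels_default = ahc_labels(toy, threshold=0.45)
labels_loose = ahc_labels(toy, threshold=0.95)
```

On the toy data the default threshold keeps two speakers apart, while a very loose threshold collapses everything into one ID, which is the trade-off described above.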
Serving with vLLM (recommended)
The transformers code path above works but is single-stream. For production we run this model on vLLM 0.19.0 (note: 0.19.1 is broken). It gives continuous batching, a custom OpenAI-compatible diarized_json response format, and ~25× higher peak throughput than calling model.generate() in a loop.
One-time setup
Two scripts ship with this repo to handle the setup; both are idempotent:
# Download the model locally first, then patch it
hf download syvai/cohere-transcribe-diarize --local-dir cohere-transcribe-diarize
# 1. Reshape the checkpoint files for vLLM compatibility
python fix_for_vllm.py ./cohere-transcribe-diarize
fix_for_vllm.py makes three edits to your local copy:
- tokenizer_config.json: drops the legacy extra_special_tokens list (transformers 4.57+ expects a dict; the actual tokens are still in tokenizer.json).
- config.json: sets head.num_classes and transf_decoder.config_dict.vocab_size to 16684 (the resized vocab).
- model.safetensors: strips the "model." weight-name prefix and drops the BatchNorm num_batches_tracked tensors vLLM's CohereAsr model doesn't register.
# 2. Install vLLM 0.19.0 (NOT 0.19.1, which is broken)
uv pip install "vllm==0.19.0" --torch-backend=cu128
uv pip install librosa
# 3. Patch vLLM's speech_to_text endpoint to add diarized_json
python vllm_diarized_patch.py
vllm_diarized_patch.py applies five edits inside the installed vLLM (also idempotent):
- protocol.py: add "diarized_json" to the AudioResponseFormat enum
- protocol.py: force skip_special_tokens=False in to_sampling_params so <|spltoken*|> and <|t:*|> survive into the response text
- speech_to_text.py: let the validator accept response_format="diarized_json"
- speech_to_text.py: parse the raw token stream with the segment regex and return OpenAI-compatible {task, language, duration, text, segments:[{speaker, start, end, text}], speakers, usage} JSON
- api_router.py: pass JSONResponse returns through unchanged (otherwise the diarized branch's return value gets misinterpreted as a streaming generator and the response body comes out empty)
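The response-building step in the list above can be illustrated with a short sketch. It is not the code in vllm_diarized_patch.py: `to_diarized_json` is a hypothetical name, and details such as sorting the speaker list and rounding the usage seconds are assumptions. It takes segments already parsed into speaker/start/end/text dicts and produces the documented field set.

```python
def to_diarized_json(segments: list[dict], language: str, duration: float) -> dict:
    """Assemble a diarized_json-style payload from parsed segments."""
    labelled = [
        {
            "speaker": f"SPEAKER_{seg['speaker']:02d}",
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
        }
        for seg in segments
    ]
    return {
        "task": "transcribe",
        "language": language,
        "duration": duration,
        "text": " ".join(seg["text"] for seg in labelled),
        "segments": labelled,
        "speakers": sorted({seg["speaker"] for seg in labelled}),  # assumption: sorted
        "usage": {"type": "duration", "seconds": round(duration)},  # assumption: rounded
    }


parsed = [
    {"speaker": 0, "start": 0.0, "end": 2.4, "text": "Welcome back to the show."},
    {"speaker": 1, "start": 2.4, "end": 3.8, "text": "Thanks for having me."},
]
response = to_diarized_json(parsed, language="en", duration=3.8)
```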
Launch the server
vllm serve ./cohere-transcribe-diarize \
--served-model-name syvai/cohere-transcribe-diarize \
--trust-remote-code \
--host 127.0.0.1 --port 8000 \
--gpu-memory-utilization 0.55 # leaves ~10 GB for ReDimNet2 batching
--gpu-memory-utilization 0.55 is the sweet spot on a 24 GB card when you also run ReDimNet2 on the same GPU for long-form. If you only need short-form decode (≤ 30 s, no cross-chunk linking), bump it to 0.85 for better KV cache headroom.
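The memory split behind those two settings is simple arithmetic, sketched here with nominal card capacity. `gpu_memory_split` is a hypothetical helper; real free memory is somewhat lower than the nominal figure because of driver and context overhead.

```python
def gpu_memory_split(total_gb: float, utilization: float) -> tuple[float, float]:
    """Return (GB reserved by vLLM, GB left for everything else)."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    vllm_gb = total_gb * utilization
    return vllm_gb, total_gb - vllm_gb


shared_vllm, shared_rest = gpu_memory_split(24.0, 0.55)  # about 13.2 GB and 10.8 GB
solo_vllm, solo_rest = gpu_memory_split(24.0, 0.85)      # about 20.4 GB and 3.6 GB
```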
Call the API
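The transcription endpoint is OpenAI-compatible; request diarized output by setting response_format=diarized_json and passing the prompt explicitly: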
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
-F "file=@clip.wav" \
-F "model=syvai/cohere-transcribe-diarize" \
-F "language=en" \
-F "response_format=diarized_json" \
--form-string "prompt=<|startofcontext|><|startoftranscript|><|emo:undefined|><|en|><|en|><|pnc|><|noitn|><|timestamp|><|diarize|>"
Response shape (mirrors OpenAI's gpt-4o-transcribe-diarize):
{
"task": "transcribe",
"language": "en",
"duration": 28.0,
"text": "UM I REJECT THE IDEA I REALLY DO ...",
"segments": [
{"speaker": "SPEAKER_00", "start": 2.5, "end": 3.8, "text": "I REALLY DO"},
{"speaker": "SPEAKER_01", "start": 3.6, "end": 15.0, "text": "IT'S ONE OF THINGS THAT BOTHERS ME ..."},
{"speaker": "SPEAKER_02", "start": 15.5, "end": 28.0, "text": "IS RAISING A STARVATION MINIMUM WAGE ..."}
],
"speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
"usage": {"type": "duration", "seconds": 28}
}
The prompt field must be passed explicitly, because vLLM's default prompt builder emits <|nodiarize|>, which suppresses the speaker tokens.
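Segments in the response may overlap in time: in the example above, SPEAKER_00 ends at 3.8 s while SPEAKER_01 starts at 3.6 s. A client that cares about cross-talk could flag those regions with a sketch like the one below, where `find_overlaps` is a hypothetical helper operating on the segments list of a response.

```python
def find_overlaps(segments: list[dict]) -> list[tuple[str, str, float, float]]:
    """Return (speaker_a, speaker_b, overlap_start, overlap_end) for cross-talk."""
    ordered = sorted(segments, key=lambda seg: seg["start"])
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                break  # sorted by start, so no later segment can overlap `first`
            if first["speaker"] != second["speaker"]:
                overlaps.append(
                    (first["speaker"], second["speaker"],
                     second["start"], min(first["end"], second["end"]))
                )
    return overlaps


example_segments = [
    {"speaker": "SPEAKER_00", "start": 2.5, "end": 3.8, "text": "I REALLY DO"},
    {"speaker": "SPEAKER_01", "start": 3.6, "end": 15.0, "text": "..."},
    {"speaker": "SPEAKER_02", "start": 15.5, "end": 28.0, "text": "..."},
]
overlaps = find_overlaps(example_segments)
```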
Measured throughput (RTX 3090, 28 s clips)
| Concurrency | Throughput |
|---|---|
| 1 | 22× audio/wall |
| 8 | 117× |
| 32 | 171× |
| 128 | 249× (peak) |
vLLM does continuous (in-flight) batching automatically: fire concurrent requests at the endpoint and it batches them through one forward pass.
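On the client side, "firing concurrent requests" usually means capping the number of in-flight calls. Here is a minimal asyncio sketch of that pattern; `fake_transcribe` is a hypothetical stand-in for the real HTTP POST so the example runs without a server, and the semaphore plays the role of a concurrency limit such as the one diarize_long_vllm.py exposes.

```python
import asyncio


async def fake_transcribe(window_index: int) -> str:
    """Hypothetical stand-in for a POST to /v1/audio/transcriptions."""
    await asyncio.sleep(0.01)  # simulated network + decode latency
    return f"window-{window_index}"


async def transcribe_all(n_windows: int, concurrency: int = 32) -> list[str]:
    """Run all windows with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(index: int) -> str:
        async with semaphore:
            return await fake_transcribe(index)

    # gather() returns results in submission order, not completion order.
    return await asyncio.gather(*(bounded(i) for i in range(n_windows)))


results = asyncio.run(transcribe_all(8, concurrency=4))
```

Because results come back in submission order, window indices can be used directly to restore the timeline after decoding.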
Training
This model was produced by full fine-tuning of CohereLabs/cohere-transcribe-03-2026 on English diarization data. The base vocabulary was extended with 8 speaker tokens and 300 timestamp tokens at 100 ms resolution; the new rows of the embedding and LM-head matrices were initialised from the existing token embedding statistics.
| Dataset | Rows | Description |
|---|---|---|
| AMI SDM (train split) | 19,928 | Single-distant-microphone meeting recordings, sliding 28 s windows with 14 s hop, up to 4 simultaneous speakers per window. Provides realistic multi-speaker conversation with overlap, hesitations, and turn-taking. |
| LibriSpeech synthetic mix | 11,813 | Synthetic K-speaker mixtures (K weighted 0.2 / 0.3 / 0.3 / 0.2 for K=1…4) constructed from LibriSpeech utterances, with realistic gap silences. Provides clean cross-talk-free speaker examples to anchor the diarization head. |
| Total | 31,741 | All segments are ≤ 30 s and capped at K ≤ 4 speakers. |
Training ran for 2 epochs at peak LR 3e-4 (linear warmup over 100 optimizer steps, then linear decay to 0). Effective batch size 128 (per-device batch 2 × 64 gradient-accumulation steps), bf16, gradient checkpointing, AdamW8bit optimizer. The full fine-tune updates all 2 B parameters. repetition_penalty=1.2 is baked into the generation config and is required at inference; without it, K=4 outputs occasionally loop on a single speaker token.
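The schedule described above can be written out as a small function. This is a sketch under stated assumptions: the step count assumes a single device and that the last partial batch of each epoch is kept, neither of which the card states, and `lr_at` is a hypothetical name rather than the training script's own code.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4, warmup_steps: int = 100) -> float:
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


rows, epochs = 31_741, 2
effective_batch = 2 * 64                                 # per-device batch x accumulation
steps_per_epoch = math.ceil(rows / effective_batch)      # 248 under the assumptions above
total_steps = steps_per_epoch * epochs                   # 496

schedule = [lr_at(step, total_steps) for step in (0, 50, 100, total_steps)]
```

Under these assumptions the warmup covers roughly the first fifth of training, so the peak learning rate is reached at step 100 of about 496.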
Limitations
- 30 s hard cap per decoder pass: use diarize_long for longer audio. The Cohere feature extractor batches longer clips into multiple chunks, which the diarization decoder is not trained to consume.
- K ≤ 4 is well supported; K = 5–8 still emits speaker tokens, but accuracy degrades on dense overlapping speech.
- Real-time factor ≈ 14× on RTX 3090 at bf16; the 2 B autoregressive decoder is the bottleneck. For >100× RTF on long audio, pair with a smaller segmenter (e.g. DiariZen-base) or use this model only on the highlight regions.
- Speaker IDs are local to each generate call. Always cluster embeddings across windows when working with audio that crosses the 30 s boundary.
Citation
If you use this model, please cite Cohere Labs' base release alongside this fine-tune:
@misc{cohere-transcribe-diarize-2026,
author = {{syv.ai}},
title = {Cohere Transcribe: Diarize + Timestamps (English)},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/syvai/cohere-transcribe-diarize}},
}
License
Apache 2.0, inherited from the base model.