Sophea-Canary-ASR
A bilingual (Greek + English), multitask speech model fine-tuned from
nvidia/canary-1b-v2 (1 B, offline
attention-encoder-decoder, FastConformer). One model does four jobs:
| Task | Test set | Score |
|---|---|---|
| Greek ASR | FLEURS (el) | 12.69 % WER |
| Greek ASR | Common Voice 17 (el) | 2.90 % WER |
| English ASR | FLEURS (en) | 12.90 % WER |
| Greek -> English speech translation | FLEURS (el->en) | BLEU 24.13 |
Medical and legal Greek are also covered, since both domains are in the fine-tuning mix. This is the offline, high-accuracy model of a two-model Greek ASR system; a separate streaming 0.6 B model serves the real-time path.
Highlights
- Halves Greek WER vs the streaming baseline (25.5 % -> 12.7 % FLEURS) and drives Common Voice to 2.9 %.
- Bilingual for free. Despite fine-tuning on ~360 h of mostly-Greek data, English ASR held at 12.9 % (base canary-1b-v2: 12.0 %): Canary's source_lang/target_lang prompt conditioning preserves the English pathway (no catastrophic forgetting).
- Translation for free. Greek->English speech translation (BLEU 24.1) was retained from the base through ASR-only fine-tuning: the prompt selects the task, so the AST pathway survives.
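The "halves" and "held" claims can be checked as relative WER changes. The following is a minimal sketch that uses only the figures quoted in the highlights; the helper name `relative_reduction` is illustrative and not part of any library.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative change from `before` to `after` in percent (positive = improvement)."""
    return 100.0 * (before - after) / before


# WER figures (percent) quoted in this card.
greek_fleurs = relative_reduction(25.5, 12.7)    # streaming baseline -> this model
english_fleurs = relative_reduction(12.0, 12.9)  # base canary-1b-v2 -> this model

print(f"Greek FLEURS:   {greek_fleurs:.1f} % relative WER reduction")
print(f"English FLEURS: {english_fleurs:.1f} % (negative = regression)")
```

On these numbers the Greek FLEURS reduction is roughly 50 % relative, while English regresses by about 7.5 % relative (0.9 points absolute).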
Results in context
| Metric | Streaming 0.6 B (real-time) | This model (offline 1 B) |
|---|---|---|
| Greek FLEURS WER | 25.5 % | 12.7 % |
| Greek Common Voice WER | 11.5 % | 2.9 % |
| English FLEURS WER | — | 12.9 % |
| Greek->English | — | BLEU 24 |
All numbers are measured on held-out test sets and scored with literal WER (casing and punctuation count). Note that this is an offline (full-context) model, so its WER is not directly comparable to a streaming model's latency-bounded numbers.
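To illustrate why literal scoring is stricter, here is a minimal sketch of word-level WER computed with and without text normalization. It assumes "literal" means references and hypotheses are compared as-is; the `normalize` function (lowercasing plus ASCII punctuation stripping) is an illustrative choice, not the scoring script used for this card.

```python
import string


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance divided by the number of reference words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def normalize(text: str) -> str:
    """Lowercase and strip ASCII punctuation (illustrative normalization only)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


ref = "Hello, world. How are you?"
hyp = "hello world how are you"

print("literal WER:   ", wer(ref, hyp))
print("normalized WER:", wer(normalize(ref), normalize(hyp)))
```

In this toy pair every word except "are" differs only in casing or punctuation, so the literal score is far higher than the normalized one. The same effect makes literal WER a conservative figure when comparing against leaderboards that normalize text first.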
Usage
Requires NeMo (pip install "nemo_toolkit[asr]"; the quotes keep shells such as zsh from expanding the brackets).
from huggingface_hub import hf_hub_download
from nemo.collections.asr.models import EncDecMultiTaskModel
# Download the stage-3 checkpoint from the Hub, then restore it with NeMo.
ckpt = hf_hub_download("KIEFERSA/Sophea-Canary-ASR",
                       "canary1bv2_el_stage3.nemo")
model = EncDecMultiTaskModel.restore_from(ckpt)
# Greek ASR
print(model.transcribe(["greek_audio.wav"], source_lang="el", target_lang="el", pnc="yes"))
# English ASR
print(model.transcribe(["english_audio.wav"], source_lang="en", target_lang="en", pnc="yes"))
# Greek -> English speech translation
print(model.transcribe(["greek_audio.wav"], source_lang="el", target_lang="en", pnc="yes"))
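Depending on the installed NeMo version, `transcribe` may return plain strings or hypothesis objects that expose the transcript through a `.text` attribute; which one you get is an assumption about your environment, not something this card states. A small helper that tolerates both is sketched below. `FakeHypothesis` is a hypothetical stand-in used only to make the example runnable without NeMo.

```python
from dataclasses import dataclass


@dataclass
class FakeHypothesis:
    """Hypothetical stand-in for a hypothesis object; used only in this demo."""
    text: str


def to_text(result) -> str:
    """Return the transcript from a plain str or from an object with a `.text` attribute."""
    return result if isinstance(result, str) else result.text


# Mixed inputs, as either return style might produce them.
outputs = ["γεια σας", FakeHypothesis(text="hello there")]
print([to_text(o) for o in outputs])
```

With a loaded model, the same helper would be applied to each element of the list returned by `model.transcribe(...)`.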
Prompt-sensitivity note. This model is fine-tuned with the canary2 prompt and
pnc=yes. Pass source_lang/target_lang explicitly (as above) and keep pnc=yes
for results matching the reported metrics. Audio should be 16 kHz mono.
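Before transcribing, it can help to confirm that a file really is 16 kHz mono. The sketch below reads the header with Python's standard `wave` module, which only handles PCM WAV files; the function name `check_wav` is illustrative.

```python
import wave


def check_wav(path: str, expected_rate: int = 16000, expected_channels: int = 1) -> list[str]:
    """Return a list of format problems found in a PCM WAV header (empty list = OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate, channels = wav.getframerate(), wav.getnchannels()
        if rate != expected_rate:
            problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
        if channels != expected_channels:
            problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

Calling `check_wav("greek_audio.wav")` returns an empty list when the file matches; otherwise, resample or downmix with an audio tool of your choice before passing the file to the model.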
Training
- Base: nvidia/canary-1b-v2 (1 B, offline AED, FastConformer encoder).
- Method: full fine-tune via a custom NeMo launcher that restores the pretrained model and swaps the dataset (avoiding the tokenizer rebuild ..), with a 3-stage warm-start chain (each stage initialized from the previous endpoint):
  - stage-1: ~250 h Greek ASR (FLEURS + Common Voice + YODAS + TEDx + targeted TTS).
  - stage-2: + ~80 h clean Greek (medical, legal, parliamentary) -> 330 h.
  - stage-3 (this model): + 30 h English ASR (LibriSpeech-clean) -> 360 h.
- Optimizer: AdamW, cosine schedule, lr ~8e-6 (stage-3), bf16, Lhotse dynamic batching.
- Checkpoint selection: best Greek val_wer (0.0577 at stage-3).
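For readers unfamiliar with the schedule, the following is a minimal sketch of plain cosine annealing from the stage-3 learning rate. The card does not state the warmup length, the minimum learning rate, or the number of steps, so `min_lr` and `total_steps` below are assumptions, and NeMo's own scheduler may differ in detail.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 8e-6, min_lr: float = 0.0) -> float:
    """Cosine annealing from peak_lr at step 0 down to min_lr at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 10_000  # hypothetical step count, not taken from the card
for step in (0, total // 2, total):
    print(step, f"{cosine_lr(step, total):.2e}")
```

The rate starts at the peak, passes through the midpoint between `peak_lr` and `min_lr` halfway through training, and ends at `min_lr`.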
Data
| Source | Role | Notes |
|---|---|---|
| FLEURS (el / en) | Greek + English ASR | clean read speech |
| Common Voice 17 (el) | Greek ASR | spontaneous, multi-speaker |
| YODAS (el) | Greek ASR | YouTube subtitles, quality-filtered |
| TEDx (el) | Greek ASR | real talks |
| TTS-synthetic (el) | Greek ASR | Wikipedia/general domain, ~10 % of mix |
| Medical / legal (el) | Greek ASR | domain coverage |
| LibriSpeech (en) | English ASR | retention |
Limitations
- Offline only. Full-context AED — not for streaming/real-time use.
- Translation ceiling. Greek->English BLEU (~24) is inherited from the base; an explicit AST fine-tune on machine-translated data did not improve it (you cannot out-train your labels). Real parallel data would be needed to push past this.
- Domain bias. Strongest on read speech (FLEURS) and Common Voice; other domains (e.g. heavy dialect, far-field, overlapping speech) are untested.
- Common Voice over-specialization. The validation mix is CV-heavy, so CV WER (2.9 %) is partly in-distribution; FLEURS (12.7 %) is the more conservative real-world estimate.
License
Released under CC-BY-4.0, inheriting the license of the base model
nvidia/canary-1b-v2.
Citation
If you use this model, please credit this repository and the base model:
@misc{kiefer2026canarygreek,
title = {Sophea-Canary-ASR: a bilingual Greek+English multitask speech model},
author = {Kirouane, Ayoub},
year = {2026},
howpublished = {Hugging Face, KIEFERSA},
note = {Fine-tuned from nvidia/canary-1b-v2}
}

