Sophea-Canary-ASR

A bilingual (Greek + English), multitask speech model fine-tuned from nvidia/canary-1b-v2 (1 B, offline attention-encoder-decoder, FastConformer). One model does four jobs:

Task	Test set	Score
Greek ASR	FLEURS (el)	12.69 % WER
Greek ASR	Common Voice 17 (el)	2.90 % WER
English ASR	FLEURS (en)	12.90 % WER
Greek -> English speech translation	FLEURS (el->en)	BLEU 24.13

It also handles medical and legal Greek domains (in the fine-tuning mix). This is the offline, high-accuracy model of a two-model Greek ASR system; a separate streaming 0.6 B model serves the real-time path.

Highlights

Halves Greek WER vs the streaming baseline (25.5 % -> 12.7 % FLEURS) and drives Common Voice to 2.9 %.
Bilingual for free. Despite fine-tuning on ~360 h of mostly-Greek data, English ASR held at 12.9 % (base canary-1b-v2: 12.0 %) — Canary's source_lang/target_lang prompt-conditioning preserves the English pathway (no catastrophic forgetting).
Translation for free. Greek->English speech translation (BLEU 24.1) was retained from the base through ASR-only fine-tuning — the prompt selects the task, so the AST pathway survives.

Results in context

	Streaming 0.6 B (real-time)	This model (offline 1 B)
Greek FLEURS WER	25.5 %	12.7 %
Greek Common Voice WER	11.5 %	2.9 %
English FLEURS WER	—	12.9 %
Greek->English	—	BLEU 24

All numbers are held-out test sets scored with literal WER (casing + punctuation). Note this is an offline (full-context) model — its WER is not directly comparable to a streaming model's latency-bounded numbers.

Usage

Requires NeMo (pip install nemo_toolkit[asr]).

from nemo.collections.asr.models import EncDecMultiTaskModel
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download("KIEFERSA/KIEFERSA/Sophea-Canary-ASR",
                       "canary1bv2_el_stage3.nemo")
model = EncDecMultiTaskModel.restore_from(ckpt)

# Greek ASR
print(model.transcribe(["greek_audio.wav"], source_lang="el", target_lang="el", pnc="yes"))

# English ASR
print(model.transcribe(["english_audio.wav"], source_lang="en", target_lang="en", pnc="yes"))

# Greek -> English speech translation
print(model.transcribe(["greek_audio.wav"], source_lang="el", target_lang="en", pnc="yes"))

Prompt-sensitivity note. This model is fine-tuned with the canary2 prompt and pnc=yes. Pass source_lang/target_lang explicitly (as above) and keep pnc=yes for results matching the reported metrics. Audio should be 16 kHz mono.

Training

Base: nvidia/canary-1b-v2 (1 B, offline AED, FastConformer encoder).
Method: full fine-tune via a custom NeMo launcher that restores the pretrained model and swaps the dataset (avoiding the tokenizer rebuild ..), with a 3-stage warm-start chain (each stage initialized from the previous endpoint):
1. stage-1 — ~250 h Greek ASR (FLEURS + Common Voice + YODAS + TEDx + targeted TTS).
2. stage-2 — + ~80 h clean Greek (medical, legal, parliamentary) -> 330 h.
3. stage-3 (this model) — + 30 h English ASR (LibriSpeech-clean) -> 360 h.
Optimizer: AdamW, cosine schedule, lr ~8e-6 (stage-3), bf16, Lhotse dynamic batching.
Checkpoint selection: best Greek val_wer (0.0577 at stage-3).

Data

Source	Role	Notes
FLEURS (el / en)	Greek + English ASR	clean read speech
Common Voice 17 (el)	Greek ASR	spontaneous, multi-speaker
YODAS (el)	Greek ASR	YouTube subtitles, quality-filtered
TEDx (el)	Greek ASR	real talks
TTS-synthetic (el)	Greek ASR	Wikipedia/general domain, ~10 % of mix
Medical / legal (el)	Greek ASR	domain coverage
LibriSpeech (en)	English ASR	retention

Limitations

Offline only. Full-context AED — not for streaming/real-time use.
Translation ceiling. Greek->English BLEU (~24) is inherited from the base; an explicit AST fine-tune on machine-translated data did not improve it (you cannot out-train your labels). Real parallel data would be needed to push past this.
Domain bias. Strongest on read speech (FLEURS) and Common Voice; other domains (e.g. heavy dialect, far-field, overlapping speech) are untested.
Common Voice over-specialization. The validation mix is CV-heavy, so CV WER (2.9 %) is partly in-distribution; FLEURS (12.7 %) is the more conservative real-world estimate.

License

Released under CC-BY-4.0, inheriting the license of the base model nvidia/canary-1b-v2.

Citation

If you use this model, please credit this repository and the base model:

@misc{kiefer2026canarygreek,
  title  = {Sophea-Canary-ASR: a bilingual Greek+English multitask speech model},
  author = {Kirouane, Ayoub},
  year   = {2026},
  howpublished = {Hugging Face, KIEFERSA},
  note   = {Fine-tuned from nvidia/canary-1b-v2}
}

Downloads last month: 6

Model tree for KIEFERSA/Sophea-Canary-ASR

Base model

nvidia/canary-1b-v2

Finetuned

(9)

this model

Datasets used to train KIEFERSA/Sophea-Canary-ASR

Evaluation results

Test WER (el) on FLEURS (Greek)
test set self-reported

12.690
Test WER (el) on Common Voice 17.0 (Greek)
test set self-reported

2.900
Test WER (en) on FLEURS (English)
test set self-reported

12.900
Test BLEU (el->en) on FLEURS (Greek to English)
test set self-reported

24.130