Sophea-Canary-ASR

A bilingual (Greek + English), multitask speech model fine-tuned from nvidia/canary-1b-v2 (1 B, offline attention-encoder-decoder, FastConformer). One model does four jobs:

| Task | Test set | Score |
| --- | --- | --- |
| Greek ASR | FLEURS (el) | 12.69 % WER |
| Greek ASR | Common Voice 17 (el) | 2.90 % WER |
| English ASR | FLEURS (en) | 12.90 % WER |
| Greek -> English speech translation | FLEURS (el->en) | BLEU 24.13 |

It also covers medical and legal Greek, both of which are part of the fine-tuning mix. This is the offline, high-accuracy model of a two-model Greek ASR system; a separate streaming 0.6 B model serves the real-time path.

Multitask results

Highlights

  • Halves Greek WER vs the streaming baseline (25.5 % -> 12.7 % FLEURS) and drives Common Voice to 2.9 %.
  • Bilingual for free. Despite fine-tuning on ~360 h of mostly-Greek data, English ASR held at 12.9 % (base canary-1b-v2: 12.0 %) — Canary's source_lang/target_lang prompt-conditioning preserves the English pathway (no catastrophic forgetting).
  • Translation for free. Greek->English speech translation (BLEU 24.1) was retained from the base through ASR-only fine-tuning — the prompt selects the task, so the AST pathway survives.
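The prompt-selects-the-task idea from the highlights can be sketched as a small lookup. This is an illustration only: `canary_task` and `REPORTED_PAIRS` are hypothetical names, not part of NeMo, and the supported set is assumed to be the three language pairs this card reports scores for.

```python
# Hypothetical helper (not a NeMo API): names the task implied by a
# (source_lang, target_lang) prompt pair.
# Assumption: only the three pairs with reported scores are listed.
REPORTED_PAIRS = {("el", "el"), ("en", "en"), ("el", "en")}


def canary_task(source_lang: str, target_lang: str) -> str:
    """Return "asr" when both languages match, "translation" otherwise."""
    pair = (source_lang, target_lang)
    if pair not in REPORTED_PAIRS:
        raise ValueError(f"no reported score for prompt pair {pair}")
    return "asr" if source_lang == target_lang else "translation"


for pair in sorted(REPORTED_PAIRS):
    print(pair, "->", canary_task(*pair))
```

The same audio file can therefore go through either pathway; only the language pair in the prompt changes.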

Results in context

Streaming vs offline

| Metric | Streaming 0.6 B (real-time) | This model (offline 1 B) |
| --- | --- | --- |
| Greek FLEURS WER | 25.5 % | 12.7 % |
| Greek Common Voice WER | 11.5 % | 2.9 % |
| English FLEURS WER | not reported | 12.9 % |
| Greek->English BLEU | not reported | 24 |
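To make the "halves Greek WER" claim concrete, the relative reduction can be computed from the two Greek rows of the comparison. This is a minimal sketch; `relative_reduction` is an illustrative helper name, and the inputs are the rounded figures from this card.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# (test set, streaming WER %, offline WER %) as reported in this card.
rows = [
    ("Greek FLEURS", 25.5, 12.7),
    ("Greek Common Voice", 11.5, 2.9),
]

for name, streaming, offline in rows:
    print(f"{name}: {relative_reduction(streaming, offline):.1%} relative WER reduction")
# Greek FLEURS: 50.2% relative WER reduction
# Greek Common Voice: 74.8% relative WER reduction
```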

All numbers are measured on held-out test sets and scored with literal WER (casing and punctuation are kept). Note that this is an offline (full-context) model, so its WER is not directly comparable to a streaming model's latency-bounded numbers.
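The sketch below shows why literal scoring is stricter than normalized scoring. It assumes "literal" means the reference and hypothesis are compared without lowercasing or punctuation stripping; this is not the evaluation script used for the reported numbers, and the `normalize` function is a simplified stand-in that only strips ASCII punctuation.

```python
import string


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Lowercase and drop ASCII punctuation (simplified; misses e.g. the Greek ano teleia)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference: str, hypothesis: str, literal: bool = True) -> float:
    """WER = word edit distance / reference length; literal=True skips normalization."""
    if not literal:
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


ref = "Hello, world. How are you?"
hyp = "hello world how are you"
print(wer(ref, hyp, literal=True))   # 0.8: four of five words differ in case or punctuation
print(wer(ref, hyp, literal=False))  # 0.0: identical after normalization
```

A hypothesis with every word right can still score poorly under literal WER, which is worth keeping in mind when comparing these numbers with normalized WER reported elsewhere.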

Usage

Requires NeMo (pip install nemo_toolkit[asr]).

from nemo.collections.asr.models import EncDecMultiTaskModel
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download("KIEFERSA/Sophea-Canary-ASR",
                       "canary1bv2_el_stage3.nemo")
model = EncDecMultiTaskModel.restore_from(ckpt)

# Greek ASR
print(model.transcribe(["greek_audio.wav"], source_lang="el", target_lang="el", pnc="yes"))

# English ASR
print(model.transcribe(["english_audio.wav"], source_lang="en", target_lang="en", pnc="yes"))

# Greek -> English speech translation
print(model.transcribe(["greek_audio.wav"], source_lang="el", target_lang="en", pnc="yes"))

Prompt-sensitivity note. This model is fine-tuned with the canary2 prompt and pnc=yes. Pass source_lang/target_lang explicitly (as above) and keep pnc=yes for results matching the reported metrics. Audio should be 16 kHz mono.
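Since the model expects 16 kHz mono input, it can help to check a file before transcribing it. The following is a minimal sketch using only the Python standard library; `check_wav_format` is a hypothetical helper, and it only handles uncompressed PCM WAV files (other formats would need a library such as soundfile or ffmpeg).

```python
import wave


def check_wav_format(path: str, expected_rate: int = 16000,
                     expected_channels: int = 1) -> list[str]:
    """Return a list of format problems; an empty list means the file looks fine."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()

    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"file has {channels} channel(s), expected {expected_channels}")
    return problems
```

A typical use would be to call `check_wav_format("greek_audio.wav")` and resample or downmix the file when the returned list is not empty.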

Training

  • Base: nvidia/canary-1b-v2 (1 B, offline AED, FastConformer encoder).
  • Method: full fine-tune via a custom NeMo launcher that restores the pretrained model and swaps the dataset (avoiding the tokenizer rebuild ..), with a 3-stage warm-start chain (each stage initialized from the previous endpoint):
    1. stage-1 — ~250 h Greek ASR (FLEURS + Common Voice + YODAS + TEDx + targeted TTS).
    2. stage-2 — + ~80 h clean Greek (medical, legal, parliamentary) -> 330 h.
    3. stage-3 (this model) — + 30 h English ASR (LibriSpeech-clean) -> 360 h.
  • Optimizer: AdamW, cosine schedule, lr ~8e-6 (stage-3), bf16, Lhotse dynamic batching.
  • Checkpoint selection: best Greek val_wer (0.0577 at stage-3).
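The warm-start chain above can be summarized as a small data structure that tracks how the training mix grows from stage to stage. This is a bookkeeping sketch only, not the custom NeMo launcher; the `Stage` class and `cumulative_hours` function are illustrative names, and the hour counts are the approximate figures from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    added_hours: int   # approximate hours added at this stage
    added_data: str


# Each stage starts from the previous stage's endpoint and adds data to the mix.
STAGES = [
    Stage("stage-1", 250, "Greek ASR (FLEURS, Common Voice, YODAS, TEDx, targeted TTS)"),
    Stage("stage-2", 80, "clean Greek (medical, legal, parliamentary)"),
    Stage("stage-3", 30, "English ASR (LibriSpeech-clean)"),
]


def cumulative_hours(stages: list[Stage]) -> list[tuple[str, int]]:
    """Running total of training hours after each stage."""
    total = 0
    totals = []
    for stage in stages:
        total += stage.added_hours
        totals.append((stage.name, total))
    return totals


for name, hours in cumulative_hours(STAGES):
    print(f"{name}: ~{hours} h")
# stage-1: ~250 h
# stage-2: ~330 h
# stage-3: ~360 h
```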

Data

| Source | Role | Notes |
| --- | --- | --- |
| FLEURS (el / en) | Greek + English ASR | clean read speech |
| Common Voice 17 (el) | Greek ASR | spontaneous, multi-speaker |
| YODAS (el) | Greek ASR | YouTube subtitles, quality-filtered |
| TEDx (el) | Greek ASR | real talks |
| TTS-synthetic (el) | Greek ASR | Wikipedia/general domain, ~10 % of mix |
| Medical / legal (el) | Greek ASR | domain coverage |
| LibriSpeech (en) | English ASR | retention |

Limitations

  • Offline only. Full-context AED — not for streaming/real-time use.
  • Translation ceiling. Greek->English BLEU (~24) is inherited from the base; an explicit AST fine-tune on machine-translated data did not improve it (you cannot out-train your labels). Real parallel data would be needed to push past this.
  • Domain bias. Strongest on read speech (FLEURS) and Common Voice; other domains (e.g. heavy dialect, far-field, overlapping speech) are untested.
  • Common Voice over-specialization. The validation mix is CV-heavy, so CV WER (2.9 %) is partly in-distribution; FLEURS (12.7 %) is the more conservative real-world estimate.

License

Released under CC-BY-4.0, inheriting the license of the base model nvidia/canary-1b-v2.

Citation

If you use this model, please credit this repository and the base model:

@misc{kiefer2026canarygreek,
  title  = {Sophea-Canary-ASR: a bilingual Greek+English multitask speech model},
  author = {Kirouane, Ayoub},
  year   = {2026},
  howpublished = {Hugging Face, KIEFERSA},
  note   = {Fine-tuned from nvidia/canary-1b-v2}
}