Crane Nemo ASR 1.2

A streaming automatic speech recognition model for Luganda, Shona, and Swahili (with English retained), fine-tuned from nvidia/nemotron-3.5-asr-streaming-0.6b — a FastConformer Cache-Aware RNN-Transducer (~600M parameters). By CraneAI Labs.

The model transcribes conversational and read speech in real time (cache-aware streaming) and is conditioned on a language-ID prompt.

What's new in 1.2: more training data from the sources we already had. Earlier versions dropped every training clip longer than 20 s (long conversational monologues) because the trainer cannot fit them. Version 1.2 recovers them: each long clip is force-aligned with the MMS aligner, sliced into 3–10 s utterances, and folded back into training, adding about 56 hours of Luganda and Shona speech with no new data sources. Retraining on the full set (12k steps) gives a clear gain on Luganda (WER 20.9% → 17.9%) and, via cross-lingual transfer, on Swahili (25.1% → 22.3%, even though Swahili received no new audio). The trade-off is that English retention regressed (WER 9.1% → 11.9%) as the mixture tilted toward the African languages; see the "Change from 1.1" table below.

Results

Evaluated on a frozen, held-out set of 50 clips per language (excluded from training by both utterance ID and speaker ID). WER = word error rate, CER = character error rate (both under a shared normalization: NFKC, lowercased, punctuation-stripped, whitespace-collapsed), and Gemini = a meaning-preservation score from 1–5 (LLM-as-judge), which better reflects intelligibility for morphologically rich Bantu languages where word-level WER is harsh.

Language	CER ↓	WER ↓	Gemini (1–5) ↑
Luganda (`lg`)	4.3%	17.9%	4.95
Shona (`sn`)	5.9%	26.6%	4.55
Swahili (`sw`)	6.8%	22.3%	4.53
English (`en`)	5.5%	11.9%	4.60

Change from 1.1:

Language	CER (1.1 → 1.2)	WER (1.1 → 1.2)
Luganda	5.1% → 4.3%	20.9% → 17.9%
Swahili	7.3% → 6.8%	25.1% → 22.3%
Shona	6.0% → 5.9%	27.3% → 26.6%
English	3.6% → 5.5%	9.1% → 11.9%

Luganda and Swahili improve clearly; Shona improves slightly; English (the retention anchor, not a target language) regresses but stays usable, with its meaning-preservation score essentially unchanged (4.65 → 4.60). If English performance is critical for your use case, 1.1 may be the better choice.
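To put the absolute changes on a common scale, the relative WER change per language can be computed directly from the table above. The sketch below is only arithmetic on the published figures; the `WER` dictionary and the `relative_change` helper are illustrative names, not part of any release.

```python
# WER (%) for versions 1.1 and 1.2, copied from the "Change from 1.1" table.
WER = {
    "Luganda": (20.9, 17.9),
    "Swahili": (25.1, 22.3),
    "Shona": (27.3, 26.6),
    "English": (9.1, 11.9),
}


def relative_change(old: float, new: float) -> float:
    """Relative change in percent; negative means the error rate went down."""
    return (new - old) / old * 100.0


for language, (old, new) in WER.items():
    print(f"{language}: {relative_change(old, new):+.1f}%")
```

On these figures, Luganda's WER falls by roughly 14% relative and Swahili's by about 11%, Shona's by under 3%, while English's rises by about 31%.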

Starting point (before fine-tuning). On these African languages the base model is effectively unusable — roughly 100% WER. Fine-tuning brings them to the figures above.

How WER/CER are calculated. Predictions and references pass through one shared normalizer (NFKC, lowercasing, punctuation removal, whitespace collapsing) and are scored with jiwer, aggregated across the whole eval set (total edits ÷ total reference units), not averaged per clip.
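The scoring procedure described above can be sketched without dependencies. This is a minimal stand-in for illustration, not the evaluation script itself: the card scores with jiwer, whereas this sketch uses a plain Levenshtein distance. Treating "punctuation" as every Unicode category starting with `P`, and counting spaces as characters for CER, are both assumptions; all function names are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFKC, lowercase, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Assumption: punctuation = any Unicode category beginning with "P".
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rate(refs: list, hyps: list, unit: str = "word") -> float:
    """Total edits divided by total reference units over the whole set."""
    total_edits = 0
    total_units = 0
    for ref, hyp in zip(refs, hyps):
        ref_n, hyp_n = normalize(ref), normalize(hyp)
        if unit == "word":
            r, h = ref_n.split(), hyp_n.split()
        else:  # "char": characters of the normalized string, spaces included
            r, h = list(ref_n), list(hyp_n)
        total_edits += edit_distance(r, h)
        total_units += len(r)
    if total_units == 0:
        raise ValueError("reference set is empty after normalization")
    return total_edits / total_units


refs = ["Hello, world!", "a b c d"]
hyps = ["hello word", "a b c d"]
print("WER:", corpus_error_rate(refs, hyps, "word"))
print("CER:", corpus_error_rate(refs, hyps, "char"))
```

With these two toy clips there is one word error against six reference words, so the corpus-level WER is 1/6. Averaging per clip would instead give (0.5 + 0) / 2 = 0.25, which is why the aggregation choice matters when clip lengths vary.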

Note on metrics: CER and the Gemini meaning-score are the most faithful quality signals for these languages. WER runs higher than CER largely because rich agglutinative morphology and word-boundary conventions penalize the word-level metric even when the transcription is intelligible.

Training

Method: full fine-tune — the entire encoder (all 24 layers) trained together with the prediction and joint networks (~633M trainable parameters, 99% of the model).
Optimizer: 8-bit AdamW, learning rate 3.46e-4 (selected by a coarse-to-fine sweep), cosine schedule with 200 warmup steps, 12,000 steps.
Precision: bf16-mixed. Hardware: a single NVIDIA L40 (48 GB).
Data recovery: long clips (>20 s) that the trainer drops were force-aligned with the MMS aligner (torchaudio.pipelines.MMS_FA) and re-segmented into 3–10 s utterances (virtual offset/duration slices), adding ~56 h of Luganda/Shona speech from the existing sources.
Decoding/inference: cache-aware streaming, fp32, attention context [56, 13] (~1.12 s lookahead).
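The data-recovery step can be illustrated with a short sketch. The card does not describe the exact slicing policy, so the greedy grouping below is an assumption, as are the `Word` dataclass and the `slice_clip` function; the word timestamps are taken as given (for example, from a forced aligner). The output keys are modelled on the JSON-lines manifest convention used by NeMo, where a slice is a virtual offset/duration window into the original file rather than a new audio file.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the long clip
    end: float


def slice_clip(words: list, audio_path: str,
               min_s: float = 3.0, max_s: float = 10.0) -> list:
    """Greedily group aligned words into slices of at most max_s seconds."""
    segments, current = [], []
    for w in words:
        # Close the current slice if this word would push it past max_s.
        if current and w.end - current[0].start > max_s:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)

    entries = []
    for seg in segments:
        offset = seg[0].start
        duration = seg[-1].end - offset
        # Assumption: slices outside the 3-10 s window are simply discarded.
        if min_s <= duration <= max_s:
            entries.append({
                "audio_filepath": audio_path,
                "offset": round(offset, 3),
                "duration": round(duration, 3),
                "text": " ".join(w.text for w in seg),
            })
    return entries


# Toy alignment: 25 words, one per second, each lasting 0.8 s.
words = [Word(f"w{i}", float(i), i + 0.8) for i in range(25)]
for entry in slice_clip(words, "long_clip.wav"):
    print(entry["offset"], entry["duration"])
```

A production pipeline would more likely cut at pauses or sentence boundaries and merge short leftovers instead of dropping them; the greedy rule here is only meant to show how slices map to offset/duration pairs.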

Training data

Language	Approx. hours
Luganda	~32h (incl. ~19h recovered long-form)
Shona	~54h (incl. ~38h recovered long-form)
Swahili	~9h
English	~2h (retention)

Data is read and conversational speech from openly available African-language speech corpora; a small English slice is included to retain English performance.

Intended use

Transcription of Luganda, Shona, Swahili, and English speech, including low-latency streaming applications.
A strong, deployable model for these languages and a starting point for further fine-tuning.

Limitations

Results are reported on 50 clips per language — directional, not a large-scale benchmark.
Cache-aware streaming inference is fp32-only.
Best results come with the language-ID prompt set to auto-detect (target_lang=auto). Luganda and Shona have no dedicated prompt slot, so auto-detect is the recommended setting for them.
English retention regressed vs 1.1 (WER 9.1% → 11.9%) — the cost of the African-language-heavy training mixture. Use 1.1 if English is a priority.
Much of the new training audio is segmented long-form conversational speech; very out-of-domain acoustics (heavy noise, far-field mics, dense code-switching) are still only partially handled.
Not validated for safety-critical or medical/legal transcription.

Usage

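```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned checkpoint from a local .nemo file.
model = nemo_asr.models.ASRModel.restore_from("crane-nemo-asr-1.2.nemo")

# Transcribe a list of audio files; returns one result per file.
transcripts = model.transcribe(["audio.wav"])
print(transcripts)
```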

For cache-aware streaming inference, use NeMo's examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py with compute_dtype=float32, att_context_size=[56,13], and target_lang=auto.

License

This model inherits the NVIDIA Open Model License from its base model.
