Crane Nemo ASR 1.2

A streaming automatic speech recognition model for Luganda, Shona, and Swahili (with English retained), fine-tuned from nvidia/nemotron-3.5-asr-streaming-0.6b β€” a FastConformer Cache-Aware RNN-Transducer (~600M parameters). By CraneAI Labs.

The model transcribes conversational and read speech in real time (cache-aware streaming) and is conditioned on a language-ID prompt.

What's new in 1.2 β€” more training data, from the data we already had. Earlier versions dropped every training clip longer than 20 s (long conversational monologues), because the trainer can't fit them. 1.2 recovers them: each long clip is force-aligned (MMS aligner) and sliced into 3–10 s utterances, then folded back into training β€” about 56 additional hours of Luganda/Shona speech with no new data sources. Retrained on the full set (12k steps), this gives a clear gain on Luganda (WER 20.9% β†’ 17.9%) and, via cross-lingual transfer, on Swahili (25.1% β†’ 22.3%, even though Swahili received no new audio). The trade-off: English retention regressed slightly (WER 9.1% β†’ 11.9%) as the mixture tilted toward the African languages β€” see the table below.

Results

Evaluated on a frozen, held-out set of 50 clips per language (excluded from training by both utterance ID and speaker ID). WER = word error rate, CER = character error rate (both under a shared normalization: NFKC, lowercased, punctuation-stripped, whitespace-collapsed), and Gemini = a meaning-preservation score from 1–5 (LLM-as-judge), which better reflects intelligibility for morphologically rich Bantu languages where word-level WER is harsh.

Language CER ↓ WER ↓ Gemini (1–5) ↑
Luganda (lg) 4.3% 17.9% 4.95
Shona (sn) 5.9% 26.6% 4.55
Swahili (sw) 6.8% 22.3% 4.53
English (en) 5.5% 11.9% 4.60

Change from 1.1:

Language CER (1.1 β†’ 1.2) WER (1.1 β†’ 1.2)
Luganda 5.1% β†’ 4.3% 20.9% β†’ 17.9%
Swahili 7.3% β†’ 6.8% 25.1% β†’ 22.3%
Shona 6.0% β†’ 5.9% 27.3% β†’ 26.6%
English 3.6% β†’ 5.5% 9.1% β†’ 11.9%

Luganda and Swahili improve clearly; Shona improves slightly; English (the retention anchor, not a target language) regresses but stays usable, with its meaning-preservation score essentially unchanged (4.65 β†’ 4.60). If English performance is critical for your use case, 1.1 may be the better choice.

Starting point (before fine-tuning). On these African languages the base model is effectively unusable β€” roughly 100% WER. Fine-tuning brings them to the figures above.

How WER/CER are calculated. Predictions and references pass through one shared normalizer (NFKC, lowercasing, punctuation removal, whitespace collapsing) and are scored with jiwer, aggregated across the whole eval set (total edits Γ· total reference units), not averaged per clip.

Note on metrics: CER and the Gemini meaning-score are the most faithful quality signals for these languages. WER runs higher than CER largely because rich agglutinative morphology and word-boundary conventions penalize the word-level metric even when the transcription is intelligible.

Training

  • Method: full fine-tune β€” the entire encoder (all 24 layers) trained together with the prediction and joint networks (~633M trainable parameters, 99% of the model).
  • Optimizer: 8-bit AdamW, learning rate 3.46e-4 (selected by a coarse-to-fine sweep), cosine schedule with 200 warmup steps, 12,000 steps.
  • Precision: bf16-mixed. Hardware: a single NVIDIA L40 (48 GB).
  • Data recovery: long clips (>20 s) that the trainer drops were force-aligned with the MMS aligner (torchaudio.pipelines.MMS_FA) and re-segmented into 3–10 s utterances (virtual offset/duration slices), adding ~56 h of Luganda/Shona speech from the existing sources.
  • Decoding/inference: cache-aware streaming, fp32, attention context [56, 13] (~1.12 s lookahead).

Training data

Language Approx. hours
Luganda ~32h (incl. ~19h recovered long-form)
Shona ~54h (incl. ~38h recovered long-form)
Swahili ~9h
English ~2h (retention)

Data is read and conversational speech from openly available African-language speech corpora; a small English slice is included to retain English performance.

Intended use

  • Transcription of Luganda, Shona, Swahili, and English speech, including low-latency streaming applications.
  • A strong, deployable model for these languages and a starting point for further fine-tuning.

Limitations

  • Results are reported on 50 clips per language β€” directional, not a large-scale benchmark.
  • Cache-aware streaming inference is fp32-only.
  • Best results use the language-ID prompt set to auto-detect; Luganda and Shona have no dedicated prompt slot, so auto-detect is recommended for them.
  • English retention regressed vs 1.1 (WER 9.1% β†’ 11.9%) β€” the cost of the African-language-heavy training mixture. Use 1.1 if English is a priority.
  • Much of the new training audio is segmented long-form conversational speech; very out-of-domain acoustics (heavy noise, far-field mics, dense code-switching) are still only partially handled.
  • Not validated for safety-critical or medical/legal transcription.

Usage

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("crane-nemo-asr-1.2.nemo")
transcripts = model.transcribe(["audio.wav"])
print(transcripts)

For cache-aware streaming inference, use NeMo's examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py with compute_dtype=float32, att_context_size=[56,13], and target_lang=auto.

License

This model inherits the NVIDIA Open Model License from its base model.

Downloads last month
7
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for CraneAILabs/crane-nemo-asr-1.2

Finetuned
(28)
this model