Soloni Be Kalan (TDT-CTC 114M)
soloni-be-kalan-v0 is a domain-specific fine-tuned version of RobotsMali/soloni-114m-tdt-ctc-v2. This model was adapted specifically for Bambara educational materials and child speech applications. The model was fine-tuned using NVIDIA NeMo and supports both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding.
🚨 Important Note
This model, along with its associated resources, is part of an ongoing research effort by the RobotsMali AI4D Lab. Users should be aware that:
* Early Childhood Performance Gap: While this model significantly reduces the baseline error rate on early childhood speech (<10 years) from 56% down to 29% WER, physiological features unique to young children (unformed acoustic profiles, erratic speech rates) continue to present an out-of-domain challenge compared to older cohorts.
* Structural Dependencies: The model performs very well on fluid, sequential storytelling text (as low as 7% WER), but it is structurally limited on short, highly repetitive texts with sparse token sequences (e.g., Kuloriw or Jate), where language model prior biases dominate.
NVIDIA NeMo: Training
To fine-tune or run inference with this model, you will need to install NVIDIA NeMo. We recommend installing it alongside a compatible PyTorch environment.
pip install "nemo-toolkit[asr]"
How to Use This Model
Load Model with NeMo
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="RobotsMali/soloni-be-kalan-v0")
Transcribe Audio
# Accepts 16 kHz mono-channel WAV files (if your audio is not 16 kHz, the preprocessor resamples it, so no manual resampling is needed)
asr_model.transcribe(['sample_child_reading.wav'])
If you encounter a RuntimeError: CUDA error: invalid argument due to GPU compatibility with CUDA Graphs in the TDT decoder, disable it in your configuration before transcribing:
decoding_cfg = asr_model.cfg.decoding
decoding_cfg.greedy.use_cuda_graph_decoder = False
asr_model.change_decoding_strategy(decoding_cfg=decoding_cfg)
Model Architecture
This model utilizes the Hybrid FastConformer-TDT-CTC architecture with 114 million parameters. FastConformer optimizes the standard Conformer model with 8x depthwise-separable convolutional downsampling. It features two independent but jointly trained decoders: an auto-regressive TDT decoder (default branch) and a convolutional decoder optimized via CTC loss.
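Because both decoders share one encoder, the decoding branch can be chosen at inference time. The following is a minimal sketch, not the authors' own script. It assumes the `decoder_type` argument of `change_decoding_strategy` that NeMo's hybrid models expose, where `"rnnt"` selects the TDT branch and `"ctc"` selects the CTC branch. The helper names `hypotheses_to_text` and `transcribe_with_branch` are hypothetical. The helper also normalizes the return value of `transcribe`, which differs across NeMo releases.

```python
def hypotheses_to_text(result):
    """Normalize the return value of transcribe() to a list of strings."""
    # Some older NeMo releases return a (best_hypotheses, all_hypotheses) tuple
    # for transducer models; keep only the best hypotheses in that case.
    if isinstance(result, tuple):
        result = result[0]
    # Newer releases return Hypothesis objects with a .text attribute,
    # older ones return plain strings.
    return [getattr(item, "text", item) for item in result]


def transcribe_with_branch(audio_paths, branch="rnnt",
                           model_name="RobotsMali/soloni-be-kalan-v0"):
    """Transcribe files with the TDT ("rnnt") or the CTC ("ctc") decoder."""
    # Imported lazily so this module can be loaded without NeMo installed.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
        model_name=model_name
    )
    # Assumption: hybrid models accept decoder_type to pick the active branch.
    asr_model.change_decoding_strategy(decoder_type=branch)
    return hypotheses_to_text(asr_model.transcribe(audio_paths))
```

Calling `transcribe_with_branch(["sample_child_reading.wav"], branch="ctc")` would then decode with the CTC branch, which is the branch reported in the evaluation table.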
Training & Fine-Tuning Configurations
Fine-tuning was executed under strict low-resource optimization constraints:
* Optimization Framework: Regularized using an Early Stopping mechanism with a 15-epoch patience window based on a validation set matching the benchmark distribution.
* Convergence Behavior: Due to high acoustic data density acting as a natural regularizer, training safely concluded at epoch 20, successfully preventing the vocabulary collapse and lexical overfitting typical of small, pristine speech corpora.
* Augmentation Strategy: This configuration explicitly omitted synthetic spectral masking (SpecAugment=None), demonstrating that physical voice variance from natural human speakers acts as a better regularizer than artificial noise injection in this specific domain.
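The patience-based stopping rule described above can be sketched in a few lines of plain Python. This is an illustration of the general technique, not the training code used for this model. In a Lightning-based framework such as NeMo it is normally handled by an early-stopping callback. The function name `early_stopping_epoch` and the validation values in the usage note are made up for the example.

```python
def early_stopping_epoch(val_wers, patience=15):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    val_wers holds one validation WER per epoch. Training stops once
    `patience` epochs have passed since the last new best value.
    """
    best_value = float("inf")
    best_epoch = 0
    for epoch, value in enumerate(val_wers, start=1):
        if value < best_value:
            # New best validation score: remember it and reset the wait.
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return epoch, best_epoch
    # Ran out of epochs before the patience window was exhausted.
    return len(val_wers), best_epoch
```

For instance, with `patience=3` and the synthetic curve `[30, 25, 24, 26, 25, 27, 28]`, the best epoch is 3 and the rule stops at epoch 6.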
Dataset
The model was fine-tuned on the combined Main + Duplicate expanded subset (totaling 45.6 hours) of the RobotsMali/an-be-kalan-bench dataset.
* Main Split (1.6h): Clean readings of 22 GAIFE project books recorded by 8 unique speakers.
* Duplicate Split (44h): A highly dense, multi-speaker redundant corpus featuring natural human speech variations (pitch, accent, child speech dynamics) reading the identical source literature.
Performance
Performance is disaggregated below across overall results, specific age cohorts, and distinctive book structures using Word Error Rate (WER) and Character Error Rate (CER).
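For readers who want to reproduce the metrics, here is a minimal sketch of corpus-level WER and CER based on Levenshtein distance. It assumes whitespace tokenization for words and counts spaces as characters for CER. The text normalization behind the reported numbers is not stated in this card, so results from this sketch may differ slightly. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level WER (unit="word") or CER (unit="char"), in percent."""
    split = str.split if unit == "word" else list
    errors = sum(edit_distance(split(r), split(h))
                 for r, h in zip(references, hypotheses))
    total = sum(len(split(r)) for r in references)
    return 100.0 * errors / total
```

With the reference `"a b c d"` and the hypothesis `"a x c"`, there is one substitution and one deletion over four reference words, so `error_rate(["a b c d"], ["a x c"])` returns `50.0`.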
Overall Evaluation
| Model | Decoding Branch | WER (%) ↓ | CER (%) ↓ | Status |
|---|---|---|---|---|
| soloni-be-kalan | CTC | 22.0% | 8.0% | Newly deployed model in An be Kalan app |
| soloni-114m-tdt-ctc-v2 (Base) | CTC | 42.0% | 15.0% | Pre-trained Baseline Reference |
Demographic Cohort Breakdown
| Age Cohort | Utterance Count | Baseline WER (%) | Fine-Tuned WER (%) | Key Insights |
|---|---|---|---|---|
| Early Childhood (<10 yrs) | 93 | 56.0% | 29.0% | Remains the single largest acoustic error cluster. |
| Target Cohort (10-15 yrs) | 527 | — | 22.0% | Majority representation; stable acoustic profiles. |
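The absolute WER figures in the tables above can also be read as relative error reductions. The short calculation below uses only the numbers reported in this card. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, fine_tuned):
    """Share of the baseline error removed by fine-tuning, in percent."""
    return 100.0 * (baseline - fine_tuned) / baseline


# WER values taken from the tables in this card.
overall = relative_reduction(42.0, 22.0)
early_childhood = relative_reduction(56.0, 29.0)

print(f"Overall relative WER reduction: {overall:.1f}%")                  # 47.6%
print(f"Early childhood relative WER reduction: {early_childhood:.1f}%")  # 48.2%
```

Both cohorts see roughly the same relative improvement (just under half of the baseline errors removed), even though the early childhood cohort still has the higher absolute WER.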
License
This model is released under the CC-BY-4.0 license.
Model tree for RobotsMali/soloni-be-kalan-v0
Base model: nvidia/parakeet-tdt_ctc-110m