Soloni Be Kalan (TDT-CTC 114M)

soloni-be-kalan-v0 is a domain-specific fine-tuned version of RobotsMali/soloni-114m-tdt-ctc-v2. This model was adapted specifically for Bambara educational materials and child speech applications. The model was fine-tuned using NVIDIA NeMo and supports both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding.

🚨 Important Note

This model, along with its associated resources, is part of an ongoing research effort by the RobotsMali AI4D Lab. Users should be aware that:

* Early Childhood Performance Gap: Although this model reduces WER on early childhood speech (<10 years) from the baseline's 56% to 29%, physiological features unique to young children (unformed acoustic profiles, erratic speech rates) remain an out-of-domain challenge relative to older cohorts.

* Structural Dependencies: The model performs best on fluent, sequential storytelling text (as low as 7% WER), but is limited on short, highly repetitive texts with sparse token sequences (e.g., Kuloriw or Jate), where language model prior biases dominate.

NVIDIA NeMo: Training

To fine-tune or run inference with this model, you will need to install NVIDIA NeMo. We recommend installing it alongside a compatible PyTorch environment.

pip install "nemo-toolkit[asr]"

How to Use This Model

Load Model with NeMo

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="RobotsMali/soloni-be-kalan-v0")

Transcribe Audio

# Expects 16 kHz mono WAV files; audio at other sample rates is resampled automatically, so no manual resampling is needed
asr_model.transcribe(['sample_child_reading.wav'])
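Depending on the installed NeMo version, the value returned by transcribe may be a list of plain strings or a list of hypothesis objects exposing a .text attribute, and some older releases returned a tuple for transducer models. The helper below is a minimal sketch for normalizing these cases; the names to_text and normalize_transcriptions are hypothetical and not part of NeMo, and the tuple handling is an assumption about older releases.

```python
from typing import Any, List


def to_text(hypothesis: Any) -> str:
    """Return the transcript from either a plain string or an object with `.text`."""
    if isinstance(hypothesis, str):
        return hypothesis
    return str(getattr(hypothesis, "text"))


def normalize_transcriptions(result: Any) -> List[str]:
    """Flatten the output of `transcribe` into a list of strings.

    Assumption: when the result is a tuple, its first element holds the
    best hypotheses (behavior of some older transducer model releases).
    """
    if isinstance(result, tuple):
        result = result[0]
    return [to_text(h) for h in result]
```

With the model loaded as above, this could be used as normalize_transcriptions(asr_model.transcribe(['sample_child_reading.wav'])).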

If you encounter RuntimeError: CUDA error: invalid argument, caused by a GPU incompatibility with CUDA Graphs in the TDT decoder, disable the CUDA Graphs decoder in the decoding configuration before transcribing:

decoding_cfg = asr_model.cfg.decoding
decoding_cfg.greedy.use_cuda_graph_decoder = False
asr_model.change_decoding_strategy(decoding_cfg=decoding_cfg)
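Since TDT is the default branch while the model also carries a CTC decoder, it can be useful to switch branches explicitly. The sketch below assumes the hybrid model's change_decoding_strategy method accepts a decoder_type argument, as in recent NeMo releases; the wrapper name load_with_ctc_branch is hypothetical, and the import is deferred so that defining the helper does not itself require NeMo.

```python
def load_with_ctc_branch(model_name: str = "RobotsMali/soloni-be-kalan-v0"):
    """Load the hybrid model and switch inference to its CTC decoder."""
    # Imported lazily so that defining this helper does not itself require NeMo.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
        model_name=model_name
    )
    # decoder_type="ctc" selects the auxiliary CTC head;
    # decoder_type="rnnt" switches back to the transducer (TDT) branch.
    asr_model.change_decoding_strategy(decoder_type="ctc")
    return asr_model
```

After calling asr_model = load_with_ctc_branch(), transcription works as before with asr_model.transcribe([...]).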

Model Architecture

This model utilizes the Hybrid FastConformer-TDT-CTC architecture with 114 million parameters. FastConformer optimizes the standard Conformer model with 8x depthwise-separable convolutional downsampling. It features two independent but jointly trained decoders: an auto-regressive TDT decoder (default branch) and a convolutional decoder optimized via CTC loss.
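To make the effect of 8x downsampling concrete, the arithmetic below estimates how many encoder frames a clip produces. It assumes a 10 ms feature hop, which is a common default for mel-spectrogram front ends but is not stated in this card; the function name is illustrative and the exact count in practice depends on padding.

```python
def approx_encoder_frames(duration_ms: int, hop_ms: int = 10, subsampling: int = 8) -> int:
    """Approximate number of encoder output frames for a clip.

    hop_ms is an assumed feature hop; subsampling is the 8x factor
    described above. Padding effects are ignored.
    """
    frame_ms = hop_ms * subsampling  # 80 ms per encoder frame under these assumptions
    return duration_ms // frame_ms


# A 10-second clip yields roughly 125 encoder frames (12.5 frames per second).
print(approx_encoder_frames(10_000))
```

Under these assumptions each encoder frame covers about 80 ms of audio, which is what keeps both decoders cheap relative to a 4x-downsampled Conformer.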

Training & Fine-Tuning Configurations

Fine-tuning was executed under strict low-resource optimization constraints:

* Optimization Framework: Training was regularized with early stopping (15-epoch patience window), monitored on a validation set matching the benchmark distribution.

* Convergence Behavior: With high acoustic data density acting as a natural regularizer, training concluded at epoch 20 without the vocabulary collapse and lexical overfitting typical of small, pristine speech corpora.

* Augmentation Strategy: This configuration explicitly omitted synthetic spectral masking (SpecAugment=None), demonstrating that physical voice variance from natural human speakers is a better regularizer than artificial augmentation in this specific domain.
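The patience-based early stopping described above can be sketched in a few lines of plain Python. This is an illustration of the mechanism only, not the training code used for this model; in a NeMo or PyTorch Lightning setup the same behavior is usually provided by an early-stopping callback, and the monitored metric (validation WER here) is an assumption.

```python
class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience: int = 15, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")  # lower is better (e.g. validation WER)
        self.bad_epochs = 0

    def step(self, val_metric: float) -> bool:
        """Record one epoch's validation metric; return True if training should stop."""
        if val_metric < self.best - self.min_delta:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with made-up metric values and a short patience for readability.
stopper = EarlyStopping(patience=2)
decisions = [stopper.step(v) for v in [0.50, 0.40, 0.45, 0.41]]
print(decisions)  # [False, False, False, True]
```

The best checkpoint is the one recorded when the metric last improved, so training that stops at a given epoch keeps weights from up to patience epochs earlier.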

Dataset

The model was fine-tuned on the combined Main + Duplicate expanded subset (totaling 45.6 hours) of the RobotsMali/an-be-kalan-bench dataset.

* Main Split (1.6h): Clean readings of 22 GAIFE project books recorded by 8 unique speakers.

* Duplicate Split (44h): A highly dense, multi-speaker redundant corpus featuring natural human speech variations (pitch, accent, child speech dynamics) reading the identical source literature.

Performance

Performance is disaggregated below across overall results, specific age cohorts, and distinctive book structures using Word Error Rate (WER) and Character Error Rate (CER).
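For readers unfamiliar with the two metrics, the sketch below computes WER and CER from the Levenshtein edit distance over words and characters respectively. It is a minimal reference implementation with illustrative function names, not the evaluation script behind the reported numbers; corpus-level scores are usually obtained by summing edits and reference lengths over all utterances rather than averaging per-utterance rates.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted word and one dropped word out of four reference words.
print(wer("a b c d", "a x c"))  # 0.5
```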

Overall Evaluation

Model	Decoding Branch	WER (%) ↓	CER (%) ↓	Status
soloni-be-kalan	CTC	22.0%	8.0%	Newly deployed model in An be Kalan app
soloni-114m-tdt-ctc-v2 (Base)	CTC	42.0%	15.0%	Pre-trained Baseline Reference

Demographic Cohort Breakdown

Age Cohort	Utterance Count	Baseline WER (%)	Fine-Tuned WER (%)	Key Insights
Early Childhood (<10 yrs)	93	56.0%	29.0%	Remains the single largest acoustic error cluster.
Target Cohort (10-15 yrs)	527	—	22.0%	Majority representation; stable acoustic profiles.
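The absolute figures in the tables above can also be read as relative error reductions. The short calculation below uses only the numbers reported in this card; the helper name is illustrative.

```python
def relative_reduction(baseline: float, fine_tuned: float) -> float:
    """Relative error reduction in percent: 100 * (baseline - fine_tuned) / baseline."""
    return 100.0 * (baseline - fine_tuned) / baseline


# Values copied from the tables above (all in percent).
overall_wer = relative_reduction(42.0, 22.0)
overall_cer = relative_reduction(15.0, 8.0)
early_childhood_wer = relative_reduction(56.0, 29.0)

print(f"{overall_wer:.1f} {overall_cer:.1f} {early_childhood_wer:.1f}")  # 47.6 46.7 48.2
```

In relative terms the early childhood cohort improves about as much as the overall benchmark, even though its absolute WER remains the highest.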

License

This model is released under the CC-BY-4.0 license.

Model tree for RobotsMali/soloni-be-kalan-v0

Base model: nvidia/parakeet-tdt_ctc-110m
Finetuned: RobotsMali/soloni-114m-tdt-ctc-v0
Finetuned: RobotsMali/soloni-114m-tdt-ctc-v2
Finetuned: this model
