How to use Reza2kn/visualears-fastconformer-fa-full-ab with NeMo:

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("Reza2kn/visualears-fastconformer-fa-full-ab")
transcriptions = asr_model.transcribe(["file.wav"])
```
VisualEars FastConformer Persian ASR Full A+B
Persian (Farsi) ASR fine-tune for the small, fast VisualEars model, initialized from nvidia/stt_fa_fastconformer_hybrid_large and trained on the full A+B training mix.
Main Checkpoint
fa_fastconformer_ab_final.nemo: final NeMo FastConformer hybrid RNNT/CTC checkpoint from the full A+B run.
Runtime Exports
Canonical runtime exports live in separate derivative model repos so Hugging Face can attach them to this fine-tune as quantized/export variants:
| Repo | Format | Validation |
|---|---|---|
| visualears-fastconformer-fa-full-ab-onnx-fp | ONNX FP fixed CTC core | 100.00% CTC argmax parity |
| visualears-fastconformer-fa-full-ab-onnx-w4 | ONNX Runtime weight-only 4-bit, asymmetric block-32 | 98.61% CTC argmax parity |
| visualears-fastconformer-fa-full-ab-coreml-fp16 | CoreML FP16 fixed CTC core | 99.85% CTC argmax parity |
| visualears-fastconformer-fa-full-ab-coreml-w4 | CoreML 4-bit k-means palettized, compressed variant | 98.06% CTC argmax parity |
| visualears-fastconformer-fa-full-ab-coreml-w4-quality | CoreML 4-bit k-means palettized, quality-first variant | 99.65% CTC argmax parity |
| visualears-fastconformer-fa-full-ab-litert-fp | LiteRT/TFLite FP fixed CTC core | 100.00% CTC argmax parity; 100.00% transcript parity on 16 calibration items |
| visualears-fastconformer-fa-full-ab-litert-w4 | LiteRT/TFLite selected fully-connected weight-only 4-bit | 98.23% frame CTC argmax parity; failed transcript parity at 37.5% on 16 calibration items |
| visualears-fastconformer-fa-full-ab-fp16 | NeMo FP16 reduced-precision checkpoint | 98.0% exact transcript match vs FP base on 200 FLEURS-fa eval clips |
| visualears-fastconformer-fa-full-ab-fp8 | NeMo FP8 via NVIDIA ModelOpt | 18.48% WER / 6.69% CER on 200 FLEURS-fa eval clips; 99.47% WER retention vs FP base |
| visualears-fastconformer-fa-full-ab-nvfp4 | NeMo NVFP4 W4A4 via NVIDIA ModelOpt | 20.33% WER / 7.38% CER on 200 FLEURS-fa eval clips |
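The "CTC argmax parity" figures in the table can be read as the share of output frames whose most likely token matches the reference model. The sketch below is one plausible way to compute such a number, not the validation script used for these exports; the function names and the toy logits are hypothetical.

```python
def argmax(row):
    """Index of the largest value in one frame of logits."""
    return max(range(len(row)), key=row.__getitem__)


def ctc_argmax_parity(ref_logits, test_logits):
    """Percentage of frames whose argmax token id agrees.

    Both inputs are lists of frames, each frame a list of per-token scores
    (assumed layout: frames x vocabulary).
    """
    if len(ref_logits) != len(test_logits):
        raise ValueError("both outputs must have the same number of frames")
    matches = sum(argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits))
    return 100.0 * matches / len(ref_logits)


# Toy data: 4 frames, 3 tokens. The last frame disagrees after "quantization".
ref = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
test = [[0.8, 0.2, 0.0], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6], [0.3, 0.6, 0.1]]

parity = ctc_argmax_parity(ref, test)
print(f"{parity:.2f}% CTC argmax parity")
```

High frame-level parity does not guarantee matching transcripts: a single flipped frame can change the collapsed CTC output, which is consistent with the LiteRT 4-bit row reporting 98.23% frame parity but only 37.5% transcript parity.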
The export repos are fixed-frame acoustic CTC-core artifacts. They take precomputed log-mel features as processed_signal; they are not full raw-audio-to-text pipelines by themselves.
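Because the exports expect a fixed number of feature frames, the caller has to pad or crop the log-mel features before inference. The following is a minimal sketch under stated assumptions: the frame count (1000), the mel-bin count (80), and the zero pad value are hypothetical placeholders, and the real values should be read from the input shape of the specific export.

```python
import numpy as np

FIXED_FRAMES = 1000  # hypothetical; check the export's declared input shape
N_MELS = 80          # hypothetical; check the model's preprocessor config


def fit_to_fixed_frames(features, fixed_frames=FIXED_FRAMES, pad_value=0.0):
    """Pad or crop a (n_mels, frames) log-mel array along the time axis.

    Returns a batched (1, n_mels, fixed_frames) float32 array and the number
    of frames that carry real audio, so padded outputs can be discarded later.
    """
    n_mels, frames = features.shape
    valid = min(frames, fixed_frames)
    out = np.full((n_mels, fixed_frames), pad_value, dtype=np.float32)
    out[:, :valid] = features[:, :valid]
    return out[np.newaxis, ...], valid


# Stand-in features; real ones would come from a log-mel front end.
feats = np.random.default_rng(0).standard_normal((N_MELS, 640)).astype(np.float32)
processed_signal, valid_frames = fit_to_fixed_frames(feats)
print(processed_signal.shape, valid_frames)
```

Keeping the valid-frame count alongside the padded tensor lets the decoding step ignore output frames that correspond only to padding.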
Training Snapshot
- Train manifest: 6,231,918 rows
- Validation manifest: 31,424 rows
- Final train step: 48,687
- NeMo architecture: FastConformer hybrid RNNT/CTC
Benchmarks
External benchmark snapshot from June 10, 2026:
| Decoder | Golha gold-69 WER (%) | FLEURS fa WER (%) | FLEURS fa CER (%) |
|---|---|---|---|
| RNNT greedy | 25.29 | 15.73 | 5.25 |
| CTC + 4-gram LM, alpha=0.2 beta=-1.0 beam=50 | 25.96 | 13.60 | 5.39 |
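The LM setting was calibrated on a FLEURS-256 slice. Relative to RNNT greedy decoding it lowered FLEURS WER (15.73 to 13.60), but FLEURS CER rose slightly (5.25 to 5.39) and Golha WER did not improve (25.29 to 25.96) in this snapshot.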
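For readers unfamiliar with the alpha and beta parameters, the sketch below shows the common shallow-fusion formulation in which alpha weights the language-model log-probability and beta is applied per word. Whether the decoder used for this benchmark combines the terms in exactly this way is an assumption, and the candidate hypotheses and their scores are made up for illustration.

```python
def fused_score(ctc_logp, lm_logp, n_words, alpha=0.2, beta=-1.0):
    """Combine acoustic and LM evidence for one beam hypothesis.

    ctc_logp: log-probability of the hypothesis under the CTC output
    lm_logp:  log-probability of the hypothesis under the n-gram LM
    n_words:  number of words in the hypothesis
    A negative beta penalizes each additional word.
    """
    return ctc_logp + alpha * lm_logp + beta * n_words


# (label, ctc_logp, lm_logp, n_words); all values are hypothetical.
candidates = [
    ("hyp_a", -12.0, -30.0, 6),
    ("hyp_b", -11.5, -45.0, 6),
]

best_acoustic = max(candidates, key=lambda c: c[1])
best_fused = max(candidates, key=lambda c: fused_score(c[1], c[2], c[3]))
print("acoustic only:", best_acoustic[0])
print("with LM fusion:", best_fused[0])
```

In this toy case the LM term overturns a small acoustic preference, which is the mechanism by which a well-matched LM can lower WER on one test set while failing to help on a domain it fits less well.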
Notes
This is a research checkpoint. Normalization and tokenization choices matter for reported WER/CER.
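To illustrate why normalization matters, the sketch below computes WER and CER with a plain edit distance, before and after a small Persian text normalization. The normalization rules here (mapping Arabic yeh and kaf to their Persian forms and treating the zero-width non-joiner as a space) are one possible choice for illustration, not the rules used for the benchmark numbers above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate in percent."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / len(ref)


# Hypothetical normalization: Arabic yeh/kaf -> Persian forms, ZWNJ -> space.
_CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9", "\u200c": " "})


def normalize_fa(text):
    return " ".join(text.translate(_CHAR_MAP).split())


# Reference uses Persian letters and a ZWNJ; hypothesis uses Arabic code points
# and a plain space. The two strings look nearly identical when rendered.
ref = "\u06a9\u062a\u0627\u0628 \u0645\u06cc\u200c\u062e\u0648\u0627\u0646\u0645"
hyp = "\u0643\u062a\u0627\u0628 \u0645\u064a \u062e\u0648\u0627\u0646\u0645"

print("raw WER:", wer(ref, hyp), "raw CER:", round(cer(ref, hyp), 2))
print("normalized WER:", wer(normalize_fa(ref), normalize_fa(hyp)))
```

The same pair of transcripts scores as entirely wrong or entirely correct depending on the normalization, so WER and CER figures are only comparable when the text processing is held fixed.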