How to use Kim-el/failed-swapped-to-malay-tokenizer with NeMo:

    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained("Kim-el/failed-swapped-to-malay-tokenizer")
    transcriptions = asr_model.transcribe(["file.wav"])
⚠️ FAILED TRAINING: Swapped Malay Tokenizer ASR Model
This repository contains the result of a failed attempt to fine-tune nvidia/parakeet-tdt-0.6b-v3 on the Malay BabelSpeech dataset.
🔴 Failure Reason: Token Mapping Collapse
During initialization, the pre-trained English/multilingual SentencePiece tokenizer was replaced with a custom Malay SentencePiece tokenizer using NeMo's change_vocabulary() method.
The swap discarded the pre-trained weights of the decoder and joint network and re-initialized them randomly. Because the training corpus was relatively small (50 hours of audio), the randomly initialized decoder could not learn to align with the features of the frozen pre-trained encoder.
Consequently, the model collapsed during the second stage of fine-tuning (when the encoder was unfrozen) and became trapped in a local minimum, repeating the most common subword tokens indefinitely:
- Validation WER: 109.5%
- Output collapse pattern:
'so dia dia dia dia dia dia dia...'
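A WER above 100% can look like a typo, so here is a minimal sketch of how it arises. WER divides the word-level edit distance by the number of reference words, and every extra word the model emits counts as an insertion error. The reference sentence below is hypothetical (it is not taken from the dataset), and the hypothesis only imitates the collapse pattern shown above; `word_error_rate` and `top_token_share` are illustrative helpers, not the evaluation code used for this run.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def top_token_share(hypothesis: str) -> float:
    """Fraction of hypothesis words taken up by the single most frequent word."""
    words = hypothesis.split()
    if not words:
        return 0.0
    return Counter(words).most_common(1)[0][1] / len(words)


# Hypothetical 6-word reference (an assumption, not from the dataset).
reference = "saya pergi ke pasar pagi tadi"
# Imitation of the collapse pattern: "so" followed by "dia" repeated 9 times.
hypothesis = " ".join(["so"] + ["dia"] * 9)

wer = word_error_rate(reference, hypothesis)
share = top_token_share(hypothesis)
print(f"WER: {wer:.1%}, share of most common word: {share:.0%}")
```

Because the hypothesis is longer than the reference and shares no words with it, the error count exceeds the reference length and the WER passes 100%. A high share for a single word is a cheap signal that decoding has degenerated into repetition.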
💡 Recommendation
For speech corpora smaller than 1000 hours, do not swap the pre-trained SentencePiece tokenizer vocabulary. Instead, keep the standard multilingual/English pre-trained tokenizer and fine-tune the model directly. This preserves the pre-trained decoder and joint network weights and avoids representation collapse, as demonstrated in our successful run: Kim-el/parakeet-0.6-tdt-malay-english-vocab.
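The recommendation can be written as a small guard around the tokenizer swap. This is a sketch under stated assumptions, not the training script used for either run: `maybe_swap_tokenizer` and `CORPUS_HOURS_THRESHOLD` are hypothetical names, the tokenizer directory is a placeholder, and `"bpe"` is assumed to be the NeMo tokenizer type for a SentencePiece model. The only NeMo call is `change_vocabulary()`, the method named above.

```python
# Threshold taken from the recommendation above; treat it as a rule of thumb.
CORPUS_HOURS_THRESHOLD = 1000


def maybe_swap_tokenizer(asr_model, corpus_hours: float, tokenizer_dir: str,
                         tokenizer_type: str = "bpe") -> bool:
    """Swap the tokenizer only when the corpus is large enough.

    Returns True if the vocabulary was changed, False if the pre-trained
    tokenizer (and with it the decoder and joint weights) was kept.
    """
    if corpus_hours < CORPUS_HOURS_THRESHOLD:
        # Small corpus: keep the pre-trained tokenizer and fine-tune directly.
        return False
    # Large corpus: the swap is permitted here, but the decoder and joint
    # network are still re-initialized and must be retrained.
    asr_model.change_vocabulary(
        new_tokenizer_dir=tokenizer_dir,
        new_tokenizer_type=tokenizer_type,
    )
    return True


# Example usage (requires NeMo; the path below is a placeholder):
#   import nemo.collections.asr as nemo_asr
#   model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
#   maybe_swap_tokenizer(model, corpus_hours=50, tokenizer_dir="path/to/malay_tokenizer")
```

With the 50-hour corpus from this run the guard returns False and leaves the model untouched. At or above the threshold the swap is only allowed, not guaranteed to train successfully.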