Pruned Stateless Zipformer RNN-T Streaming ID
Pruned Stateless Zipformer RNN-T Streaming ID is an automatic speech recognition model trained on the following datasets:
Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. ['p', 'ə', 'r', 'b', 'u', 'a', 't', 'a', 'n', 'ɲ', 'a']. Therefore, the model's vocabulary contains the different IPA phonemes found in g2p ID.
This model was trained using the icefall framework. All training was done on a Scaleway RENDER-S VM with a Tesla P100 GPU. All scripts needed for training can be found in the Files and versions tab, along with the training metrics logged via TensorBoard.
Evaluation Results
Simulated Streaming
for m in greedy_search fast_beam_search modified_beam_search; do
./pruned_transducer_stateless7_streaming/decode.py \
--epoch 30 \
--avg 9 \
--exp-dir ./pruned_transducer_stateless7_streaming/exp \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method $m
done
The model achieves the following phoneme error rates on the different test sets:
Decoding | LibriVox | FLEURS | Common Voice |
---|---|---|---|
Greedy Search | 4.87% | 11.45% | 14.97% |
Modified Beam Search | 4.71% | 11.25% | 14.31% |
Fast Beam Search | 4.85% | 12.55% | 14.89% |
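For readers unfamiliar with the metric, a phoneme error rate is usually an edit distance between the reference and hypothesis phoneme sequences, divided by the number of reference phonemes. The following is a minimal sketch of that calculation, not the scoring code used to produce the table above. The function names are hypothetical, and whether word-boundary symbols count as tokens in the reported figures is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference token
                    curr[j - 1] + 1,     # insertion of a hypothesis token
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Corpus-level PER: total edits divided by total reference phonemes."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    return total_edits / total_ref


# Illustrative pair (not taken from any test set): one substitution in five phonemes.
ref = "p u l a ŋ".split()
hyp = "p u l a n".split()
per = phoneme_error_rate([ref], [hyp])
print(f"{per:.2%}")  # 20.00%
```

Summing edits over the whole corpus before dividing weights long utterances more heavily than averaging per-utterance rates would.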
Chunk-wise Streaming
for m in greedy_search fast_beam_search modified_beam_search; do
./pruned_transducer_stateless7_streaming/streaming_decode.py \
--epoch 30 \
--avg 9 \
--exp-dir ./pruned_transducer_stateless7_streaming/exp \
--decoding-method $m \
--decode-chunk-len 32 \
--num-decode-streams 1500
done
The model achieves the following phoneme error rates on the different test sets:
Decoding | LibriVox | FLEURS | Common Voice |
---|---|---|---|
Greedy Search | 5.12% | 12.74% | 15.78% |
Modified Beam Search | 4.78% | 11.83% | 14.54% |
Fast Beam Search | 4.81% | 12.93% | 14.96% |
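Comparing the two tables, chunk-wise streaming gives a slightly higher error rate than simulated streaming in every cell except fast beam search on LibriVox. The short sketch below computes the per-cell gap in percentage points from the figures reported above. The dictionary layout and variable names are only illustrative.

```python
# Phoneme error rates (%) copied from the two tables above.
simulated = {
    "greedy_search": {"LibriVox": 4.87, "FLEURS": 11.45, "Common Voice": 14.97},
    "modified_beam_search": {"LibriVox": 4.71, "FLEURS": 11.25, "Common Voice": 14.31},
    "fast_beam_search": {"LibriVox": 4.85, "FLEURS": 12.55, "Common Voice": 14.89},
}
chunkwise = {
    "greedy_search": {"LibriVox": 5.12, "FLEURS": 12.74, "Common Voice": 15.78},
    "modified_beam_search": {"LibriVox": 4.78, "FLEURS": 11.83, "Common Voice": 14.54},
    "fast_beam_search": {"LibriVox": 4.81, "FLEURS": 12.93, "Common Voice": 14.96},
}

# Gap in percentage points: positive means chunk-wise streaming is worse.
gap = {
    method: {
        test_set: round(chunkwise[method][test_set] - simulated[method][test_set], 2)
        for test_set in simulated[method]
    }
    for method in simulated
}

for method, per_set in gap.items():
    for test_set, delta in per_set.items():
        print(f"{method:22s} {test_set:13s} {delta:+.2f}")
```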
Usage
Download Pre-trained Model
cd egs/bookbot/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id
Inference
To decode with greedy search, run:
./pruned_transducer_stateless7_streaming/jit_pretrained.py \
--nn-model-filename ./tmp/pruned-transducer-stateless7-streaming-id/exp/cpu_jit.pt \
--lang-dir ./tmp/pruned-transducer-stateless7-streaming-id/data/lang_phone \
./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav
Decoding Output
2023-06-21 10:19:18,563 INFO [jit_pretrained.py:217] device: cpu
2023-06-21 10:19:19,231 INFO [lexicon.py:168] Loading pre-compiled tmp/pruned-transducer-stateless7-streaming-id/data/lang_phone/Linv.pt
2023-06-21 10:19:19,232 INFO [jit_pretrained.py:228] Constructing Fbank computer
2023-06-21 10:19:19,233 INFO [jit_pretrained.py:238] Reading sound files: ['./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav']
2023-06-21 10:19:19,234 INFO [jit_pretrained.py:244] Decoding started
2023-06-21 10:19:20,090 INFO [jit_pretrained.py:271]
./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav:
p u l a ŋ | s ə k o l a h | p i t ə r i | s a ŋ a t | l a p a r
2023-06-21 10:19:20,090 INFO [jit_pretrained.py:273] Decoding Done
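The decoded line is a space-separated phoneme sequence. Assuming that `|` marks a word boundary, which is inferred from the sample output above rather than from a documented format, the phonemes can be grouped into words as in this minimal sketch. The helper name `phonemes_to_words` is hypothetical.

```python
def phonemes_to_words(decoded, sep="|"):
    """Group a space-separated phoneme string into per-word phoneme lists.

    Assumes `sep` is the word-boundary symbol (an inference from the sample output).
    """
    words, current = [], []
    for token in decoded.split():
        if token == sep:
            if current:  # skip empty groups caused by repeated separators
                words.append(current)
                current = []
        else:
            current.append(token)
    if current:
        words.append(current)
    return words


decoded = "p u l a ŋ | s ə k o l a h | p i t ə r i | s a ŋ a t | l a p a r"
words = phonemes_to_words(decoded)
print(["".join(w) for w in words])
# ['pulaŋ', 'səkolah', 'pitəri', 'saŋat', 'lapar']
```

The result is still a phonemic transcription. Mapping it back to orthographic words would need a separate lexicon or phoneme-to-grapheme step, which is outside the scope of this sketch.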
Training procedure
Install icefall
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
Prepare Data
cd egs/bookbot_id/ASR
./prepare.sh
Train
export CUDA_VISIBLE_DEVICES="0"
./pruned_transducer_stateless7_streaming/train.py \
--num-epochs 30 \
--use-fp16 1 \
--max-duration 400
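The `--max-duration 400` flag caps the total audio duration, in seconds, pooled into a single batch, so the number of utterances per batch varies with their length. The sketch below shows the idea with a simple greedy grouping. It is a simplified illustration under that assumption, not the sampler icefall actually uses, which also groups utterances of similar length. The function name and the utterance durations are hypothetical.

```python
def batch_by_duration(durations, max_duration):
    """Greedily group utterance durations (seconds) into batches whose
    total duration does not exceed max_duration."""
    batches, current, total = [], [], 0.0
    for d in durations:
        # Start a new batch when adding this utterance would exceed the cap.
        if current and total + d > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches


# Hypothetical data: 100 utterances of 10 seconds each.
durations = [10.0] * 100
batches = batch_by_duration(durations, max_duration=400)
print([len(b) for b in batches])  # [40, 40, 20]
```

With a duration cap, lowering `--max-duration` is the usual way to reduce memory use when a batch does not fit on the GPU.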