`Speech2Text.from_pretrained` not working with this model
Hi @eschmidbauer,
it seems like your current model config is looking for training data that is not contained in this repo.
The other model works because it does have the data that its YAML config points to, for instance:
https://huggingface.co/pyf98/librispeech_100_e_branchformer/tree/main/exp/asr_stats_raw_en_bpe5000_sp/train
You might need that train data here as well.
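A quick way to confirm this kind of problem is to check whether the file named in the config actually exists inside the downloaded repo. The sketch below is only an illustration: it assumes the config has already been parsed into a dict (for example with a YAML loader) and that the statistics path lives under `normalize_conf.stats_file`; the helper name `unresolved_paths` is hypothetical and not part of ESPnet.

```python
import tempfile
from pathlib import Path


def unresolved_paths(config: dict, repo_root: Path) -> list[str]:
    """Return "key: path" entries whose file is missing under repo_root."""
    # Assumption: the normalization statistics are referenced at
    # normalize_conf.stats_file, relative to the repo root.
    candidates = {
        "normalize_conf.stats_file": (config.get("normalize_conf") or {}).get("stats_file"),
    }
    missing = []
    for key, rel in candidates.items():
        # is_file() follows symlinks, so a dangling symlink counts as missing.
        if rel and not (repo_root / rel).is_file():
            missing.append(f"{key}: {rel}")
    return missing


# Toy config standing in for a parsed config.yaml.
config = {
    "normalize": "global_mvn",
    "normalize_conf": {
        "stats_file": "exp/asr_stats_raw_en_bpe5000_sp/train/feats_stats.npz"
    },
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = unresolved_paths(config, root)  # file not there yet
    target = root / config["normalize_conf"]["stats_file"]
    target.parent.mkdir(parents=True)
    target.touch()
    after = unresolved_paths(config, root)  # file now present

print(before)
print(after)
```

Pointing `repo_root` at the local snapshot directory of a downloaded model would list any referenced statistics file that did not make it into the repo.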
Thanks for your question and discussion!
You can find the feats_stats.npz file here:
https://huggingface.co/pyf98/librispeech_100_ctc_e_branchformer/tree/main/scratch/bbjs/peng6/espnet-public/egs2/librispeech_100/asr1/exp/asr_stats_raw_en_bpe5000_sp/train
It generally follows the same file structure, but due to some symbolic link issues, the absolute path was used for this model.
Yes, but in your YAML it's pointing to the wrong location, right?
It pointed to a symbolic link on my machine. However, the packing script didn't resolve it properly.
BTW, the trained model does not need the shape files, which are used for batchifying before training. Instead, it requires the statistics for mean and variance normalization:
https://huggingface.co/pyf98/librispeech_100_ctc_e_branchformer/blob/main/exp/asr_train_asr_ctc_e_branchformer_e12_raw_en_bpe5000_sp/config.yaml#L5158
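To make the role of those statistics concrete, here is a minimal sketch of global mean and variance normalization in plain Python. It assumes the statistics are stored as accumulated sums (a frame count, a per-dimension sum, and a per-dimension sum of squares); that layout and the function names are assumptions for illustration, not a copy of the ESPnet implementation.

```python
import math


def mvn_from_sums(count, feat_sum, feat_sum_square, eps=1e-20):
    """Derive per-dimension mean and std from accumulated statistics."""
    mean = [s / count for s in feat_sum]
    # Var[x] = E[x^2] - E[x]^2, floored at eps to avoid division by zero.
    var = [sq / count - m * m for sq, m in zip(feat_sum_square, mean)]
    std = [math.sqrt(max(v, eps)) for v in var]
    return mean, std


def normalize(frame, mean, std):
    """Apply (x - mean) / std to one feature frame."""
    return [(x - m) / s for x, m, s in zip(frame, mean, std)]


# Toy "training set": two frames with two feature dimensions each.
frames = [[1.0, 10.0], [3.0, 30.0]]
count = len(frames)
feat_sum = [sum(col) for col in zip(*frames)]
feat_sum_square = [sum(x * x for x in col) for col in zip(*frames)]

mean, std = mvn_from_sums(count, feat_sum, feat_sum_square)
normalized = [normalize(f, mean, std) for f in frames]
print(mean, std, normalized)
```

Because inference applies the same normalization as training, the model cannot be built without the statistics file, while the shape files are only needed to form batches during training.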
Cool! Sorry, I'm not sure if I can help further. Did it work?
I updated this repo. Now it should work.
>>> import soundfile
>>> from espnet2.bin.asr_inference import Speech2Text
>>> m = Speech2Text.from_pretrained("pyf98/librispeech_100_ctc_e_branchformer")
Downloading: 100%|██████████| 1.48k/1.48k [00:00<00:00, 3.31MB/s]
Downloading: 100%|██████████| 66.2k/66.2k [00:00<00:00, 1.72MB/s]
Downloading: 100%|██████████| 326k/326k [00:00<00:00, 30.4MB/s]
Downloading: 100%|██████████| 1.40k/1.40k [00:00<00:00, 2.61MB/s]
Downloading: 100%|██████████| 1.83k/1.83k [00:00<00:00, 2.81MB/s]
Downloading: 100%|██████████| 62.3k/62.3k [00:00<00:00, 1.58MB/s]
Downloading: 100%|██████████| 16.1k/16.1k [00:00<00:00, 854kB/s]
Downloading: 100%|██████████| 41.6k/41.6k [00:00<00:00, 970kB/s]
Downloading: 100%|██████████| 16.0k/16.0k [00:00<00:00, 32.4MB/s]
Downloading: 100%|██████████| 25.0k/25.0k [00:00<00:00, 1.14MB/s]
Downloading: 100%|██████████| 37.0k/37.0k [00:00<00:00, 1.93MB/s]
Downloading: 100%|██████████| 29.4k/29.4k [00:00<00:00, 1.40MB/s]
Downloading: 100%|██████████| 31.1k/31.1k [00:00<00:00, 1.67MB/s]
Downloading: 100%|██████████| 34.8k/34.8k [00:00<00:00, 1.62MB/s]
Downloading: 100%|██████████| 17.2k/17.2k [00:00<00:00, 795kB/s]
Downloading: 100%|██████████| 35.7k/35.7k [00:00<00:00, 1.64MB/s]
Downloading: 100%|██████████| 37.9k/37.9k [00:00<00:00, 60.9MB/s]
Downloading: 100%|██████████| 42.8k/42.8k [00:00<00:00, 1.14MB/s]
Downloading: 100%|██████████| 32.5k/32.5k [00:00<00:00, 46.5MB/s]
Downloading: 100%|██████████| 16.2k/16.2k [00:00<00:00, 758kB/s]
Downloading: 100%|██████████| 106M/106M [00:00<00:00, 144MB/s]
Downloading: 100%|██████████| 338/338 [00:00<00:00, 492kB/s]
Fetching 22 files: 100%|██████████| 22/22 [00:05<00:00, 3.86it/s]
>>> m
<espnet2.bin.asr_inference.Speech2Text object at 0x7fa7281e0e80>
Strange, it's not working for me, but it's a different error:
Python 3.9.16 (main, Jan 11 2023, 16:05:54)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile
>>> from espnet2.bin.asr_inference import Speech2Text
>>> m = Speech2Text.from_pretrained("pyf98/librispeech_100_ctc_e_branchformer")
Downloading (…)"feats_stats.npz";: 100%|██████████| 1.40k/1.40k [00:00<00:00, 194kB/s]
Downloading (…)4df38/.gitattributes: 100%|██████████| 1.48k/1.48k [00:00<00:00, 91.5kB/s]
Downloading (…)pe5000_sp/RESULTS.md: 100%|██████████| 1.83k/1.83k [00:00<00:00, 211kB/s]
Downloading (…)00_sp/images/acc.png: 100%|██████████| 16.1k/16.1k [00:00<00:00, 1.70MB/s]
Downloading (…)es/backward_time.png: 100%|██████████| 41.6k/41.6k [00:00<00:00, 6.46MB/s]
Downloading (…)e5000_sp/config.yaml: 100%|██████████| 62.3k/62.3k [00:00<00:00, 1.37MB/s]
Downloading (…)9efaa4df38/README.md: 100%|██████████| 66.2k/66.2k [00:00<00:00, 1.66MB/s]
Downloading (…)"bpe.model";: 100%|██████████| 326k/326k [00:00<00:00, 3.17MB/s]
Downloading (…)p/images/cer_ctc.png: 100%|██████████| 25.0k/25.0k [00:00<00:00, 9.18MB/s]
Downloading (…)ges/forward_time.png: 100%|██████████| 37.0k/37.0k [00:00<00:00, 562kB/s]
Downloading (…)ax_cached_mem_GB.png: 100%|██████████| 29.4k/29.4k [00:00<00:00, 460kB/s]
Downloading (…)00_sp/images/cer.png: 100%|██████████| 16.0k/16.0k [00:00<00:00, 212kB/s]
Downloading (…)images/iter_time.png: 100%|██████████| 31.1k/31.1k [00:00<00:00, 483kB/s]
Downloading (…)0_sp/images/loss.png: 100%|██████████| 34.8k/34.8k [00:00<00:00, 545kB/s]
Downloading (…)/images/loss_att.png: 100%|██████████| 17.2k/17.2k [00:00<00:00, 5.76MB/s]
Downloading (…)/images/loss_ctc.png: 100%|██████████| 35.7k/35.7k [00:00<00:00, 935kB/s]
Downloading (…)00_sp/images/wer.png: 100%|██████████| 16.2k/16.2k [00:00<00:00, 5.96MB/s]
Downloading (…)/optim_step_time.png: 100%|██████████| 42.8k/42.8k [00:00<00:00, 807kB/s]
Downloading (…)mages/train_time.png: 100%|██████████| 32.5k/32.5k [00:00<00:00, 1.02MB/s]
Downloading (…)mages/optim0_lr0.png: 100%|██████████| 37.9k/37.9k [00:00<00:00, 1.24MB/s]
Downloading (…)9efaa4df38/meta.yaml: 100%|██████████| 338/338 [00:00<00:00, 155kB/s]
Downloading (…)ctc.ave_10best.pth";: 100%|██████████| 106M/106M [00:02<00:00, 40.7MB/s]
Fetching 22 files: 100%|██████████| 22/22 [00:03<00:00, 6.19it/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/espnet/espnet2/bin/asr_inference.py", line 516, in from_pretrained
return Speech2Text(**kwargs)
File "/root/espnet/espnet2/bin/asr_inference.py", line 113, in __init__
asr_model, asr_train_args = task.build_model_from_file(
File "/root/espnet/espnet2/tasks/abs_task.py", line 1857, in build_model_from_file
model.load_state_dict(torch.load(model_file, map_location=device))
File "/root/miniconda3/envs/espnet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ESPnetASRModel:
Missing key(s) in state_dict: "decoder.embed.weight", "decoder.decoder.0.weight_ih", "decoder.decoder.0.weight_hh", "decoder.decoder.0.bias_ih", "decoder.decoder.0.bias_hh", "decoder.output.weight", "decoder.output.bias", "decoder.att_list.0.mlp_enc.weight", "decoder.att_list.0.mlp_enc.bias", "decoder.att_list.0.mlp_dec.weight", "decoder.att_list.0.mlp_att.weight", "decoder.att_list.0.loc_conv.weight", "decoder.att_list.0.gvec.weight", "decoder.att_list.0.gvec.bias".
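Every key listed as missing above starts with `decoder.`, so it can help to group mismatched keys by top-level module before digging further. The sketch below does that on plain lists of key names; the helper `summarize_key_mismatch` and the toy key lists are hypothetical, and in practice the inputs would come from the built model's `state_dict().keys()` and the keys of the loaded checkpoint.

```python
from collections import Counter


def summarize_key_mismatch(model_keys, checkpoint_keys):
    """Compare parameter names expected by a model with those in a checkpoint."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # expected but not saved
    unexpected = sorted(checkpoint_keys - model_keys)   # saved but not expected
    # Group missing keys by the first dotted component (the top-level module).
    by_module = dict(Counter(k.split(".", 1)[0] for k in missing))
    return missing, unexpected, by_module


# Toy example: the built model expects decoder weights the checkpoint lacks.
model_keys = [
    "encoder.embed.weight",   # hypothetical key, for illustration only
    "ctc.ctc_lo.weight",      # hypothetical key, for illustration only
    "decoder.embed.weight",
    "decoder.output.weight",
    "decoder.output.bias",
]
checkpoint_keys = ["encoder.embed.weight", "ctc.ctc_lo.weight"]

missing, unexpected, by_module = summarize_key_mismatch(model_keys, checkpoint_keys)
print(missing)
print(unexpected)
print(by_module)
```

If all missing keys fall under a single module, as in the traceback, one thing worth checking is whether the installed code builds that module differently from the code that produced the checkpoint; this is only a suggestion, not a confirmed diagnosis for this model.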