Getting `Can't load the model for 'microsoft/speecht5_hifigan'.`

#1
by aliok-tr - opened

When I run the code in the README, I get this:

OSError: Can't load the model for 'microsoft/speecht5_hifigan'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'microsoft/speecht5_hifigan' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
$ huggingface-cli download microsoft/speecht5_hifigan
Fetching 4 files: 100%
...
/Users/aliok/.cache/huggingface/hub/models--microsoft--speecht5_hifigan/snapshots/bb6f429406e86a9992357a972c0698b22043307d

$ ls -lah /Users/aliok/.cache/huggingface/hub/models--microsoft--speecht5_hifigan/snapshots/bb6f429406e86a9992357a972c0698b22043307d            
total 0
drwxr-xr-x 6 aliok 192 Dec 26 01:41 ./
drwxr-xr-x 3 aliok  96 Dec 26 01:40 ../
lrwxr-xr-x 1 aliok  52 Dec 26 01:41 .gitattributes -> ../../blobs/c7d9f3332a950355d5a77d85000f05e6f45435ea
lrwxr-xr-x 1 aliok  52 Dec 26 01:41 README.md -> ../../blobs/2ad58df7fbd052cfbcac2975f625f3a3fc88cb64
lrwxr-xr-x 1 aliok  52 Dec 26 01:40 config.json -> ../../blobs/0a7082eeb84ebcfd0ae7cfd9e3ce5939dcbe39c4
lrwxr-xr-x 1 aliok  76 Dec 26 01:40 pytorch_model.bin -> ../../blobs/b171e9bcd8a2b50dc9780040478dfa26783a9ee4be012cf5776914f091d6887b
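
The listing shows that every entry in the snapshot directory is a symlink into `blobs/`, so one thing worth ruling out is a link whose target is missing or empty. Below is a minimal sketch of that check, using only the standard library. `check_snapshot` is a hypothetical helper (not part of `huggingface_hub` or `transformers`), the weight file names are taken from the error message above, and the `~/.cache/...` path assumes the default cache location shown in the listing.

```python
import os

# File names the OSError message says it looks for.
WEIGHT_FILES = ("pytorch_model.bin", "tf_model.h5", "model.ckpt", "flax_model.msgpack")


def check_snapshot(snapshot_dir):
    """Return (found_weights, dangling_links) for a hub cache snapshot directory.

    Hypothetical helper for illustration only.
    """
    found, dangling = [], []
    for name in sorted(os.listdir(snapshot_dir)):
        path = os.path.join(snapshot_dir, name)
        # os.path.exists follows symlinks, so a link to a missing blob reports False.
        if os.path.islink(path) and not os.path.exists(path):
            dangling.append(name)
        # os.path.getsize also follows symlinks, so this measures the blob itself.
        elif name in WEIGHT_FILES and os.path.getsize(path) > 0:
            found.append(name)
    return found, dangling


if __name__ == "__main__":
    # Assumed location: the snapshot path printed by huggingface-cli above.
    snapshot = os.path.expanduser(
        "~/.cache/huggingface/hub/models--microsoft--speecht5_hifigan/"
        "snapshots/bb6f429406e86a9992357a972c0698b22043307d"
    )
    if os.path.isdir(snapshot):
        found, dangling = check_snapshot(snapshot)
        print("weight files found:", found)
        print("dangling symlinks:", dangling)
```

If this reports `pytorch_model.bin` as found and no dangling links, the cache entry itself is probably intact and the cause is more likely elsewhere, for example in the installed library versions.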

Any idea what's wrong?

FYI, even with the error, a speech.wav file is generated. However, the audio quality is very poor.

Attaching here:

The code works fine on Colab. The audio quality issue is most likely caused by the training data, which included diverse voices (young and old adults, men and women), so the model produces mixed voice characteristics.
This was an initial experiment, focused mainly on testing how quickly the model could learn Turkish characters. I'm currently developing an improved version with a more controlled dataset using a single voice type, which should address the quality concerns. I trained this model by following that basic tutorial.
