Spiece.Model
Howdy @yhavinga
I'm trying to load the tokenizer on AWS Lambda, but I get this error.
module initialization error: Internal: /sentencepiece/python/bundled/sentencepiece/src/sentencepiece_processor.cc(848) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
Any idea?
It works locally, but for some reason not on Lambda.
When I put a spiece.model file in the model folder (taken from another model, just to see if it works), loading works fine, but the predictions are garbage.
Hey @flexudy
Are you loading the tokenizer with AutoTokenizer.from_pretrained()? And is the tokenizers package recent?
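For reference, this is the kind of load I mean; a minimal sketch, assuming the model id is yhavinga/t5-base-dutch (the repo this thread is on):

```python
from transformers import AutoTokenizer

# Load directly from the HF hub; the model id below is an assumption based on
# this thread. Adjust to whatever you actually pass on Lambda.
tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
print(tokenizer("Dit is een test.").input_ids)
```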
The (sentencepiece) tokenizer of t5-base-dutch was created with HF tools instead of the 'official' sentencepiece tokenizer. One difference is that the latter creates spiece.model, which is absent from the tokenizers created by HF tools; those only create tokenizer.json. A while ago I also got cryptic errors when loading the HF-created tokenizers that had worked without issues a few months earlier. In the end I could solve these problems by either upgrading the tokenizers package, or downgrading if I was already at the latest version. Lately I haven't had any issues anymore, so I suspect recent releases of tokenizers are subjected to more rigorous integration tests.
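If it helps with debugging, you can also check which tokenizer files the repo actually ships; a quick sketch with huggingface_hub, again assuming the yhavinga/t5-base-dutch repo id:

```python
from huggingface_hub import list_repo_files

# List the files in the hub repo and keep the tokenizer-related ones.
# For this repo you should see tokenizer.json but no spiece.model.
files = list_repo_files("yhavinga/t5-base-dutch")
print([f for f in files if "tokenizer" in f or f.endswith(".model")])
```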
hey @yhavinga
Thanks for the quick response.
I am loading the tokenizer using T5TokenizerFast. I currently use transformers 4.18.0, and I've also tried everything between 4.9 and 4.23.
On macOS everything is fine, but not on AWS Lambda.
I thought you might have some clues about why this error would happen.
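To be concrete, the loading code is roughly this (the path is a placeholder; the real call points at the model files bundled with the Lambda deployment, so treat this as a sketch):

```python
from transformers import T5TokenizerFast

# Rough sketch of the load inside the Lambda handler; "/opt/model" is a
# placeholder for wherever the bundled model files end up in the layer/image.
tokenizer = T5TokenizerFast.from_pretrained("/opt/model")
```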
What does pip freeze | grep tokenizers say? I just checked in two environments and it works with 0.12.1 and 0.13.1.
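If pip freeze is awkward to run inside the Lambda runtime, logging the versions from Python at init time works too; a minimal sketch:

```python
import tokenizers
import transformers

# Print the versions the Lambda environment actually resolves at import time.
print("tokenizers:", tokenizers.__version__)
print("transformers:", transformers.__version__)
```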
Also, are there perhaps lingering tokenizer files in the working directory of the script? I once had a bug where the tokenizer would load from the current directory instead of the passed model id on the HF hub.
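A quick way to rule that out is to list any tokenizer-looking files in the working directory before the load; just a sketch:

```python
from pathlib import Path

# Files in the current working directory that could shadow the hub / bundled model.
names = {"tokenizer.json", "spiece.model", "tokenizer_config.json", "special_tokens_map.json"}
print([p for p in Path(".").iterdir() if p.name in names])
```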