Spiece.Model
Howdy @yhavinga
I'm trying to load the tokenizer on AWS Lambda, but I get this error.
module initialization error: Internal: /sentencepiece/python/bundled/sentencepiece/src/sentencepiece_processor.cc(848) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
Any idea?
It works locally, but for some reason not on Lambda.
When I put a spiece.model file in the model folder (taken from another model, just to see if it works), loading works fine, but the predictions are garbage.
Hey @flexudy
Are you loading the tokenizer with AutoTokenizer.from_pretrained()? And is the tokenizers package recent?
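For reference, this is the kind of load I mean; a minimal sketch, assuming the model id is yhavinga/t5-base-dutch (the repo this thread is on):

```python
from transformers import AutoTokenizer

# Load directly from the HF hub; the model id below is an assumption based on
# this thread. Adjust to whatever you actually pass on Lambda.
tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
print(tokenizer("Dit is een test.").input_ids)
```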
The (sentencepiece) tokenizer of t5-base-dutch was created with HF tools instead of the 'official' sentencepiece tokenizer. One difference is that the latter creates spiece.model, which is absent from the tokenizers created by HF tools; those only create tokenizer.json. A while ago I also got cryptic errors when loading the HF-created tokenizers that had worked without issues a few months earlier. In the end I could solve these problems by either upgrading the tokenizers package, or downgrading if I was already at the latest version. Lately I haven't had any issues anymore, so I suspect recent releases of tokenizers are subjected to more rigorous integration tests.
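If it helps with debugging, you can also check which tokenizer files the repo actually ships; a quick sketch with huggingface_hub, again assuming the yhavinga/t5-base-dutch repo id:

```python
from huggingface_hub import list_repo_files

# List the files in the hub repo and keep the tokenizer-related ones.
# For this repo you should see tokenizer.json but no spiece.model.
files = list_repo_files("yhavinga/t5-base-dutch")
print([f for f in files if "tokenizer" in f or f.endswith(".model")])
```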
hey @yhavinga
Thanks for the quick response.
I am loading the tokenizer using T5TokenizerFast. I currently use transformers 4.18.0, and I've also tried everything between 4.9 and 4.23.
On macOS everything is fine, but not on AWS Lambda.
I thought you might have some clues about why this error would happen.
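To be concrete, the loading code is roughly this (the path is a placeholder; the real call points at the model files bundled with the Lambda deployment, so treat this as a sketch):

```python
from transformers import T5TokenizerFast

# Rough sketch of the load inside the Lambda handler; "/opt/model" is a
# placeholder for wherever the bundled model files end up in the layer/image.
tokenizer = T5TokenizerFast.from_pretrained("/opt/model")
```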
What does pip freeze | grep tokenizers say? I just checked in two environments and it works with 0.12.1 and 0.13.1.
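If pip freeze is awkward to run inside the Lambda runtime, logging the versions from Python at init time works too; a minimal sketch:

```python
import tokenizers
import transformers

# Print the versions the Lambda environment actually resolves at import time.
print("tokenizers:", tokenizers.__version__)
print("transformers:", transformers.__version__)
```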
Also, are there perhaps lingering tokenizer files in the working directory of the script? I once had a bug where the tokenizer would load from the current directory instead of the passed model id on the HF hub.
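A quick way to rule that out is to list any tokenizer-looking files in the working directory before the load; just a sketch:

```python
from pathlib import Path

# Files in the current working directory that could shadow the hub / bundled model.
names = {"tokenizer.json", "spiece.model", "tokenizer_config.json", "special_tokens_map.json"}
print([p for p in Path(".").iterdir() if p.name in names])
```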