No maximum length is provided

#1
by bilelomrani - opened

The truncation=True parameter with camembert-large's tokenizer does not seem to have any effect. When running this example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert/camembert-large")
tokenizer(["Some long piece of text", "Some other long piece of text"], padding=True, truncation=True, return_tensors="pt")

the following warning is issued:

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

Inference then raises an exception on long inputs, because the tokenizer does not truncate them to the model's 512-token limit.

Agreed, it would be better if the tokenizer config included 512 as the limit.

However, @bilelomrani, a workaround is to set it yourself after loading the tokenizer:

tokenizer = AutoTokenizer.from_pretrained("camembert/camembert-large")
tokenizer.model_max_length = 512
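
If you load several tokenizers and only want to apply this override when the limit is actually missing, a small guard can help. This is only a sketch: `ensure_max_length` is a hypothetical helper, not part of transformers, and the `threshold` value is an assumption. As far as I know, transformers leaves `model_max_length` at a very large placeholder value (on the order of 1e30) when the tokenizer config specifies no limit, so anything above the threshold is treated as unset. A stand-in object is used below so the snippet runs without downloading the model.

```python
from types import SimpleNamespace


def ensure_max_length(tokenizer, fallback=512, threshold=100_000):
    """Hypothetical helper: set model_max_length only if it looks unset.

    `fallback` is the 512-token limit discussed in this thread.
    `threshold` is an assumed cutoff above which the current value is
    treated as a placeholder rather than a real limit.
    """
    current = getattr(tokenizer, "model_max_length", None)
    if current is None or current > threshold:
        tokenizer.model_max_length = fallback
    return tokenizer


# Stand-in for a tokenizer whose config carries no limit (assumed placeholder value).
unset = SimpleNamespace(model_max_length=int(1e30))
ensure_max_length(unset)

# Stand-in for a tokenizer that already has a real limit; it is left alone.
already_set = SimpleNamespace(model_max_length=128)
ensure_max_length(already_set)

print(unset.model_max_length, already_set.model_max_length)
```

With the real tokenizer you would call `ensure_max_length(tokenizer)` right after `AutoTokenizer.from_pretrained(...)`. Passing `max_length=512` together with `truncation=True` in the tokenizer call itself is another option if you prefer not to modify the tokenizer object.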
