No maximum length is provided
#1
by bilelomrani - opened
The truncation=True parameter with camembert-large's tokenizer does not seem to have any effect. When running this example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("camembert/camembert-large")
tokenizer(["Some long piece of text", "Some other long piece of text"], padding=True, truncation=True, return_tensors="pt")
the following warning is issued:
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Inference then fails with an exception on long sentences, because the tokenizer does not truncate the input to 512 tokens.
Agreed, it would be better if the tokenizer config included 512 as the limit. In the meantime, @bilelomrani, a workaround is to set it yourself after loading the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("camembert/camembert-large")
tokenizer.model_max_length = 512
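For illustration, a minimal sketch in plain Python (not the transformers implementation) of what truncation amounts to once a maximum length is known; the 512 figure mirrors the model's 512-token limit discussed above:

```python
def truncate_ids(token_ids, model_max_length=512):
    """Keep at most model_max_length token ids, mimicking truncation=True."""
    return token_ids[:model_max_length]

long_ids = list(range(600))          # stand-in for 600 token ids
print(len(truncate_ids(long_ids)))   # 512: the excess tokens are dropped
print(len(truncate_ids([1, 2, 3])))  # 3: short inputs pass through unchanged
```

Alternatively, you can pass the limit per call, e.g. tokenizer(texts, truncation=True, max_length=512), which sidesteps the missing default entirely.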
Just uploaded the tokenizer_config.json that fixes this. Thank you
wissamantoun changed discussion status to closed
Excellent, thank you!
Pleasantly surprising, actually; many groups just throw their models onto HF and forget them forever