almanach/camembert-large · No maximum length is provided

Oct 23, 2022

The truncation=True parameter with camembert-large's tokenizer does not seem to have any effect. When running this example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert/camembert-large")
tokenizer(["Some long piece of text", "Some other long piece of text"], padding=True, truncation=True, return_tensors="pt")

the following warning is issued:

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

Inference then raises an exception on long inputs, because the tokenizer does not truncate them to the model's 512-token limit.

AngledLuffa

May 17, 2023

Agreed, it would be better if the tokenizer config included 512 as the limit

AngledLuffa

May 17, 2023

However, @bilelomrani, a workaround is to set it yourself after loading the tokenizer:

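from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert/camembert-large")
# Set the limit manually so that truncation=True has a length to truncate to
tokenizer.model_max_length = 512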
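
If you want the workaround to apply only when the limit is actually missing, a small guard along these lines might help. This is a minimal sketch, not part of transformers: `ensure_max_length` and `UNSET_THRESHOLD` are hypothetical names, and it assumes that a tokenizer without a configured limit reports a very large placeholder value for `model_max_length` (transformers uses `int(1e30)` as far as I know). A stand-in object is used so the snippet runs without downloading the model.

```python
from types import SimpleNamespace

# Hypothetical cut-off: anything above this is treated as "no limit configured".
# Assumption: an unset model_max_length shows up as a huge sentinel value.
UNSET_THRESHOLD = 1_000_000


def ensure_max_length(tokenizer, limit=512):
    """Set tokenizer.model_max_length to `limit` if no usable limit is configured.

    Returns the limit that is in effect afterwards.
    """
    current = getattr(tokenizer, "model_max_length", None)
    if current is None or current > UNSET_THRESHOLD:
        tokenizer.model_max_length = limit
    return tokenizer.model_max_length


# Stand-in for a tokenizer loaded with AutoTokenizer.from_pretrained(...)
fake_tokenizer = SimpleNamespace(model_max_length=int(1e30))
print(ensure_max_length(fake_tokenizer))
```

With a real tokenizer you would call `ensure_max_length(tokenizer)` right after loading it. A tokenizer that already has a sensible limit is left untouched.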

wissamantoun

ALMAnaCH (Inria) org Jul 21

Just uploaded the tokenizer_config.json that fixes this. Thank you!

wissamantoun changed discussion status to closed Jul 21

AngledLuffa

Jul 21

Excellent, thank you!

Pleasantly surprising, actually. Many groups just throw their models onto HF and forget them forever.