Model's pad_token_id is inconsistent with Tokenizer's pad_token_id
Hi.
When fine-tuning ai-forever/ruRoberta-large for a classification task, I noticed that the model's outputs change as the number of padding tokens grows when the tokenizer's padding_side is set to "left". This does not happen when padding_side is set to "right". The cause turned out to be that the model's pad_token_id is not consistent with the tokenizer's pad_token_id.
This discrepancy can be demonstrated with the following code snippet. The outputs are different for max_length=10 and max_length=15.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "ai-forever/ruRoberta-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
print(f"Tokenizer's pad_token_id: {tokenizer.pad_token_id}")
print(f"Model's pad_token_id: {model.config.pad_token_id}")
text = "Привет!"
# Tokenize with max_length=10
inputs_10 = tokenizer(text, padding='max_length', max_length=10, return_tensors="pt")
outputs_10 = model(**inputs_10)
print(outputs_10.logits)
# Tokenize with max_length=15
inputs_15 = tokenizer(text, padding='max_length', max_length=15, return_tensors="pt")
outputs_15 = model(**inputs_15)
print(outputs_15.logits)
The output:
Tokenizer's pad_token_id: 0
Model's pad_token_id: 1
tensor([[-0.2289, -0.3192]], grad_fn=<AddmmBackward0>)
tensor([[-0.3088, -0.4159]], grad_fn=<AddmmBackward0>)
The issue can be resolved by explicitly setting the model's pad_token_id to match the tokenizer's pad_token_id. The corrected code is shown below:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, RobertaConfig
checkpoint = "ai-forever/ruRoberta-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# Update the model's pad_token_id to match the tokenizer's pad_token_id
model_config = model.config.to_dict()
model_config["pad_token_id"] = tokenizer.pad_token_id
config = RobertaConfig(**model_config)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)
print(f"Tokenizer's pad_token_id: {tokenizer.pad_token_id}")
print(f"Model's pad_token_id: {model.config.pad_token_id}")
text = "Привет!"
# Tokenize with max_length=10
inputs_10 = tokenizer(text, padding='max_length', max_length=10, return_tensors="pt")
outputs_10 = model(**inputs_10)
print(outputs_10.logits)
# Tokenize with max_length=15
inputs_15 = tokenizer(text, padding='max_length', max_length=15, return_tensors="pt")
outputs_15 = model(**inputs_15)
print(outputs_15.logits)
The output:
Tokenizer's pad_token_id: 0
Model's pad_token_id: 0
tensor([[-0.2559, -0.5355]], grad_fn=<AddmmBackward0>)
tensor([[-0.2559, -0.5355]], grad_fn=<AddmmBackward0>)
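As a side note, I believe the same fix can be written more compactly by passing pad_token_id directly to from_pretrained, which forwards it to the loaded config. A minimal sketch of that variant, assuming the same checkpoint (I have only verified the longer version above):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "ai-forever/ruRoberta-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left")
# pad_token_id is forwarded to the model config, so the model uses the same
# padding index as the tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, pad_token_id=tokenizer.pad_token_id
)
print(model.config.pad_token_id)  # expected: 0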
The reason is that the mismatched pad token changes the positional encodings, which matters especially when padding_side is set to "left". RoBERTa derives position ids from the input ids by skipping every token equal to the configured pad_token_id. With padding_side="right", the position indices of the non-pad tokens therefore stay the same regardless of how much padding is appended. With padding_side="left", however, the pad tokens precede the real tokens; since they are not recognized as padding (the model expects a different pad_token_id), they are counted as regular tokens and shift the position indices of the real tokens by the padding length, as the sketch below illustrates.
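Here is a small illustration using the internal helper create_position_ids_from_input_ids from transformers.models.roberta.modeling_roberta (the token ids 21 and 57 are arbitrary placeholders, and the exact import path may differ between transformers versions):

import torch
from transformers.models.roberta.modeling_roberta import create_position_ids_from_input_ids

# Two left-padded toy inputs using the tokenizer's pad id 0 and two "real" tokens.
short_pad = torch.tensor([[0, 0, 21, 57]])
long_pad = torch.tensor([[0, 0, 0, 0, 21, 57]])

# With the model's (wrong) padding_idx=1, the pads are counted as regular tokens,
# so the positions of the real tokens depend on how much padding was added.
print(create_position_ids_from_input_ids(short_pad, padding_idx=1))  # tensor([[2, 3, 4, 5]])
print(create_position_ids_from_input_ids(long_pad, padding_idx=1))   # tensor([[2, 3, 4, 5, 6, 7]])

# With the correct padding_idx=0, the pads are skipped and the real tokens get the
# same positions regardless of the padding length.
print(create_position_ids_from_input_ids(short_pad, padding_idx=0))  # tensor([[0, 0, 1, 2]])
print(create_position_ids_from_input_ids(long_pad, padding_idx=0))   # tensor([[0, 0, 0, 0, 1, 2]])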
I am not sure whether this discrepancy also affects the training process. Any insights on this potential issue would be appreciated.