Model's pad_token_id is inconsistent with Tokenizer's pad_token_id

Hi.

When fine-tuning ai-forever/ruRoberta-large for a classification task, I encountered an issue where the model's outputs change as the number of padding tokens increases, but only when the tokenizer's padding_side is set to "left"; with padding_side="right" the outputs stay the same. I found that the cause is a mismatch between the model's pad_token_id and the tokenizer's pad_token_id.

The discrepancy can be demonstrated with the following snippet: the logits differ between max_length=10 and max_length=15, even though the only difference is the amount of left padding.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "ai-forever/ruRoberta-large"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

print(f"Tokenizer's pad_token_id: {tokenizer.pad_token_id}")
print(f"Model's pad_token_id: {model.config.pad_token_id}")

text = "Привет!"

# Tokenize with max_length=10
inputs_10 = tokenizer(text, padding='max_length', max_length=10, return_tensors="pt")
outputs_10 = model(**inputs_10)
print(outputs_10.logits)

# Tokenize with max_length=15
inputs_15 = tokenizer(text, padding='max_length', max_length=15, return_tensors="pt")
outputs_15 = model(**inputs_15)
print(outputs_15.logits)

The output:

Tokenizer's pad_token_id: 0
Model's pad_token_id: 1
tensor([[-0.2289, -0.3192]], grad_fn=<AddmmBackward0>)
tensor([[-0.3088, -0.4159]], grad_fn=<AddmmBackward0>)

The issue can be resolved by reloading the model with a config whose pad_token_id matches the tokenizer's pad_token_id. The corrected code is shown below:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, RobertaConfig

checkpoint = "ai-forever/ruRoberta-large"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Rebuild the config with the tokenizer's pad_token_id and reload the model with it
model_config = model.config.to_dict()
model_config["pad_token_id"] = tokenizer.pad_token_id
config = RobertaConfig(**model_config)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)
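# Note: simply assigning model.config.pad_token_id = tokenizer.pad_token_id on the
# already-loaded model is likely not enough, because RobertaEmbeddings copies
# config.pad_token_id into its padding_idx when the model is constructed. Reloading
# with the corrected config, as above, rebuilds the embeddings with the right
# padding index. (Based on my reading of the transformers RoBERTa implementation;
# please verify against your installed version.)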

print(f"Tokenizer's pad_token_id: {tokenizer.pad_token_id}")
print(f"Model's pad_token_id: {model.config.pad_token_id}")

text = "Привет!"

# Tokenize with max_length=10
inputs_10 = tokenizer(text, padding='max_length', max_length=10, return_tensors="pt")
outputs_10 = model(**inputs_10)
print(outputs_10.logits)

# Tokenize with max_length=15
inputs_15 = tokenizer(text, padding='max_length', max_length=15, return_tensors="pt")
outputs_15 = model(**inputs_15)
print(outputs_15.logits)

The output:

Tokenizer's pad_token_id: 0
Model's pad_token_id: 0
tensor([[-0.2559, -0.5355]], grad_fn=<AddmmBackward0>)
tensor([[-0.2559, -0.5355]], grad_fn=<AddmmBackward0>)

The underlying reason is that RoBERTa derives its position ids from the input ids, skipping every token equal to the configured pad_token_id. Because the model's pad_token_id does not match the tokenizer's, the padding tokens are not recognized as padding and receive position ids of their own. With padding_side="right" this does not affect the non-pad tokens: they come first, so their position indices do not depend on how much padding follows. With padding_side="left", however, the unrecognized pad tokens sit in front of the real tokens and shift their position indices, and the shift grows with the amount of padding, which is why the logits change with max_length.
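
To make the shift concrete, here is a minimal sketch that reuses the helper RoBERTa calls internally to build position ids from input ids (create_position_ids_from_input_ids; the import path below is taken from the current transformers source and may differ across versions). It compares the position ids produced with the model's wrong padding index (1) and the tokenizer's correct one (0) for the two padded lengths:

from transformers import AutoTokenizer
from transformers.models.roberta.modeling_roberta import create_position_ids_from_input_ids

checkpoint = "ai-forever/ruRoberta-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left")

text = "Привет!"
ids_10 = tokenizer(text, padding='max_length', max_length=10, return_tensors="pt").input_ids
ids_15 = tokenizer(text, padding='max_length', max_length=15, return_tensors="pt").input_ids

# Wrong padding index (the model's default, 1): the left-side pad tokens (id 0)
# are counted as real tokens, so the positions of the actual tokens shift as the
# amount of left padding grows.
print(create_position_ids_from_input_ids(ids_10, padding_idx=1))
print(create_position_ids_from_input_ids(ids_15, padding_idx=1))

# Correct padding index (the tokenizer's, 0): pad tokens are skipped and the
# actual tokens keep the same positions no matter how much left padding is added.
print(create_position_ids_from_input_ids(ids_10, padding_idx=0))
print(create_position_ids_from_input_ids(ids_15, padding_idx=0))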

I am not sure whether this discrepancy also affects training. Any insights on this potential issue would be appreciated.
