Default padding_side

#48
by Cyrile - opened

Hello, I've observed a behavior that could be troublesome. Looking at the code of the BloomForSequenceClassification class, the pooling of the last non-padding token works correctly only when the padding side is set to the right. However, for Bloomz-560m, Bloomz-7b1, and Bloomz, the default appears to be set to the left, which could lead to unintended behavior. Wouldn't it be desirable to set the padding side to the right by default for all models?
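
For illustration, the default can be checked directly on the tokenizer (a quick sketch; I'm assuming the checkpoint ships the left default in its tokenizer_config.json, as observed above):

from transformers import AutoTokenizer

# Inspect the padding side shipped with the checkpoint's tokenizer_config.json
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
print(tokenizer.padding_side)  # expected to print "left" given the config discussed here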

Cyrile changed discussion status to closed
BigScience Workshop org

Did you find the information in the end, @Cyrile?

Hello Julien-c, thank you for your interest. I was referring to the "padding_side": "left" entry set as the default in the tokenizer_config.json file. I was simply cautioning about this choice and its incompatibility with the implementation of the BloomForSequenceClassification class, which seems to be written for a padding_side of "right". The fix is to take care to place the padding on the correct side for classification. However, I'm concerned that this default might mislead users who are less experienced with this type of modeling or with the Transformers library...

class BloomForSequenceClassification(BloomPreTrainedModel):
    [...]
    def forward(...):
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device) # <- this is ok for padding_side = 'right' strategy ?
            else:
                sequence_lengths = -1
                logger.warning(
                    f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
                    "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
                )

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
        [...]
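
To make the concern concrete, here is a minimal sketch with toy tensors and a hypothetical pad_token_id of 0: under left padding this index points at the first real token instead of the last one.

import torch

pad_token_id = 0  # hypothetical pad id for this sketch

# Right padding: [a, b, c, PAD, PAD] -> index 2 is the last real token (correct)
right_padded = torch.tensor([[11, 12, 13, 0, 0]])
print(torch.ne(right_padded, pad_token_id).sum(-1) - 1)  # tensor([2])

# Left padding: [PAD, PAD, a, b, c] -> index 2 is the *first* real token,
# while the last real token sits at position 4, so the wrong logit is pooled
left_padded = torch.tensor([[0, 0, 11, 12, 13]])
print(torch.ne(left_padded, pad_token_id).sum(-1) - 1)  # tensor([2])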

I have one last question: is there a reason why you chose to use the padding token ID to locate the last token rather than summing the attention mask?
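
For reference, the alternative I have in mind would look something like this (a sketch with toy tensors, not the library's code; under right padding it yields the same index, and it would still work when only inputs_embeds plus a mask are passed):

import torch

# Derive the last-token index from the attention mask instead of comparing
# input_ids against pad_token_id
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])   # right-padded example
sequence_lengths = attention_mask.sum(dim=-1) - 1  # tensor([2]) -> last real token
print(sequence_lengths)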

Does this mean that when finetuning bigscience/bloom-560m, it's expected to use padding_side = "right", but when finetuning bigscience/bloomz-560m, padding_side = "left" should be used?

I'm seeing some inconsistency between how bloom-560m and bloomz-560m converge during finetuning and I suspect this might be the cause.
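
In case it helps, one way to rule this out is to force the padding side explicitly when loading the tokenizer for classification finetuning (a sketch; the attribute can also be set after loading):

from transformers import AutoTokenizer

# Override the checkpoint's default from tokenizer_config.json so that the
# last-token pooling in BloomForSequenceClassification picks the right position
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m", padding_side="right")
# Equivalent after loading: tokenizer.padding_side = "right"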
