Fine Tuning Transformers

#5
by RicoRausch - opened

https://huggingface.co/blog/paligemma
Hey everyone, maybe a silly question, but shouldn't the tokens for the answer be part of the input_ids? I'm trying to understand why the answer tokens are left out; can someone explain this to me?

import torch  # `processor` and `device` are set up earlier in the blog post

image_token = processor.tokenizer.convert_tokens_to_ids("<image>")

def collate_fn(examples):
    texts = ["answer " + example["question"] + "\n" + example["multiple_choice_answer"]
             for example in examples]
    images = [example["image"].convert("RGB") for example in examples]
    tokens = processor(text=texts, images=images,
                       return_tensors="pt", padding="longest",
                       tokenize_newline_separately=False)
    labels = tokens["input_ids"].clone()
    # Mask padding and image tokens so they are ignored by the loss.
    labels[labels == processor.tokenizer.pad_token_id] = -100
    labels[labels == image_token] = -100
    tokens["labels"] = labels
    tokens = tokens.to(torch.bfloat16).to(device)
    return tokens
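On the -100 values in the snippet above: that is the default `ignore_index` of PyTorch's cross-entropy loss, so any label set to -100 contributes nothing to the gradient. A minimal, self-contained sketch (toy logits, no model involved):

```python
import torch
import torch.nn.functional as F

# Toy logits: batch of 1, sequence length 3, vocabulary of 5.
logits = torch.randn(1, 3, 5)
# Middle position masked with -100, cross_entropy's default ignore_index.
labels = torch.tensor([[2, -100, 4]])

loss_masked = F.cross_entropy(logits.view(-1, 5), labels.view(-1))

# Equivalent: compute the loss only over the unmasked positions.
keep = labels.view(-1) != -100
loss_manual = F.cross_entropy(logits.view(-1, 5)[keep], labels.view(-1)[keep])

assert torch.allclose(loss_masked, loss_manual)
```

So the padding and image positions stay in `input_ids` (the model still attends to them) but are excluded from the loss.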


Hi,

PaliGemma expects the labels to be passed via the "suffix" keyword argument of the processor, which builds the labels tensor for you and masks out the prompt tokens.

See also the demo notebooks here: https://huggingface.co/docs/transformers/main/en/model_doc/paligemma#resources
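Conceptually, passing a suffix means the answer tokens are appended to the input_ids while the labels mask everything before them. A self-contained sketch of that idea (the `build_inputs` helper and the token ids are made up for illustration; the real processor also handles image tokens, special tokens, and padding):

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch cross-entropy

def build_inputs(prefix_ids, suffix_ids):
    # input_ids contain BOTH the prompt and the answer tokens...
    input_ids = prefix_ids + suffix_ids
    # ...but labels mask the prompt, so loss is computed only on the answer.
    labels = [IGNORE_INDEX] * len(prefix_ids) + suffix_ids
    return input_ids, labels

input_ids, labels = build_inputs([5, 8, 13], [21, 34])
assert input_ids == [5, 8, 13, 21, 34]
assert labels == [-100, -100, -100, 21, 34]
```

So the answer tokens are part of input_ids; they are only excluded from the positions the loss is not supposed to supervise.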

RicoRausch changed discussion status to closed
