For fine-tuning deplot, what form of text should be given as input data table?

#3 · opened by sinchir0

I am trying to fine-tune google/deplot following the link and notebook below.

link: https://huggingface.co/docs/transformers/main/en/model_doc/deplot#finetuning
Notebook: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb

What form of text should be given as input data table for fine-tuning deplot?
From Figure 1 of the paper, I think the text format is as follows:

text = """
Header: models | augmented-set | human-set
Row 1: VisionTapas | 67.2 | 22.2
Row 2: Pix2Struct | 82.9 | 30.4
"""

Is this correct?

In the fine-tuning notebook, I think the above data should be passed as texts in the following code (in the collator function):

 text_inputs = processor(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)
fl399 (Google org)

Hi, thanks for the questions.

We used the following format for the ground truth:

models | augmented-set | human-set 
VisionTapas | 67.2 | 22.2
Pix2Struct | 82.9 | 30.4

The Header: and Row x: labels are added using a post-processing function (see here). Or you can see this as an example ground-truth table.

And yes, the ground truths should be put into texts in the collator function.

Hope this helps!

fl399 changed discussion status to closed

@fl399
I understand, thank you so much!

@fl399 Hi! Regarding fine-tuning and the text preprocessing: are \n newline characters converted into <0x0A> by default, or do we need to do this ourselves before passing the text to the tokenizer?

@fl399 or @sinchir0
Could you please share the changes needed for image_captioning_pix2struct.ipynb? I've spent the better part of an afternoon and evening in Colab trying to get it to fine-tune. I've shared the paper, the notebook, and this discussion with Anthropic's Claude and tried Google's Colab LLM, but I can't get it working. Please share what we need to change in the cells of the notebook, specifically the ImageCaptioningDataset class, the collator, and anything else. I, and many others, would be so grateful πŸ™πŸ™

The challenge is that we're using a different processor, initialized with:
processor = Pix2StructProcessor.from_pretrained("google/deplot")

If we reference the model card's fine-tuning section, we see this example for using the processor:
inputs = processor(images=images, text="Generate underlying data table of the figure below:", return_tensors="pt")

And we are pointed to the image_captioning_pix2struct notebook.

The code below is from the original notebook, where "text" is essentially the text-based label/answer for the "image". Can we get the remaining updates so that at least the example works? I have fiddled with it until it trains, but I'm left with a lot of uncertainty about whether it's making progress. I'd like to know the code is implemented correctly before dedicating an A100 to the job for a few hours.

import torch
from torch.utils.data import Dataset, DataLoader

MAX_PATCHES = 1024

class ImageCaptioningDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        # Image-only preprocessing: produces flattened_patches and attention_mask
        encoding = self.processor(images=item["image"], return_tensors="pt", add_special_tokens=True, max_patches=MAX_PATCHES)
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        # Keep the ground-truth table string; it is tokenized later in the collator
        encoding["text"] = item["text"]
        return encoding

def collator(batch):
    new_batch = {"flattened_patches": [], "attention_mask": []}
    texts = [item["text"] for item in batch]

    # Tokenize the ground-truth tables to build the labels
    text_inputs = processor(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)

    new_batch["labels"] = text_inputs.input_ids

    for item in batch:
        new_batch["flattened_patches"].append(item["flattened_patches"])
        new_batch["attention_mask"].append(item["attention_mask"])

    new_batch["flattened_patches"] = torch.stack(new_batch["flattened_patches"])
    new_batch["attention_mask"] = torch.stack(new_batch["attention_mask"])

    return new_batch

You need to change the line below.
--------------------------- before ------------------------------------
text_inputs = processor(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)
--------------------------- after ------------------------------------
text_inputs = processor.tokenizer(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)

thank you @SungBeom πŸ™
