For fine-tuning deplot, what form of text should be given as input data table?

#3 · opened by sinchir0

I am trying to fine-tune google/deplot following the link and notebook below.

link: https://huggingface.co/docs/transformers/main/en/model_doc/deplot#finetuning
Notebook: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb

What form of text should be given as input data table for fine-tuning deplot?
From Figure 1 of the paper, I think the text format is as follows:

text = """
Header: models | augmented-set | human-set
Row 1: VisionTapas | 67.2 | 22.2
Row 2: Pix2Struct | 82.9 | 30.4
"""

Is this correct?

In the fine-tuning notebook, I think the above data should be passed as texts in the following code (in the collator function):

 text_inputs = processor(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)
fl399 (Google org)

Hi, thanks for the questions.

We used the following format for the ground truth:

models | augmented-set | human-set 
VisionTapas | 67.2 | 22.2
Pix2Struct | 82.9 | 30.4

The Header: and Row x: labels are added using a post-processing function (see here). Or you can see this as an example ground-truth table.

And yes, the ground truths should be put into texts in the collator function.

Hope this helps!

fl399 changed discussion status to closed

@fl399
I understand, thank you so much!

@fl399 Hi! Regarding fine-tuning and the text preprocessing: are \n newline characters converted into <0x0A> by default, or do we need to do this ourselves before passing the text to the tokenizer?

@fl399 or @sinchir0
Could you please share the changes needed for image_captioning_pix2struct.ipynb? I've spent the better part of an afternoon and evening in Colab trying to get it to fine-tune. I've shared the paper, the notebook, and this discussion with Anthropic's Claude and tried Google's Colab LLM, but I can't get it working. Please share what we need to change in the cells of the notebook, specifically the ImageCaptioningDataset class, the collator, and anything else. I, and many others, would be so grateful πŸ™πŸ™

The challenge is that we're using a different processor, initialized with:
processor = Pix2StructProcessor.from_pretrained("google/deplot")

If we reference the model card's fine-tuning section, we see this example for using the processor:
inputs = processor(images=images, text="Generate underlying data table of the figure below:", return_tensors="pt")

And we are pointed to the image_captioning_pix2struct notebook.

The code below is from the original notebook, where "text" is essentially the text-based label/answer for the "image". Can we get the remaining updates so that at least the example works? I have fiddled with it until it trains, but I'm left with a lot of uncertainty about whether it's making progress. I'd like to know the code is implemented correctly before dedicating an A100 to the job for a few hours.

import torch
from torch.utils.data import Dataset, DataLoader

MAX_PATCHES = 1024

class ImageCaptioningDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        # Image-only preprocessing: produces flattened_patches and attention_mask
        encoding = self.processor(images=item["image"], return_tensors="pt", add_special_tokens=True, max_patches=MAX_PATCHES)
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        # Keep the ground-truth table string; it is tokenized later in the collator
        encoding["text"] = item["text"]
        return encoding

def collator(batch):
    new_batch = {"flattened_patches": [], "attention_mask": []}
    texts = [item["text"] for item in batch]

    # Tokenize the ground-truth tables to build the labels
    text_inputs = processor(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)

    new_batch["labels"] = text_inputs.input_ids

    for item in batch:
        new_batch["flattened_patches"].append(item["flattened_patches"])
        new_batch["attention_mask"].append(item["attention_mask"])

    new_batch["flattened_patches"] = torch.stack(new_batch["flattened_patches"])
    new_batch["attention_mask"] = torch.stack(new_batch["attention_mask"])

    return new_batch

You need to change the line below.
--------------------------- before ------------------------------------
text_inputs = processor(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)
--------------------------- after ------------------------------------
text_inputs = processor.tokenizer(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)

thank you @SungBeom πŸ™
