microsoft/Florence-2-base-ft · Finetuning Token Length Limit

Hi,

I've been trying to follow the fine-tuning notebooks below and I'm getting stuck on token length issues.
https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-finetune-florence-2-on-detection-dataset.ipynb?ref=blog.roboflow.com#scrollTo=zqDWEWDcaSxN
https://colab.research.google.com/drive/1Y8GVjwzBIgfmfD3ZypDX5H1JA_VG0YDL?usp=sharing
https://colab.research.google.com/drive/1hKDrJ5AH_o7I95PtZ9__VlCTNAo1Gjpf?usp=sharing

My samples typically tokenize to lengths in the 1040-1200 range, which throws an error during training:

Training Epoch 1/1:   0%|          | 0/10 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (1038 > 1024). Running this sequence through the model will result in indexing errors
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [174,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.
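To see which samples would overflow before training starts, here is a minimal sketch that computes a lower bound on the target length from the suffix string alone. It assumes the notebook's suffix format (`label<loc_x1><loc_y1><loc_x2><loc_y2>...`) and that each `<loc_N>` is a single token in the Florence-2 vocabulary; `min_suffix_tokens` is a hypothetical helper, not part of any library. The exact count would have to come from the real tokenizer, since labels can take more than one token each.

```python
import re

# Assumption: every <loc_N> marker maps to exactly one token.
LOC_TOKEN = re.compile(r"<loc_\d+>")

def min_suffix_tokens(suffix: str) -> int:
    """Lower bound on the tokenized length of a detection suffix."""
    loc_tokens = len(LOC_TOKEN.findall(suffix))
    # Text between location tokens is a label; count at least one token per label.
    labels = [part for part in LOC_TOKEN.split(suffix) if part.strip()]
    # +2 for the begin/end-of-sequence tokens the tokenizer adds.
    return loc_tokens + len(labels) + 2

example = (
    "door<loc_10><loc_20><loc_300><loc_400>"
    "window<loc_5><loc_6><loc_70><loc_80>"
)
print(min_suffix_tokens(example))  # 8 loc tokens + 2 labels + 2 special = 12
```

Under these assumptions each annotation costs at least five tokens no matter how large the image is, so 150 annotations already need at least 752 tokens before counting multi-token labels.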

I was initially trying to use two datasets, with images capped at 960px and 1600px respectively (the originals can be up to 4000px before scaling). The images are large because they are construction drawings, and I would ideally avoid segmenting them into tiles so that related objects keep their context in the same frame. The larger images have considerably more annotations (up to 150 per frame), so I experimented with the smaller-image dataset, which has a maximum of 40 annotations in a single frame.

Resizing those images from 960px down to 320px max didn't seem to change much. Capping the number of annotations per frame did at least let the model fine-tune.
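For reference, the capping I tried can be sketched roughly as below. It assumes each annotation in the suffix is a label followed by exactly four `<loc_N>` tokens, as in the notebooks; `cap_annotations` is a hypothetical helper name. Note that it silently drops everything past the cap, which is exactly the loss of labels I would like to avoid.

```python
import re

# One annotation = label text followed by exactly four location tokens.
ANNOTATION = re.compile(r"([^<]+)((?:<loc_\d+>){4})")

def cap_annotations(suffix: str, max_annotations: int) -> str:
    """Keep only the first `max_annotations` annotations of a suffix."""
    kept = ANNOTATION.findall(suffix)[:max_annotations]
    return "".join(label + locs for label, locs in kept)

suffix = (
    "door<loc_1><loc_2><loc_3><loc_4>"
    "window<loc_5><loc_6><loc_7><loc_8>"
    "valve<loc_9><loc_10><loc_11><loc_12>"
)
capped = cap_annotations(suffix, max_annotations=2)
print(capped)
```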

Are there suggestions for getting around the 1024-token maximum? Would it be technically sound to put multiple copies of each image in the dataset, with each copy having annotations for only a single class, to reduce the token length per sample? I fear that having multiple copies of the same image, each with the other trained classes left unlabeled in that frame, would cause issues. Also, this isn’t like a COCO/LVIS structure where I can pass NOT_EXHAUSTIVE_LABELS.
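To make the per-class idea concrete, here is a rough sketch of the split I have in mind. It assumes JSONL records with `image`, `prefix`, and `suffix` keys, as in the Roboflow notebook; `split_record_by_class` and the file name are hypothetical. Every output record still carries the plain OD prefix, so nothing tells the model that the other classes were left out on purpose, which is the part I am unsure about.

```python
import re
from collections import defaultdict

# One annotation = label text followed by exactly four location tokens.
ANNOTATION = re.compile(r"([^<]+)((?:<loc_\d+>){4})")

def split_record_by_class(record: dict) -> list[dict]:
    """Return one record per class, each holding only that class's boxes."""
    by_class = defaultdict(list)
    for label, locs in ANNOTATION.findall(record["suffix"]):
        by_class[label].append(label + locs)
    return [
        {"image": record["image"], "prefix": record["prefix"], "suffix": "".join(parts)}
        for parts in by_class.values()
    ]

record = {
    "image": "sheet_001.jpg",  # hypothetical file name
    "prefix": "<OD>",
    "suffix": (
        "door<loc_1><loc_2><loc_3><loc_4>"
        "window<loc_5><loc_6><loc_7><loc_8>"
        "door<loc_9><loc_10><loc_11><loc_12>"
    ),
}
per_class = split_record_by_class(record)
for r in per_class:
    print(r["suffix"])
```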

I actually wanted to train not only OD but also deeper descriptions for each class, so that the model can understand the construction/engineering context for these novel classes (similar to this question: https://huggingface.co/microsoft/Florence-2-large/discussions/32). I initially tried adding two types of annotations to the JSONL for each image: one line with the OD prefix and another line with the DENSE_REGION_CAPTION prefix. However, if I can't even get the number of annotations to work for OD, this definitely won't work. Any suggestions?
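For clarity, this is roughly how I built the two lines per image. It is only a sketch: `build_records`, the file name, the labels, and the descriptions are made up for illustration, the coordinates are assumed to be already quantized to the 0-999 location bins, and I am assuming the dense-region-caption target uses the same text-then-four-locations layout as OD.

```python
import json

def build_records(image: str, boxes: list) -> list:
    """Build one <OD> record and one <DENSE_REGION_CAPTION> record for an image.

    `boxes` holds (label, description, (x1, y1, x2, y2)) tuples with
    coordinates already quantized to integers in 0..999.
    """
    def locs(box):
        return "".join(f"<loc_{v}>" for v in box)

    od_suffix = "".join(label + locs(box) for label, _, box in boxes)
    caption_suffix = "".join(desc + locs(box) for _, desc, box in boxes)
    return [
        {"image": image, "prefix": "<OD>", "suffix": od_suffix},
        {"image": image, "prefix": "<DENSE_REGION_CAPTION>", "suffix": caption_suffix},
    ]

boxes = [
    ("gate valve", "gate valve on the supply line", (120, 340, 180, 400)),
    ("fire door", "rated fire door in a corridor wall", (500, 610, 560, 700)),
]
lines = [json.dumps(r) for r in build_records("sheet_001.jpg", boxes)]
print("\n".join(lines))
```

The caption line is necessarily longer than the OD line for the same boxes, so it runs into the 1024 limit even sooner.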