What is the minimum number of examples to fine-tune IDEFICS-2 to get the expected output?

#14
opened by marksuccsmfewercoc

What is the minimum number of examples needed to fine-tune IDEFICS-2 to get the expected output? Should I fine-tune it on 100 examples or more? Suppose the task is to take a prompt like "How many languages can you see in the images?" and get the answer in a format like { languages: ['English', 'Chinese', 'French'] }.

HuggingFaceM4 org

Hi @marksuccsmfewercoc , this is a hard question to answer in a general way. The minimum number of samples you will need to get decent performance is very task specific, so I would encourage you to find out experimentally. Intuitively, the closer your task is to what we trained on (variations of VQA), the more sample-efficient fine-tuning will be.
As for languages, we only trained on English data (there may have been other languages, but not on purpose). I might be misunderstanding your task, though; I am just extrapolating from the mention of Chinese and French :)

@VictorSanh Okay, can you please give me a rough number, e.g. 100, 1,000, 10,000? I just want to make sure I don't waste my time, since I could use previous conventions as a reference.

HuggingFaceM4 org

There have been many reports of people fine-tuning on just 1k extremely high-quality samples. That's a good "minimum-viable-number-of-instances" rule of thumb you can use.
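
For reference, here is a rough sketch of what one training sample in your target format could look like when prepared with the idefics2 processor's chat template; the image path and the language list below are just placeholders:

```python
from transformers import AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

# One (image, question, target-answer) sample; the file name and the labels
# below are made up for illustration.
image = load_image("street_sign.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How many languages can you see in the images?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": '{ "languages": ["English", "Chinese", "French"] }'},
    ]},
]

# The whole conversation (question + target answer) is rendered with the chat
# template and tokenized together; a data collator would then mask everything
# except the assistant turn when building the labels.
text = processor.apply_chat_template(messages, add_generation_prompt=False)
batch = processor(text=text, images=[image], return_tensors="pt")
print(batch["input_ids"].shape)
```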

Well, it depends on the subject too, no? And on the style you want the model to output? And the task may influence the number too: image captioning? Prompted image captioning? Few-shot prompting? Visual question answering? Image classification? Image-guided text generation? Am I wrong?

HuggingFaceM4 org

yes, as I mentioned, this is a very experimental question!

@VictorSanh Is idefics2-8b better than LLaVA 1.6 34B?

HuggingFaceM4 org

Hi @marksuccsmfewercoc ,
you will find this section interesting: https://huggingface.co/HuggingFaceM4/idefics2-8b#technical-summary :) in particular the table with all the numbers, if you expand it.
In summary, idefics2-8b is very competitive with its 30B open counterparts.

HuggingFaceM4 org

closing this discussion, feel free to re-open if necessary!

VictorSanh changed discussion status to closed

@VictorSanh But still, can you give me a rough number of how many examples to fine-tune the model on, so I don't waste my time?

You serious 😐

I am. I tried fine-tuning on 20 examples, but it didn't give me the expected output in the particular format.

Hi @VictorSanh , I fine-tuned the model on 50 examples. It gives me output in the correct format, but it generates the wrong answer. Is this expected, or do I need to fine-tune on more examples, or maybe set the system message?

HuggingFaceM4 org

Hi @marksuccsmfewercoc ,
I am afraid that 50 samples is still too low to expect fine-tuning to work consistently and reliably in the majority of cases.
A safer alternative when you have so few samples is few-shot prompting (also called in-context learning).
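
As a rough sketch (not a full recipe), in-context prompting with the transformers API could look like the following; the image files and the formatted answer are placeholders, and you would typically stack a few such example turns before the real query:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16
).to("cuda")

# One in-context example (image + question + answer already in the target
# JSON format) followed by the real query image. Paths are placeholders.
example_image = load_image("example_sign.jpg")
query_image = load_image("query_sign.jpg")
question = "How many languages can you see in the images?"

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": '{ "languages": ["English", "French"] }'},
    ]},
    # ...repeat the pattern above for K in-context examples...
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]},
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[example_image, query_image], return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```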

@VictorSanh So should I use few-shot prompting instead of fine-tuning? How many examples do I need to fine-tune for a specific task? Any rough number in mind, like 100, 200, 1,000, or even more?

HuggingFaceM4 org

I would not recommend any type of fine-tuning given that you have so few samples; just run straight inference with K in-context samples.
As for fine-tuning, you can refer to the eyeballed rule of thumb I mentioned earlier.

20 images... may I know what the task is? Also, if you look at other models or the IDEFICS Colab, look at the datasets they use: TheFusion21/PokemonCards has ~13k samples, and moondream uses project-sloth/captcha-images (~6k) in its example.
Also, what's your task? If it is a labeling/captioning problem, you can leverage existing models like LLaVA, moondream, Bunny, or XComposer to help you out; microsoft/kosmos-2-patch14-224 is also great as an assistant for writing captions. https://github.com/jhc13/taggui is a good repo that bundles some of these models and helps you move fast, as does https://labelstud.io/. And if it is an image problem... Stable Diffusion may help you generate synthetic data, though it won't be "high quality". Also, I don't know whether something like LIMA (Less Is More for Alignment) works with images; maybe @VictorSanh knows better. But still, we are talking about a minimum of ~1,000 samples.
