What is the minimum number of examples to fine-tune IDEFICS-2 to get the expected output?

#14
opened by marksuccsmfewercoc

What is the minimum number of examples needed to fine-tune IDEFICS-2 to get the expected output? Should I fine-tune it on 100 examples or more? Suppose the task is to take a prompt like "How many languages can you see in the images?" and get the answer in a format like { languages: ['English', 'Chinese', 'French'] }.

HuggingFaceM4 org

Hi @marksuccsmfewercoc , this is a hard question to answer in a general way. The minimum number of samples you will need to get decent performance is very task specific, so I would encourage you to find out experimentally. Intuitively, the closer your task is to what we trained on (variations of VQA), the more sample-efficient fine-tuning will be.
As for languages, we only trained on English data (there may have been other languages, but not on purpose). I might be misunderstanding your task, though; I am just extrapolating from the mention of Chinese and French :)

@VictorSanh Okay, can you please give me a rough number, e.g. 100, 1,000, 10,000? I just want to make sure I don't waste my time, since I could use previous conventions as a reference.

HuggingFaceM4 org

There have been many reports of people fine-tuning on just 1k extremely high-quality samples. That's a good "minimum-viable-number-of-instances" rule of thumb you can use.
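
For reference, here is a rough sketch of what one training sample in your target format could look like when prepared with the idefics2 processor's chat template; the image path and the language list below are just placeholders:

```python
from transformers import AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

# One (image, question, target-answer) sample; the file name and the labels
# below are made up for illustration.
image = load_image("street_sign.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How many languages can you see in the images?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": '{ "languages": ["English", "Chinese", "French"] }'},
    ]},
]

# The whole conversation (question + target answer) is rendered with the chat
# template and tokenized together; a data collator would then mask everything
# except the assistant turn when building the labels.
text = processor.apply_chat_template(messages, add_generation_prompt=False)
batch = processor(text=text, images=[image], return_tensors="pt")
print(batch["input_ids"].shape)
```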

Well, it depends on the subject too, no? And on the style you want the model to output? And the task may influence the number too: image captioning? Prompted image captioning? Few-shot prompting? Visual question answering? Image classification? Image-guided text generation? Am I wrong?

HuggingFaceM4 org

yes, as I mentioned, this is a very experimental question!

@VictorSanh Is idefics2-8b better than LLaVA 1.6 34B?

HuggingFaceM4 org

Hi @marksuccsmfewercoc ,
you will find this section interesting: https://huggingface.co/HuggingFaceM4/idefics2-8b#technical-summary :) in particular the table with all the numbers, if you expand it.
In summary, idefics2-8b is very competitive with its 30B open counterparts.

HuggingFaceM4 org

closing this discussion, feel free to re-open if necessary!

VictorSanh changed discussion status to closed

@VictorSanh But still, can you give me a rough number of how many examples to fine-tune the model on, so I don't waste my time?

You serious 😐

I am. I tried fine-tuning on 20 examples, but it didn't give me the expected output in the particular format.

Hi @VictorSanh , I fine-tuned the model on 50 examples. It gives me output in the correct format, but it generates the wrong answer. Is this expected, or do I need to fine-tune on more examples, or maybe set the system message?

HuggingFaceM4 org

Hi @marksuccsmfewercoc ,
I am afraid that 50 samples is still too low to expect fine-tuning to work consistently and reliably in the majority of cases.
A safer alternative when you have so few samples is few-shot prompting (also called in-context learning).
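
As a rough sketch (not a full recipe), in-context prompting with the transformers API could look like the following; the image files and the formatted answer are placeholders, and you would typically stack a few such example turns before the real query:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16
).to("cuda")

# One in-context example (image + question + answer already in the target
# JSON format) followed by the real query image. Paths are placeholders.
example_image = load_image("example_sign.jpg")
query_image = load_image("query_sign.jpg")
question = "How many languages can you see in the images?"

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": '{ "languages": ["English", "French"] }'},
    ]},
    # ...repeat the pattern above for K in-context examples...
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]},
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[example_image, query_image], return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```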

@VictorSanh So should I use few-shot prompting instead of fine-tuning? How many examples do I need to fine-tune for a specific task? Any rough number in mind, like 100, 200, 1,000, or even more?

HuggingFaceM4 org

I would not recommend any type of fine-tuning given that you have so few samples; just run straight inference with K in-context samples.
As for fine-tuning, you can refer to the eyeballed rule of thumb I mentioned earlier.

20 images... may I know what the task is? Also, if you look at other models or the IDEFICS Colab, look at the datasets they use: TheFusion21/PokemonCards has ~13k samples, and moondream uses project-sloth/captcha-images (~6k) in its example.
Also, what's your task? If it is a labeling/captioning problem, you can leverage existing models like LLaVA, moondream, Bunny, or XComposer to help you out; microsoft/kosmos-2-patch14-224 is also great as an assistant for writing captions. https://github.com/jhc13/taggui is a good repo that bundles some of these models and helps you move fast, as does https://labelstud.io/. And if it is an image problem... Stable Diffusion may help you generate synthetic data, though it won't be "high quality". Also, I don't know whether something like LIMA (Less Is More for Alignment) works with images; maybe @VictorSanh knows better. But still, we are talking about a minimum of ~1,000 samples.
