Fine-tuning on SNLI-VE (visual entailment) with transformers models & Trainer

#2 opened by frgx

Hi,

Thank you for publishing your high-performing models in the 'transformers' library format; it is a good step toward making them usable as foundation models!

Has anyone succeeded in reproducing the accuracy reported in the OFA paper on SNLI-VE (visual entailment), using from_pretrained() models and fine-tuning with the Hugging Face Trainer()? I tried with OFA-tiny and OFA-base, but although validation accuracy progresses normally during training and the confusion matrix looks reasonable, final accuracy ends up about 10 points below the expected performance. I tried to match all the relevant hyperparameters (the same prompt, including spaces and quotation marks; image mean & std at 0.5; the model-specific image size; encoder_drop_path_rate = 0.1; decoder_drop_path_rate = 0.1; 5 epochs; warmup ratio = 0.06; peak lr = 3e-5, then decreasing; AdamW weight decay = 1e-2; ...), but I may have missed something. A sketch of my setup is below.
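
For reference, here is a minimal sketch of the setup, not my exact script: the checkpoint name, batch size, scheduler type, and the commented-out OFA-specific lines are assumptions or placeholders, and it relies on the OFA-Sys fork of transformers for OFAModel/OFAConfig.

```python
# Minimal sketch of the fine-tuning setup described above. Assumptions are
# marked in comments; dataset loading and metrics are omitted.
import torch
from torchvision import transforms
from transformers import Trainer, TrainingArguments

resolution = 384  # assumption: model-specific input size for OFA-base; OFA-tiny uses a smaller one

# Image preprocessing with mean & std at 0.5, as mentioned above
patch_resize_transform = transforms.Compose([
    transforms.Resize((resolution, resolution),
                      interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

training_args = TrainingArguments(
    output_dir="ofa-base-snli-ve",
    num_train_epochs=5,
    warmup_ratio=0.06,
    learning_rate=3e-5,             # peak lr
    lr_scheduler_type="linear",     # assumption: linear decay after warmup
    weight_decay=1e-2,              # AdamW weight decay
    per_device_train_batch_size=8,  # assumption: batch size not stated above
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=torch.cuda.is_available(),
)

# Drop-path rates go into the model config before the weights are loaded
# (assuming the fork exposes OFAConfig/OFAModel as in the OFA README):
# config = OFAConfig.from_pretrained("OFA-Sys/ofa-base")
# config.encoder_drop_path_rate = 0.1
# config.decoder_drop_path_rate = 0.1
# model = OFAModel.from_pretrained("OFA-Sys/ofa-base", config=config)
#
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=val_ds,
#                   data_collator=collate_fn, compute_metrics=accuracy_fn)
# trainer.train()
```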

Thanks for your interest :-)

François

Sorry, I have never tried the HF Trainer. But my colleagues recently added support for training OFA with our HF code; see this repo: https://github.com/OFA-Sys/OFA-Compress
