This demo loads the `FlaxCLIPVisionBertForSequenceClassification` model from the `model` directory of this repository. The checkpoint is loaded from `flax-community/clip-vision-bert-vqa-ft-6k`, which was pre-trained for 60k steps and then fine-tuned for 6k steps. 100 random validation set examples are provided in `dummy_vqa_multilingual.tsv`, with the corresponding images in the `images/val2014` directory.
We provide an English translation of the question for users who are not familiar with the other languages. The translation is done with `mtranslate`, which keeps things flexible but requires an internet connection because it uses the Google Translate API.
The model predicts the answer from a fixed list of 3129 candidate answers, whose labels are stored in `answer_reverse_mapping.json`.
Lastly, one can choose the Answer Language. This also relies on a saved dictionary, created with the `mtranslate` library, that covers the 3129 answer options.
The top-5 predictions are displayed below, and their confidence scores are shown as a bar plot.
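
As a rough illustration of how top-5 answers and confidence scores can be derived from a classifier's output, here is a minimal sketch in plain Python. It assumes the scores are softmax probabilities over the answer logits and that the mapping file resolves string class indices to answer strings; neither detail is confirmed by this section. The function `top_k_answers`, the toy mapping, and the toy logits are hypothetical stand-ins, not the demo's actual code.

```python
import math


def top_k_answers(logits, index_to_answer, k=5):
    """Return the k most probable (answer, probability) pairs.

    Assumption: index_to_answer maps string class indices to answer text,
    as a dict loaded from a JSON file would (JSON object keys are strings).
    """
    # Numerically stable softmax: subtract the max logit before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(index_to_answer[str(i)], probs[i]) for i in ranked]


# Toy stand-ins for the 3129-way answer mapping and the model's logits.
toy_mapping = {"0": "yes", "1": "no", "2": "2", "3": "red", "4": "dog", "5": "pizza"}
toy_logits = [2.0, 1.0, 0.5, -1.0, 0.0, 3.0]

for answer, prob in top_k_answers(toy_logits, toy_mapping):
    print(f"{answer}: {prob:.3f}")
```

With the real model, the logits would have 3129 entries and the mapping would be read from `answer_reverse_mapping.json`; the returned pairs are the kind of data a bar plot of confidence scores needs.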