batch inference

#1
by luckylight - opened

Hello, I want to ask: how can I run inference in batches?
Since the ViltProcessor can't encode texts longer than 40 tokens, I truncate them to 40 (if I don't, ViltForImageAndTextRetrieval doesn't work!).
But the processed texts shorter than 40 tokens come back without padding, so I can't stack them into a single batch!
Is there any solution to this problem? Thanks!

# Text truncation code
import torch

encoding = processor(image, text, return_tensors="pt")
# ViLT only supports text sequences up to 40 tokens, so truncate longer ones,
# first moving the final [SEP] token (and its mask/type id) to position 39.
if encoding['input_ids'].shape[1] > 40:
    for key in ('input_ids', 'token_type_ids', 'attention_mask'):
        encoding[key][0, 39] = encoding[key][0, -1]
        encoding[key] = encoding[key][:, :40]
# Reformat it as a batch: append this example along the batch dimension.
cur_batch_data = {key: torch.cat([batched, encoding[key]], dim=0)
                  for key, batched in cur_batch_data.items()}
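
For reference, here is one way the accumulation above could be driven; the loop and the `dataset` variable are assumptions for illustration, since the original post doesn't show how `cur_batch_data` is initialized:

cur_batch_data = None
for image, text in dataset:  # hypothetical iterable of (image, text) pairs
    encoding = processor(image, text, return_tensors="pt")
    # ... truncate to 40 tokens as above ...
    if cur_batch_data is None:
        # First example: the encoding itself becomes the initial batch.
        cur_batch_data = dict(encoding)
    else:
        cur_batch_data = {key: torch.cat([batched, encoding[key]], dim=0)
                          for key, batched in cur_batch_data.items()}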

If this problem can't be solved, I'll have to evaluate ViLT's mAP metric with batch size 1. To be honest, that is very, very slow. Can anyone help me?

luckylight changed discussion status to closed

You can simply use BertTokenizerFast and ViltImageProcessor to encode the texts and images separately, with all the benefits of batch encoding and the ability to set the padding and truncation parameters yourself.
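
For example, a minimal sketch of that approach (the checkpoint name, the example texts, and the image file names are placeholders, not from the original post):

import torch
from PIL import Image
from transformers import BertTokenizerFast, ViltImageProcessor, ViltForImageAndTextRetrieval

checkpoint = "dandelin/vilt-b32-finetuned-coco"  # placeholder retrieval checkpoint
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
image_processor = ViltImageProcessor.from_pretrained(checkpoint)
model = ViltForImageAndTextRetrieval.from_pretrained(checkpoint)

texts = ["a cat sitting on a couch", "two dogs playing in a park"]  # placeholder texts
images = [Image.open("cat.jpg"), Image.open("dogs.jpg")]            # placeholder images

# Pad/truncate every text to exactly 40 tokens (ViLT's maximum text length)
# so the whole batch has a uniform shape.
text_inputs = tokenizer(texts, padding="max_length", truncation=True,
                        max_length=40, return_tensors="pt")
# The image processor pads the images in the batch and returns a pixel_mask.
image_inputs = image_processor(images, return_tensors="pt")

with torch.no_grad():
    outputs = model(input_ids=text_inputs["input_ids"],
                    attention_mask=text_inputs["attention_mask"],
                    token_type_ids=text_inputs["token_type_ids"],
                    pixel_values=image_inputs["pixel_values"],
                    pixel_mask=image_inputs["pixel_mask"])
scores = outputs.logits[:, 0]  # one image-text matching score per pair

Because you call the tokenizer directly, you control padding and truncation explicitly instead of relying on ViltProcessor's defaults, which is what makes uniform batches possible.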
