batch inference

#1
by luckylight - opened

Hello, I want to ask: how can I run inference in batches?
Since the ViltProcessor can't encode texts longer than 40 tokens, I truncate them to 40 (if I don't, ViltForImageAndTextRetrieval doesn't work!).
But the processed texts shorter than 40 tokens come back without padding, so I can't stack them into a single batch!
Is there any solution to this problem? Thanks!

# Text truncation code
import torch

encoding = processor(image, text, return_tensors="pt")
# ViLT only supports text sequences up to 40 tokens, so truncate longer ones,
# first moving the final [SEP] token (and its mask/type id) to position 39.
if encoding['input_ids'].shape[1] > 40:
    for key in ('input_ids', 'token_type_ids', 'attention_mask'):
        encoding[key][0, 39] = encoding[key][0, -1]
        encoding[key] = encoding[key][:, :40]
# Reformat it as a batch: append this example along the batch dimension.
cur_batch_data = {key: torch.cat([batched, encoding[key]], dim=0)
                  for key, batched in cur_batch_data.items()}
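
For reference, here is one way the accumulation above could be driven; the loop and the `dataset` variable are assumptions for illustration, since the original post doesn't show how `cur_batch_data` is initialized:

cur_batch_data = None
for image, text in dataset:  # hypothetical iterable of (image, text) pairs
    encoding = processor(image, text, return_tensors="pt")
    # ... truncate to 40 tokens as above ...
    if cur_batch_data is None:
        # First example: the encoding itself becomes the initial batch.
        cur_batch_data = dict(encoding)
    else:
        cur_batch_data = {key: torch.cat([batched, encoding[key]], dim=0)
                          for key, batched in cur_batch_data.items()}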

If this problem can't be solved, I'll have to evaluate ViLT's mAP metric with batch size 1. To be honest, that is very, very slow. Can anyone help me?

luckylight changed discussion status to closed

You can simply use BertTokenizerFast and ViltImageProcessor to encode the texts and images separately, with all the benefits of batch encoding and the ability to set the padding and truncation parameters yourself.
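
For example, a minimal sketch of that approach (the checkpoint name, the example texts, and the image file names are placeholders, not from the original post):

import torch
from PIL import Image
from transformers import BertTokenizerFast, ViltImageProcessor, ViltForImageAndTextRetrieval

checkpoint = "dandelin/vilt-b32-finetuned-coco"  # placeholder retrieval checkpoint
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
image_processor = ViltImageProcessor.from_pretrained(checkpoint)
model = ViltForImageAndTextRetrieval.from_pretrained(checkpoint)

texts = ["a cat sitting on a couch", "two dogs playing in a park"]  # placeholder texts
images = [Image.open("cat.jpg"), Image.open("dogs.jpg")]            # placeholder images

# Pad/truncate every text to exactly 40 tokens (ViLT's maximum text length)
# so the whole batch has a uniform shape.
text_inputs = tokenizer(texts, padding="max_length", truncation=True,
                        max_length=40, return_tensors="pt")
# The image processor pads the images in the batch and returns a pixel_mask.
image_inputs = image_processor(images, return_tensors="pt")

with torch.no_grad():
    outputs = model(input_ids=text_inputs["input_ids"],
                    attention_mask=text_inputs["attention_mask"],
                    token_type_ids=text_inputs["token_type_ids"],
                    pixel_values=image_inputs["pixel_values"],
                    pixel_mask=image_inputs["pixel_mask"])
scores = outputs.logits[:, 0]  # one image-text matching score per pair

Because you call the tokenizer directly, you control padding and truncation explicitly instead of relying on ViltProcessor's defaults, which is what makes uniform batches possible.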
