LAPVQA โ€” VQA (Captioning-Pretrained Encoder)

Part of the LAPVQA collection.

Description

A VQA task head trained on top of the LAPVQA captioning-pretrained encoder (lapvqa-pretrain-captioning). The encoder is kept frozen, so this checkpoint contains only the VQAHead state dict. The encoder outputs 1024-dimensional patch tokens (ViT-L/14).
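As a rough guide to what "1024-dimensional patch tokens (ViT-L/14)" means for tensor shapes, this minimal sketch computes the patch-token grid for square inputs. The input resolution is not stated in this card, so 224 and 336 pixels are assumed examples only, and the count excludes any CLS token the encoder may also emit.

```python
# Minimal sketch: token-grid arithmetic for a ViT-L/14 encoder.
# Assumption: square input images. The actual LAPVQA input resolution is not
# stated in the card; 224 and 336 are illustrative values only.

PATCH_SIZE = 14    # the "/14" in ViT-L/14: each patch is 14x14 pixels
HIDDEN_DIM = 1024  # ViT-L token width, matching vis_dim=1024 for VQAHead


def patch_token_shape(image_size: int, batch_size: int = 1) -> tuple[int, int, int]:
    """Return (batch, num_patch_tokens, hidden_dim) for a square image, excluding CLS."""
    if image_size % PATCH_SIZE != 0:
        raise ValueError(f"image_size {image_size} is not a multiple of {PATCH_SIZE}")
    per_side = image_size // PATCH_SIZE  # patches along one edge
    return (batch_size, per_side * per_side, HIDDEN_DIM)


print(patch_token_shape(224))     # (1, 256, 1024)
print(patch_token_shape(336, 8))  # (8, 576, 1024)
```

Whatever the resolution, the last dimension stays at 1024, which is why the head is constructed with vis_dim=1024.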

Loading

import torch
from lapvqa.vqa.model import VQAHead

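# The checkpoint is a plain VQAHead state dict (no encoder weights)
ckpt = torch.load("pretrain-captioning_best.pt", map_location="cpu")
head = VQAHead(vis_dim=1024)  # vis_dim must match the encoder's 1024-dim patch tokens
head.load_state_dict(ckpt)
head.eval()  # switch to inference mode
# pair with encoder_final.pt from lapvqa-pretrain-captioning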
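The card does not document the encoder class or the VQAHead forward signature, so the following is only a minimal sketch of the frozen-encoder arrangement, built from hypothetical stand-ins (ToyEncoder, ToyHead) rather than the real lapvqa modules. ToyEncoder is a single ViT-style 14x14 patch embedding that emits 1024-dim tokens; ToyHead mean-pools those tokens into answer logits and, unlike a real VQA head, ignores the question. What the sketch does illustrate is the pattern: freeze the encoder with requires_grad_(False) and run both modules under torch.no_grad() at inference.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the frozen encoder: one 14x14, stride-14 patch embedding."""

    def __init__(self, dim: int = 1024, patch: int = 14):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:  # (B, 3, H, W)
        x = self.patch_embed(images)                # (B, dim, H/14, W/14)
        return x.flatten(2).transpose(1, 2)         # (B, N, dim) patch tokens


class ToyHead(nn.Module):
    """Hypothetical stand-in for VQAHead; the real head also needs question input."""

    def __init__(self, vis_dim: int = 1024, num_answers: int = 10):
        super().__init__()
        self.classifier = nn.Linear(vis_dim, num_answers)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        return self.classifier(patch_tokens.mean(dim=1))  # (B, num_answers)


encoder = ToyEncoder()
head = ToyHead(vis_dim=1024)

# Freeze the encoder, mirroring the card's "encoder is kept frozen" setup.
encoder.requires_grad_(False)
encoder.eval()
head.eval()

images = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    tokens = encoder(images)  # (2, 256, 1024)
    logits = head(tokens)     # (2, 10)
print(tuple(tokens.shape), tuple(logits.shape))
```

With the real checkpoints, the stand-ins would be replaced by the encoder restored from encoder_final.pt and the VQAHead loaded from pretrain-captioning_best.pt; how questions are passed to VQAHead has to be taken from the lapvqa source.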

Collection including dmusingu/lapvqa-vqa-pretrain-captioning