What's the image encoder here (ViT-L or ViT-g)?
#11, opened by ldfandian
Can anyone tell me what the image encoder is here (ViT-L or ViT-g)?
The authors use an EVA-CLIP model as the image encoder, so it is ViT-g rather than ViT-L. In this checkpoint it is a ViT with 39 layers, as seen here: https://huggingface.co/Salesforce/blip2-opt-2.7b/blob/main/config.json#L223
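If you want to check the layer count yourself, here is a minimal sketch using only the standard library. It assumes the encoder depth is stored under `vision_config.num_hidden_layers` in `config.json`. The `CONFIG_EXCERPT` string is a hand-written stand-in for the real file, and `guess_vit_variant` is a hypothetical helper. The full depths it uses (24 blocks for ViT-L, 40 for ViT-g) and the idea that the last block is dropped are assumptions that do not come from this thread, so treat the mapping as a rough guide only.

```python
import json

# Hand-written stand-in for the relevant part of config.json.
# Only the value 39 comes from the thread; the rest is illustrative.
CONFIG_EXCERPT = '{"vision_config": {"num_hidden_layers": 39}}'

# Assumed full depths of the two candidate encoders (not stated in the thread).
ASSUMED_FULL_DEPTH = {"ViT-L": 24, "ViT-g": 40}


def vision_encoder_depth(config_text: str) -> int:
    """Return the number of vision transformer layers listed in the config."""
    config = json.loads(config_text)
    return config["vision_config"]["num_hidden_layers"]


def guess_vit_variant(depth: int) -> str:
    """Guess the variant, assuming the final block was removed from the encoder."""
    for name, full_depth in ASSUMED_FULL_DEPTH.items():
        if depth == full_depth - 1:
            return name
    return "unknown"


depth = vision_encoder_depth(CONFIG_EXCERPT)
print(depth, guess_vit_variant(depth))
```

To run this against the real checkpoint, download `config.json` from the model repository and pass its contents to `vision_encoder_depth` in place of `CONFIG_EXCERPT`.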
nielsr changed discussion status to closed