What's the image encoder here (ViT-L or ViT-g)?

by ldfandian - opened

Can anyone tell me what the image encoder is here (ViT-L or ViT-g)?

The authors use an EVA-CLIP model as the image encoder. It is a ViT with 39 layers, as seen in the config here: https://huggingface.co/Salesforce/blip2-opt-2.7b/blob/main/config.json#L223
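
If you want to check this kind of thing yourself, here is a minimal sketch of reading the vision encoder settings out of a `config.json`. It assumes the file has a `vision_config` section, as the linked one does. The `SAMPLE_CONFIG` excerpt and the `summarize_vision_encoder` helper are hypothetical names made up for illustration; only `num_hidden_layers: 39` comes from this thread, and the other values are assumptions that you should verify against the real file.

```python
import json

# Hypothetical excerpt of a BLIP-2 style config.json.
# Only num_hidden_layers=39 is taken from the discussion above;
# hidden_size and patch_size are assumed values for illustration.
SAMPLE_CONFIG = json.loads("""
{
  "vision_config": {
    "hidden_size": 1408,
    "num_hidden_layers": 39,
    "patch_size": 14
  }
}
""")

# Fields that usually help tell ViT variants apart.
FIELDS = ("num_hidden_layers", "hidden_size", "patch_size")


def summarize_vision_encoder(config: dict) -> dict:
    """Return the selected vision encoder fields, or None for missing ones."""
    vision = config.get("vision_config", {})
    return {key: vision.get(key) for key in FIELDS}


summary = summarize_vision_encoder(SAMPLE_CONFIG)
print(summary)
```

To run it on the real model, download `config.json` from the repo and pass the result of `json.load(...)` on that file to the helper instead of `SAMPLE_CONFIG`.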

nielsr changed discussion status to closed
