Which Vision Encoder was used here?

by floschne - opened

Do you have any information about the exact vision encoder which was used?

Llava Hugging Face org
edited Mar 18


The CLIP vision encoder by OpenAI was used, as can be seen in the original implementation.

For BakLLaVA specifically, the encoder is openai/clip-vit-large-patch14-336.
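The model ID itself describes the encoder: a ViT-Large backbone that cuts the input image into 14×14-pixel patches at a 336-pixel input resolution. The sketch below reads those numbers out of the ID and computes the resulting patch grid. The function name `parse_clip_model_id` is hypothetical, and the 224-pixel fallback for IDs without a resolution suffix is an assumption based on the standard CLIP checkpoints. It only does string parsing and does not load any weights.

```python
import re


def parse_clip_model_id(model_id: str) -> dict:
    """Extract backbone size, patch size, and input resolution from a CLIP model ID.

    Hypothetical helper. Assumes the naming pattern
    'clip-vit-<size>-patch<P>[-<R>]'; when no resolution suffix is
    present, 224 px is assumed.
    """
    match = re.search(r"clip-vit-(\w+)-patch(\d+)(?:-(\d+))?$", model_id)
    if match is None:
        raise ValueError(f"unrecognized CLIP model ID: {model_id}")
    size, patch_str, res_str = match.groups()
    patch_size = int(patch_str)
    image_size = int(res_str) if res_str else 224  # assumed default resolution
    grid = image_size // patch_size  # patches along each side of the image
    return {
        "size": size,
        "patch_size": patch_size,
        "image_size": image_size,
        "grid": grid,
        "num_patches": grid * grid,  # patch tokens, excluding the class token
    }


if __name__ == "__main__":
    info = parse_clip_model_id("openai/clip-vit-large-patch14-336")
    print(info)  # 336 // 14 = 24, so a 24 x 24 grid of 576 patches
```

For openai/clip-vit-large-patch14-336, this gives a 24×24 grid, i.e. 576 patch tokens per image. The ViT's output sequence also includes one class token on top of these.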
