# pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b Model Card

**Note:** This is the pretrained model used for OLA-VLM-CLIP-ViT-Llama3-8b.

OLA-VLM distills visual information from a set of target encoders into the intermediate representations of the LLM. During training, it applies predictive embedding optimization at selected LLM layers, minimizing embedding losses alongside the next-token prediction (NTP) objective, which yields a vision-centric approach to training the Multimodal Large Language Model.
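As a rough illustration of this objective, the sketch below combines the NTP loss with auxiliary embedding losses at a few chosen layers. It is a minimal sketch, not the official implementation: the function and variable names (`ola_vlm_loss`, `probes`, `target_embeds`) are hypothetical, and an MSE embedding loss over mean-pooled hidden states stands in for whatever probe design and loss terms OLA-VLM actually uses.

```python
import torch
import torch.nn.functional as F

def ola_vlm_loss(llm_outputs, target_embeds, probes, selected_layers, alpha=0.5):
    """Combine the NTP loss with embedding losses at selected LLM layers.

    llm_outputs: causal-LM output exposing .loss and .hidden_states
    target_embeds: {layer_idx: frozen embedding from a target visual encoder}
    probes: {layer_idx: small head predicting the target embedding}
    """
    ntp_loss = llm_outputs.loss  # standard next-token prediction objective
    emb_loss = torch.zeros((), device=ntp_loss.device)
    for layer in selected_layers:
        # Pool the layer's token representations, then predict the target embedding.
        hidden = llm_outputs.hidden_states[layer].mean(dim=1)
        emb_loss = emb_loss + F.mse_loss(probes[layer](hidden), target_embeds[layer])
    return ntp_loss + alpha * emb_loss
```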

## Citation

If you find our work useful, please consider starring ⭐ us on GitHub and citing 📚 us in your research!

```bibtex
@article{jain2024ola_vlm,
  title={{OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation}},
  author={Jitesh Jain and Zhengyuan Yang and Humphrey Shi and Jianfeng Gao and Jianwei Yang},
  journal={arXiv},
  year={2024}
}
```
## Model Details

- Format: Safetensors
- Model size: 8.55B params
- Tensor types: F32, BF16
## Inference

The serverless Inference API does not yet support transformers models for this pipeline type, so run the checkpoint locally.
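Below is a minimal local-loading sketch using Hugging Face `transformers`. Whether this repository loads through `AutoModelForCausalLM` is an assumption; the official OLA-VLM codebase may be required instead.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b"

# Assumption: the repo ships custom modeling code usable via trust_remote_code;
# if not, load the checkpoint through the official OLA-VLM repository instead.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16 weights are provided
    device_map="auto",
    trust_remote_code=True,
)
```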
