BLIVA Model Card

Model details

Model type: BLIVA is an open-source Vision-Language model trained by initializing from InstructBLIP and aligning it with Vicuna on multimodal instruction-finetuning data. It consists of an EVA-CLIP vision encoder, a Q-Former, a projection layer, and an auto-regressive language model based on the decoder-only transformer architecture.
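
As a rough illustration of how these components fit together, below is a minimal sketch of the forward pass assuming a PyTorch-style interface; the module names, dimensions, and dummy stand-ins are hypothetical and do not reflect the released code.

```python
import torch
import torch.nn as nn


class BLIVASketch(nn.Module):
    """Minimal sketch of the component wiring described above.
    Names and dimensions are illustrative, not the released implementation."""

    def __init__(self, vision_encoder, qformer, llm, vision_dim=1408, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. an EVA-CLIP image encoder
        self.qformer = qformer                 # Q-Former producing query embeddings
        self.llm = llm                         # decoder-only LM (e.g. Vicuna)
        # projection layer mapping visual features into the LLM embedding space
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image, text_embeds):
        patch_feats = self.vision_encoder(image)   # [B, N_patches, vision_dim]
        query_feats = self.qformer(patch_feats)    # [B, N_queries, vision_dim]
        visual_tokens = self.proj(query_feats)     # [B, N_queries, llm_dim]
        # prepend projected visual tokens to the text embeddings and decode
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)


# Dummy stand-ins just to exercise the wiring (the real model uses EVA-CLIP,
# a Q-Former, and Vicuna loaded from their respective checkpoints).
if __name__ == "__main__":
    B, n_patches, n_queries, vision_dim, llm_dim = 1, 257, 32, 1408, 4096
    vision_encoder = lambda image: torch.randn(B, n_patches, vision_dim)
    qformer = lambda feats: feats[:, :n_queries, :]
    llm = lambda inputs_embeds: inputs_embeds.mean()
    model = BLIVASketch(vision_encoder, qformer, llm, vision_dim, llm_dim)
    out = model(torch.zeros(B, 3, 224, 224), torch.randn(B, 16, llm_dim))
```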

Model date: BLIVA_Vicuna was trained in July 2023.

Paper or resources for more information: https://gordonhu608.github.io/bliva/

License: Non-commercial bespoke license

Where to send questions or comments about the model: https://github.com/mlpc-ucsd/BLIVA

Intended use

Primary intended uses: The primary use of BLIVA is research on large multimodal models.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

Pre-train data: 558K filtered image-text pairs from LAION, CC-3M, and SBU, as selected by LLaVA.

Instruction-finetuning data: COCO-Caption, TextCaps, VQAv2, OKVQA, A-OKVQA, LLaVA-150K, OCR-VQA.

Evaluation dataset

For zero-shot evaluation on general image tasks, we selected NoCaps, Flickr30K, VizWiz, Visual Spatial Reasoning (VSR), IconQA, Visual Dialog, ScienceQA, MSRVTT QA, TextVQA, and Hateful Memes.

For zero-shot evaluation on text-rich image OCR tasks, we selected ST-VQA, OCR-VQA, Text-VQA, and Doc-VQA.

More details are in our GitHub repository: https://github.com/mlpc-ucsd/BLIVA
