Model Card for cerebras/Cerebras-LLaVA-13B

This checkpoint contains the language model and projector weights of the multimodal LLaVA-13B model, trained with our Cerebras implementation and training recipe. The vision encoder checkpoint for this model is available at cerebras/Cerebras-ViT-L-336-patch14-llava13b-ShareGPT4V.

Note: "ShareGPT4V" is appended to the vision model name so that the checkpoint loads correctly with the LLaVA source repository.
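As we read the LLaVA source, its vision-tower builder only instantiates a CLIP vision tower when the configured name is an existing local path, starts with "openai" or "laion", or contains "ShareGPT4V". The function below is a hypothetical, simplified paraphrase of that check (the name `is_supported_vision_tower` is ours, not LLaVA's); consult the LLaVA source for the exact condition.

```python
import os


def is_supported_vision_tower(name: str) -> bool:
    """Hypothetical paraphrase of the name check in LLaVA's vision-tower builder.

    Assumption: LLaVA accepts a CLIP vision tower if the name is an existing
    local path, starts with "openai" or "laion", or contains "ShareGPT4V".
    """
    return (
        os.path.exists(name)
        or name.startswith("openai")
        or name.startswith("laion")
        or "ShareGPT4V" in name
    )


# The published vision encoder name passes the check because of its suffix.
print(is_supported_vision_tower("cerebras/Cerebras-ViT-L-336-patch14-llava13b-ShareGPT4V"))

# Without the suffix, a Hub name like this would be rejected
# (unless a local directory with the same relative path happens to exist).
print(is_supported_vision_tower("cerebras/Cerebras-ViT-L-336-patch14-llava13b"))
```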

For full details of the model and its training, please read our upcoming blog post.

License

Model Architecture

Cerebras-LLaVA-13B is a transformer model with the following architecture:

Vision encoder: CLIP-VisionModel-Large. It processes 336 x 336 images with a patch size of 14.
Large Language Model: Pretrained from Vicuna-13B checkpoints and instruction-finetuned on various datasets.
Projector: The projector module that connects the LLM and the vision encoder consists of two linear layers with GELU activation (mlp2x-gelu).
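The sketch below makes these numbers concrete under stated assumptions. With 336 x 336 inputs and 14-pixel patches, each image yields a 24 x 24 grid of patches; assuming, as in LLaVA's default setting, that only patch features (no CLS token) reach the projector, that is 576 visual tokens. The projector layout Linear(vision_dim, llm_dim), GELU, Linear(llm_dim, llm_dim) and the hidden sizes (1024 for CLIP ViT-L features, 5120 for Vicuna-13B) are our assumptions based on the standard configurations of those models. A tiny pure-Python forward pass with toy dimensions illustrates the projector itself; all function names are hypothetical.

```python
import math


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (visual tokens) per image."""
    per_side = image_size // patch_size
    return per_side * per_side


def mlp2x_gelu_params(vision_dim: int, llm_dim: int) -> int:
    """Parameter count of Linear(vision_dim, llm_dim) -> GELU -> Linear(llm_dim, llm_dim).

    Assumed layout; GELU itself has no parameters.
    """
    first = vision_dim * llm_dim + llm_dim  # weights + bias
    second = llm_dim * llm_dim + llm_dim    # weights + bias
    return first + second


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute y = W x + b, with the weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp2x_gelu(x, w1, b1, w2, b2):
    """Project one vision feature vector into the LLM embedding space."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Assumed sizes: 1024-d CLIP ViT-L features, 5120-d Vicuna-13B embeddings.
print(num_patches(336, 14))           # 576 visual tokens per image
print(mlp2x_gelu_params(1024, 5120))  # 31467520 projector parameters

# Toy example: project a 2-d "vision feature" into a 3-d "LLM embedding".
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b2 = [0.0, 0.0, 0.0]
print(mlp2x_gelu([0.5, -0.5], w1, b1, w2, b2))
```

Under these assumptions the projector is small (about 31.5M parameters) relative to the 13B-parameter language model.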

Loading the model

This model can be loaded directly with the LLaVA source code repository; for installation, follow the instructions in that repository. All our evaluations use the evaluation scripts from the LLaVA source code repository.

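# Requires the LLaVA source code repository to be installed (see its README).
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "cerebras/Cerebras-LLaVA-13B"

# Returns the tokenizer, the LLaVA model, the image processor for the vision
# encoder, and the model's maximum context length.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,  # full LLM + projector weights, so no separate base model
    model_name=get_model_name_from_path(model_path),
)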
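The `model_name` argument above is not cosmetic. As we read the LLaVA source, `get_model_name_from_path` returns the last path component (joined with its parent for `checkpoint-*` directories), and `load_pretrained_model` uses the multimodal LLaVA loader only when that name contains "llava" (case-insensitive). The sketch below is our hypothetical paraphrase of that logic, using helper names of our own rather than LLaVA's code; it shows why the repository name `Cerebras-LLaVA-13B` is routed to the multimodal loader.

```python
def model_name_from_path(model_path: str) -> str:
    """Hypothetical paraphrase of llava.mm_utils.get_model_name_from_path."""
    parts = model_path.strip("/").split("/")
    if parts[-1].startswith("checkpoint-"):
        # Training checkpoints are disambiguated by their parent directory.
        return parts[-2] + "_" + parts[-1]
    return parts[-1]


def loads_as_llava(model_name: str) -> bool:
    """Assumed routing rule: the multimodal loader is used if "llava" is in the name."""
    return "llava" in model_name.lower()


name = model_name_from_path("cerebras/Cerebras-LLaVA-13B")
print(name, loads_as_llava(name))  # Cerebras-LLaVA-13B True
```

Under this reading, if you copy the checkpoint to a local directory, keeping "llava" in the directory name preserves the same loading path.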

Intended Use

Primary intended uses: The primary use of LLaVA is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers (both academic and industry) in computer vision, natural language processing, machine learning, and artificial intelligence.

Limitations and Bias

The pre-training dataset may have contained offensive or inappropriate content, even after applying data cleansing filters, and this can be reflected in the model-generated text. We recommend that users exercise caution when using this model for their applications or for any use case that may cause deliberate or unintentional harm to others. This model is for demonstration purposes only.

Acknowledgements

We are thankful to all Cerebras engineers that made this work possible.
