ConvLLaVA-JP Model Card

This is a pretrained checkpoint; you can use it to instruction-tune your own multimodal model.
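As a minimal sketch, the pretrained weights can be fetched and loaded as a state dict, e.g. to initialize a model before instruction tuning. The file name model.safetensors follows the usual Hugging Face convention and is an assumption here, not something this card confirms.

```python
# Minimal sketch: download the pretrained checkpoint and load it as a
# state dict, e.g. to initialize a model before instruction tuning.
# The file name "model.safetensors" is the usual Hugging Face convention
# and is an assumption, not confirmed by this card.
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

ckpt_path = hf_hub_download(
    repo_id="toshi456/ConvLLaVA-JP-1.3b-768-Pretrain",
    filename="model.safetensors",
)
state_dict = load_file(ckpt_path)  # mapping: parameter name -> torch.Tensor
print(f"loaded {len(state_dict)} tensors")
```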

Check out the instructions here.

Model details

Model type: ConvLLaVA-JP is a vision-language model that can converse about input images.
This model is an LVLM trained using laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft as the image encoder and llm-jp/llm-jp-1.3b-v1.0 as the text decoder, for a combined size of roughly 2.1B parameters. It supports high-resolution input images of 768 × 768 pixels.
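For illustration, the two backbones named above can be instantiated on their own as sketched below. The projector that ConvLLaVA-JP trains to connect them lives in the accompanying training code and is not reproduced here; whether these exact loading calls match the training setup is an assumption.

```python
# Minimal sketch of the two backbones named above, loaded independently.
# ConvLLaVA-JP joins them with a trained projector; that wiring is part
# of the training code and is not shown here.
import torch
import open_clip
from transformers import AutoModelForCausalLM, AutoTokenizer

# Image encoder: ConvNeXt-Large CLIP from LAION, via open_clip.
# Its default preprocessing targets the original 320 px training size;
# ConvLLaVA-JP instead feeds the hierarchical ConvNeXt 768 x 768 inputs.
vision_model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft"
)

# Text decoder: the 1.3B-parameter Japanese LLM from LLM-jp.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-1.3b-v1.0")
text_model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-1.3b-v1.0", torch_dtype=torch.bfloat16
)
```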

Training dataset

Acknowledgement

License

Apache-2.0

