---
license: gemma
language:
- en
pipeline_tag: image-text-to-text
---
# Cerule - A Tiny Mighty Vision Model
### Based on Google's Gemma-2b + SigLIP
We train and release "Cerule", a tiny yet powerful Vision Language Model based on Google's newly released Gemma-2b and Google's SigLIP.
We utilise highly efficient data selection techniques with:
- Pretraining stage: 650K images (a LAION subset)
- Finetuning stage: 695K images (SVIT-mix-665K, modified for finetuning; dataset to be released soon!)
Training ran on 4x A100 (80 GB) GPUs and took ~6 hours for pretraining and ~13 hours for finetuning. We modified and adapted the training code from LLaVA.
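As a rough illustration of the LLaVA-style setup described above, here is a minimal sketch of how a SigLIP vision tower can be wired into Gemma-2b through a small MLP projector. The exact SigLIP checkpoint, the module names, and the projector shape are assumptions for illustration only; the training code we release is the authoritative reference.

```python
# A conceptual sketch of a LLaVA-style VLM: SigLIP patch features are
# projected into Gemma's embedding space and prepended to the text tokens.
# Checkpoint names and the two-layer MLP projector are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel

class CeruleSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Vision tower producing one feature vector per image patch.
        self.vision_tower = SiglipVisionModel.from_pretrained(
            "google/siglip-so400m-patch14-384"  # assumed SigLIP checkpoint
        )
        self.language_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
        vision_dim = self.vision_tower.config.hidden_size  # 1152 for so400m
        text_dim = self.language_model.config.hidden_size  # 2048 for Gemma-2b
        # LLaVA-1.5-style two-layer MLP projector into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, pixel_values, input_ids):
        patches = self.vision_tower(pixel_values).last_hidden_state  # (B, N, vision_dim)
        image_embeds = self.projector(patches)                       # (B, N, text_dim)
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        # Image tokens are simply prepended here; real LLaVA-style code
        # splices them in at an <image> placeholder in the prompt.
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```

In the usual LLaVA recipe, pretraining updates only the projector, and finetuning then unfreezes the language model as well.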
🚨 Training code, data and more details to release soon!
## Training

We will release the training code shortly.
## Inference

Clone the following repository and follow the instructions there for CLI-based inference: https://github.com/Tensoic-AI/Cerule
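Until the CLI docs land, here is a hedged sketch of loading the checkpoint directly with transformers. The Hub id `Tensoic/Cerule` and the `trust_remote_code` loading path are assumptions, and image preprocessing (handled by the repository's helpers) is omitted, so treat the repository CLI as the supported route.

```python
# A minimal, speculative loading sketch; the Hub id and remote-code path
# are assumptions, and image inputs are omitted for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tensoic/Cerule"  # assumed Hub id
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # LLaVA-style models usually ship custom code
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

prompt = "Describe the image."  # the repo's CLI injects image features here
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```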
## License

The model is subject to the Gemma terms of use (the base model's license), and the underlying datasets (LAION and SVIT) are subject to their respective licenses. All code is released under Apache 2.0.