	---
	license: gemma
	language:
	- en
	pipeline_tag: image-text-to-text
	---

	# Cerule - A Tiny Mighty Vision Model
	### Based on Google's - <span style="color: #D56c76;">Gemma-2b + SigLIP</span>



	```
	██████╗███████╗██████╗ ██╗ ██╗██╗ ███████╗
	██╔════╝██╔════╝██╔══██╗██║ ██║██║ ██╔════╝
	██║ █████╗ ██████╔╝██║ ██║██║ █████╗
	██║ ██╔══╝ ██╔══██╗██║ ██║██║ ██╔══╝
	╚██████╗███████╗██║ ██║╚██████╔╝███████╗███████╗
	╚═════╝╚══════╝╚═╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝
	```

	We train and release "Cerule", a tiny yet powerful Vision Language Model based on Google's newly released [Gemma-2b](https://huggingface.co/google/gemma-2b) and Google's [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).

	We utilise highly efficient data selection techniques with:
	```
	- Pretraining stage : 650K images (A LAION Subset)
	- Finetuning stage : 695K images (SVIT-mix-665K modified for finetuning(Dataset SOON!))
	```
	Training ran on 4x A100 80GB GPUs and took ~6 hours to pretrain and ~13 hours to finetune. We modify and adapt the training code from [LLaVA](https://github.com/haotian-liu/LLaVA).
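As a rough sense of the compute budget, the wall-clock figures above can be converted to GPU-hours. This is a back-of-the-envelope sketch that assumes all four GPUs were busy for the full duration of each stage; the function and constant names are illustrative only.

```python
# Figures quoted above: 4 GPUs, ~6 h pretraining, ~13 h finetuning.
NUM_GPUS = 4
PRETRAIN_HOURS = 6
FINETUNE_HOURS = 13


def gpu_hours(wall_clock_hours: float, num_gpus: int = NUM_GPUS) -> float:
    """Convert wall-clock training time into GPU-hours.

    Assumption: every GPU is in use for the whole wall-clock duration.
    """
    return wall_clock_hours * num_gpus


pretrain = gpu_hours(PRETRAIN_HOURS)
finetune = gpu_hours(FINETUNE_HOURS)
total = pretrain + finetune
print(f"pretrain ~{pretrain}, finetune ~{finetune}, total ~{total} GPU-hours")
```

Under that assumption the two stages together come to roughly 76 A100 GPU-hours.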

	🚨 Training code, data, and more details will be released soon!


	---
	| Image | Example |
	|-------|---------|
	| ![astronaut](examples/astronaut.png) | Describe the image<br>The image is a playful and surreal depiction of a man in a space suit, sitting on a chair and holding a green beer bottle. The man is wearing a white space suit, complete with a helmet and gloves. His feet are clad in black and white shoes, and he is placed on a sandy surface. The background features a large, blue planet, with a moon and a star visible in the sky. |
	| ![mario](examples/mario.png) | Who are the characters in the image?<br>The image features three characters, two of them are Mario and Luigi, and the third one is Yoshi.<br><br>Describe the actions of the characters<br>The Mario and Luigi characters are holding their arms out, as if they are waving. Yoshi is standing on its own, with its arms folded. |
	| ![extreme_ironing](examples/extreme_ironing.jpg) | What's funny about this image?<br>The image is quite humorous as it depicts a man ironing clothes on the back of a yellow taxi cab. This is not a typical sight you'd expect to see in everyday life. |
	---

	## Loading the model

	```
	pip install -qr https://huggingface.co/Tensoic/Cerule-v0.1/resolve/main/requirements.txt
	```

	```python
	from transformers import AutoModelForCausalLM
	model = AutoModelForCausalLM.from_pretrained("Tensoic/Cerule-v0.1", trust_remote_code=True)
	```
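The loading call above can be wrapped in a small helper that picks a device and precision automatically. This is a minimal sketch, not the authors' recommended setup: the helper names `pick_device_and_dtype` and `load_cerule` are hypothetical, and it assumes the model's remote code runs correctly in half precision on GPU.

```python
def pick_device_and_dtype(cuda_available: bool) -> tuple:
    """Return (device, dtype name): half precision on GPU, full precision on CPU."""
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")


def load_cerule(model_id: str = "Tensoic/Cerule-v0.1"):
    """Load the model onto the best available device (hypothetical helper)."""
    # Heavy imports stay local so the helper above works without them installed.
    import torch
    from transformers import AutoModelForCausalLM

    device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,  # the repo ships custom modelling code
        torch_dtype=getattr(torch, dtype_name),  # assumption: fp16 is supported
    )
    return model.to(device).eval()
```

Calling `load_cerule()` downloads the weights on first use. If half precision causes problems, falling back to `float32` on GPU is the conservative choice.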

	## Training:
	The training code will be released at a later date.

	## Inference:
	Clone the following repo and follow its instructions for CLI-based inference.
	https://github.com/Tensoic-AI/Cerule




	## License
	The model is subject to the Gemma (base model license) terms of use, and the underlying datasets (LAION and SVIT) are subject to their respective licenses. All code is released under Apache 2.0.