---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
---
# Cerule - A Tiny Mighty Vision Model
### Based on Google's <span style="color: #D56c76;">Gemma-2b + SigLIP</span>
We train and release "Cerule", a tiny yet powerful Vision Language Model based on Google's newly released [Gemma-2b](https://huggingface.co/google/gemma-2b) and [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).
We utilise highly efficient data selection techniques with:
```
- Pretraining stage : 650K images (A LAION 2M Subset)
- Finetuning stage : 695K images (SVIT-mix-665K modified for finetuning; dataset coming soon!)
```
Training ran on 4x A100 80GB GPUs and took ~6 hours to pretrain and ~13 hours to finetune. We modify and adapt the training code from [LLaVA](https://github.com/haotian-liu/LLaVA).
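As a rough illustration of the compute budget, the figures above can be converted to GPU-hours. This is a minimal sketch that assumes the quoted wall-clock times are per stage and that all four GPUs were busy throughout; the times are approximate, so the totals are too.

```python
# Approximate figures quoted above: 4 GPUs, ~6 h pretraining, ~13 h finetuning.
NUM_GPUS = 4
PRETRAIN_HOURS = 6
FINETUNE_HOURS = 13


def gpu_hours(num_gpus: int, wall_clock_hours: float) -> float:
    """GPU-hours = number of GPUs x wall-clock hours (assumes full utilisation)."""
    return num_gpus * wall_clock_hours


pretrain = gpu_hours(NUM_GPUS, PRETRAIN_HOURS)
finetune = gpu_hours(NUM_GPUS, FINETUNE_HOURS)
total = pretrain + finetune

print(f"pretrain ~{pretrain} GPU-hours, finetune ~{finetune} GPU-hours, total ~{total} GPU-hours")
```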
🚨 Training code, data and more details will be released soon!
---
| Image | Example |
|-------|---------|
|  | **Describe the image**<br>The image is a playful and surreal depiction of a man in a space suit, sitting on a chair and holding a green beer bottle. The man is wearing a white space suit, complete with a helmet and gloves. His feet are clad in black and white shoes, and he is placed on a sandy surface. The background features a large, blue planet, with a moon and a star visible in the sky. |
|  | **Who are the characters in the image?**<br>The image features three characters, two of them are Mario and Luigi, and the third one is Yoshi.<br><br>**Describe the actions of the characters**<br>The Mario and Luigi characters are holding their arms out, as if they are waving. Yoshi is standing on its own, with its arms folded. |
|  | **What's funny about this image?**<br>The image is quite humorous as it depicts a man ironing clothes on the back of a yellow taxi cab. This is not a typical sight you'd expect to see in everyday life. |
---
## Training and Inference:
The training code will be released at a later date.
### Inference:
**Please note that running the inference code at this stage may result in errors**. The proper code for training and inference will be released soon!
Before running the snippet, you need to install the following dependencies:
```shell
pip install torch transformers accelerate pillow
```
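To confirm the install worked before loading the model, a small check like the following may help. It is a sketch using only the Python standard library; the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of Cerule or `transformers`.

```python
import importlib.util

# pip package name -> import name. Note that `pillow` is imported as `PIL`.
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose top-level module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
print("missing packages:", ", ".join(missing) if missing else "none")
```

If anything is listed as missing, rerun the `pip install` command above for those packages.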
```python
import warnings

import torch
import transformers
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Silence library logging, progress bars, and Python warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# Default device for newly created tensors
torch.set_default_device('cuda')  # or 'cpu'

# trust_remote_code=True lets transformers run the custom model code shipped in the repository
model = AutoModelForCausalLM.from_pretrained(
    'Tensoic/Cerule',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'Tensoic/Cerule',
    trust_remote_code=True)

# text prompt
prompt = 'Who are these characters?'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"

# Tokenize the text on either side of <image> and splice in -200 as the image placeholder id
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# Load and preprocess the image
image = Image.open('examples/mario.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=False)[0]  # keep use_cache=False, otherwise it may hit a torch dimension error

# Decode only the newly generated tokens (everything after the prompt)
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=False).strip())
```
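The prompt handling in the snippet above splits the text around `<image>` and places the sentinel id `-200` where the image belongs. The sketch below isolates that step so it can be tried without downloading the model. It assumes a hypothetical whitespace tokenizer (`toy_tokenize`) in place of the real one, and `splice_image_token` is an illustrative helper, not a function provided by Cerule.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel from the snippet above; not a real vocabulary id


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one positive id per whitespace-separated word."""
    return [sum(ord(ch) for ch in word) for word in text.split()]


def splice_image_token(text, tokenize, placeholder="<image>"):
    """Tokenize the text on each side of the placeholder and join the halves with the sentinel.

    Raises ValueError unless the text contains exactly one placeholder.
    """
    before, after = text.split(placeholder)
    return tokenize(before) + [IMAGE_TOKEN_INDEX] + tokenize(after)


ids = splice_image_token(
    "USER: <image>\nWho are these characters? ASSISTANT:", toy_tokenize)
print(ids)
```

With the real tokenizer, `tokenize` would be something like `lambda s: tokenizer(s).input_ids`, which matches what the inference snippet does inline.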
## License
Apache 2.0