---
license: apache-2.0
datasets:
- damerajee/Llava-pretrain-small
language:
- en
library_name: transformers
tags:
- Vision Language Model
---

# GPT-Vision

A very small vision-language model in the spirit of LLaVA and Moondream. It combines three components into a single model (a rough sketch of how they fit together follows the list below):

* GPT-2 (language model)
* ViT-224 (vision encoder)
* Multimodal projector

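The multimodal projector is what ties the other two parts together: patch features from the ViT encoder are mapped into GPT-2's embedding space and placed in front of the text tokens, LLaVA-style, so the language model can attend to the image. The real implementation lives in the GitHub repository linked below; the snippet here is only a minimal sketch of the idea, and the class name and dimensions (`VisionProjector`, `vit_dim=768`, `gpt2_dim=768`) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Illustrative LLaVA-style projector: ViT patch features -> GPT-2 embedding space.

    The 768-dim sizes are assumptions for this sketch (ViT-Base/224 and GPT-2
    small both use 768-dim hidden states); see the GitHub repo for the actual module.
    """

    def __init__(self, vit_dim: int = 768, gpt2_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, gpt2_dim),
            nn.GELU(),
            nn.Linear(gpt2_dim, gpt2_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vit_dim) from the ViT encoder
        return self.proj(patch_features)


# The projected image tokens are concatenated in front of the GPT-2 token
# embeddings before decoding, e.g.:
#   inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
```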
Check the GitHub repository for more information: [GPT-Vision-Github](https://github.com/dame-cell/GPT-Vision-1)

# Inference

```python
from transformers import AutoModelForCausalLM
from PIL import Image

# trust_remote_code=True is required because the model uses custom modeling code
model = AutoModelForCausalLM.from_pretrained("damerajee/GPT-Vision", trust_remote_code=True)

# Load the image and make sure it is RGB
image_path = "Your_image_path"
image = Image.open(image_path)
image = image.convert('RGB')

question = "Render a clear and concise summary of the photo."
answer = model.generate(image=image, question=question, max_new_tokens=40)
print("Answer:", answer)
```

# Limitations

Fair warning: this model can only generate very short responses, and it sometimes gets stuck repeating the same tokens. Even so, it generally understands what is in the image.

Further fine-tuning will make this model better.