---
license: apache-2.0
datasets:
- damerajee/Llava-pretrain-small
language:
- en
library_name: transformers
tags:
- Vision Language Model
---

# GPT-Vision

A very small vision-language model in the spirit of LLaVA and Moondream. It combines three components into a single model (a rough sketch of how they fit together follows the list below):

* GPT-2 (language model)
* ViT-224 (vision encoder)
* Multimodal projector

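The multimodal projector is what ties the other two parts together: patch features from the ViT encoder are mapped into GPT-2's embedding space and placed in front of the text tokens, LLaVA-style, so the language model can attend to the image. The real implementation lives in the GitHub repository linked below; the snippet here is only a minimal sketch of the idea, and the class name and dimensions (`VisionProjector`, `vit_dim=768`, `gpt2_dim=768`) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Illustrative LLaVA-style projector: ViT patch features -> GPT-2 embedding space.

    The 768-dim sizes are assumptions for this sketch (ViT-Base/224 and GPT-2
    small both use 768-dim hidden states); see the GitHub repo for the actual module.
    """

    def __init__(self, vit_dim: int = 768, gpt2_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, gpt2_dim),
            nn.GELU(),
            nn.Linear(gpt2_dim, gpt2_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vit_dim) from the ViT encoder
        return self.proj(patch_features)


# The projected image tokens are concatenated in front of the GPT-2 token
# embeddings before decoding, e.g.:
#   inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
```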
Check the GitHub repository for more information: [GPT-Vision-Github](https://github.com/dame-cell/GPT-Vision-1)

# Inference

```python
from transformers import AutoModelForCausalLM
from PIL import Image

# trust_remote_code=True is required because the model uses custom modeling code
model = AutoModelForCausalLM.from_pretrained("damerajee/GPT-Vision", trust_remote_code=True)

# Load the image and make sure it is RGB
image_path = "Your_image_path"
image = Image.open(image_path)
image = image.convert('RGB')

question = "Render a clear and concise summary of the photo."
answer = model.generate(image=image, question=question, max_new_tokens=40)
print("Answer:", answer)
```

# Limitations

Fair warning: this model can only generate very short responses, and it sometimes gets stuck repeating the same tokens. Even so, it generally understands what is in the image.

Further fine-tuning will make this model better.