---
license: apache-2.0
datasets:
- damerajee/Llava-pretrain-small
language:
- en
library_name: transformers
tags:
- Vision Language Model
---
# GPT-Vision

A very small Vision-Language Model, like LLaVA and Moondream.

This model combines three components into one (a minimal sketch of how they fit together appears at the end of this card):

* GPT2
* ViT-224
* Multimodality projector

Check the GitHub repository for more information: [GPT-Vision-Github](https://github.com/dame-cell/GPT-Vision-1)

# Inference

```python
from transformers import AutoModelForCausalLM
from PIL import Image

# trust_remote_code is required because the model uses a custom architecture
model = AutoModelForCausalLM.from_pretrained("damerajee/GPT-Vision", trust_remote_code=True)

# Load the image and convert it to RGB
image_path = "Your_image_path"
image = Image.open(image_path)
image = image.convert("RGB")

# Ask a question about the image
question = "Render a clear and concise summary of the photo."
answer = model.generate(image=image, question=question, max_new_tokens=40)
print("Answer:", answer)
```

# Limitations

A fair warning: this model is only able to generate very short responses, and it can sometimes repeatedly generate the same tokens. Even so, it generally understands what is in the image.

Further fine-tuning will make this model better.
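Token repetition can often be reduced with standard decoding controls. If the custom `generate` method forwards extra keyword arguments to the underlying `transformers` generation call (an assumption; check the GitHub repository for the actual signature), something like the following may help:

```python
# Hypothetical: assumes the custom generate() forwards standard
# transformers generation kwargs such as repetition_penalty.
answer = model.generate(
    image=image,
    question=question,
    max_new_tokens=40,
    repetition_penalty=1.3,  # > 1.0 discourages repeating recent tokens
)
```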
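# Architecture sketch

For readers curious how the three components fit together, below is a minimal sketch of a LLaVA-style design: the ViT encodes the image into patch embeddings, a small projector maps them into GPT-2's embedding space, and the projected image tokens are prepended to the question tokens. The class name, layer layout, and dimensions here are assumptions for illustration, not the repository's actual code (see the GitHub link above for that).

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Hypothetical LLaVA-style projector: maps ViT patch embeddings
    into the GPT-2 embedding space. Dimensions assume ViT-B/16 at
    224px (768-dim patches) and GPT-2 small (768-dim embeddings)."""

    def __init__(self, vit_dim: int = 768, gpt_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, gpt_dim),
            nn.GELU(),
            nn.Linear(gpt_dim, gpt_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, vit_dim) from the ViT encoder
        return self.proj(patch_embeds)

# The projected patches are prepended to the embedded question tokens,
# so GPT-2 attends over image tokens followed by text tokens.
projector = MultimodalProjector()
image_tokens = projector(torch.randn(1, 196, 768))  # 14x14 patches for a 224px input
text_embeds = torch.randn(1, 12, 768)               # embedded question tokens
gpt_input = torch.cat([image_tokens, text_embeds], dim=1)
print(gpt_input.shape)  # torch.Size([1, 208, 768])
```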