|
---
{}
---
|
# Model Card for Fine-Tuned PaliGemma-3B-PT-224
|
|
|
This model is a fine-tuned version of `google/paligemma-3b-pt-224`, trained with the `peft` library on the `Multimodal-Fatima/VQAv2_sample_train` dataset for visual question answering and related vision-language tasks.
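
For reference, the training data can be pulled straight from the Hub. The sketch below is a minimal example for inspecting it; the `train` split name and the field layout are assumptions about this dataset's schema rather than confirmed details.

```python
# Hypothetical sketch: inspecting the fine-tuning dataset.
# The split name "train" and the field names are assumptions about the dataset schema.
from datasets import load_dataset

ds = load_dataset("Multimodal-Fatima/VQAv2_sample_train", split="train")
print(ds)            # row count and column names
print(ds[0].keys())  # e.g. image / question / answer fields, depending on the schema
```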
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is designed for vision-language tasks and was fine-tuned to answer questions about an image given a textual prompt. Training used parameter-efficient fine-tuning with `peft` together with weight quantization to keep memory and compute requirements manageable.
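
The exact quantization setup is not recorded in this card. The sketch below shows one common configuration (4-bit NF4 via `bitsandbytes`) for loading the base model before attaching adapters; treat the specific settings as an assumption, not the recorded training configuration.

```python
# Hypothetical sketch: loading the base model in 4-bit precision with bitsandbytes.
# The specific quantization settings are assumptions, not the recorded training config.
import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
)

base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",
    quantization_config=bnb_config,
    device_map="auto",
)
```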
|
|
|
- **Developed by:** [AmmarAbdelhady](https://ammar-abdelhady-ai.github.io/Ammar-Abdelhady-Portfolio/) |
|
- **Model type:** Vision-Language Model |
|
- **Language(s) (NLP):** English |
|
- **Finetuned from model:** `google/paligemma-3b-pt-224` |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [Vision-Language-Model-Fine-Tuning Notebook](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning/blob/main/fine-tuning-of-paligemma-vision-language-model.ipynb) |
|
- **Demo:** [Vision-Language-Model-Fine-Tuning](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model can be used directly for vision-language tasks, including image captioning and visual question answering. |
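
The snippet below sketches the two prompt styles: a captioning prefix and a free-form question. The `caption en` prefix follows the base PaliGemma convention; whether this fine-tune expects the same prefixes depends on how its training prompts were formatted, so treat that as an assumption. `your_model_path` and `flower.jpg` are placeholders.

```python
# Hypothetical sketch: captioning vs. visual question answering prompts.
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image

model = PaliGemmaForConditionalGeneration.from_pretrained("your_model_path")  # placeholder path
processor = PaliGemmaProcessor.from_pretrained("your_model_path")             # placeholder path
image = Image.open("flower.jpg")  # any local RGB image

for prompt in ["caption en", "What is on the flower?"]:
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=20)
    print(prompt, "->", processor.decode(output[0], skip_special_tokens=True))
```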
|
|
|
### Downstream Use |
|
|
|
The model can be fine-tuned further for specific tasks or integrated into larger systems requiring vision-language capabilities. |
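
As a rough illustration of further fine-tuning, the sketch below attaches a fresh LoRA adapter with `peft`. The rank, alpha, and target modules are illustrative assumptions, not the settings used for this model.

```python
# Hypothetical sketch: attaching a new LoRA adapter for further fine-tuning.
# Rank, alpha, and target modules are illustrative assumptions.
from transformers import PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-pt-224")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights should be trainable
```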
|
|
|
### Out-of-Scope Use |
|
|
|
The model is not suitable for tasks unrelated to vision-language processing, such as purely text-based or purely image-based tasks without multimodal interaction. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The model may inherit biases present in the VQAv2-derived training data, both in the images and in the question and answer text. It is crucial to evaluate and mitigate these biases before relying on the model in downstream applications.
|
|
|
### Recommendations |
|
|
|
Users should be aware of the model's limitations and potential biases. It is recommended to perform thorough evaluations on diverse datasets to understand the model's performance across different scenarios. |
|
|
|
## How to Get Started with the Model |
|
|
|
```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch
from PIL import Image
import requests

# Load the fine-tuned model and its processor
# (replace 'your_model_path' with the actual local path or Hub repo ID)
model = PaliGemmaForConditionalGeneration.from_pretrained("your_model_path")
processor = PaliGemmaProcessor.from_pretrained("your_model_path")

# Prepare a prompt and an example image
prompt = "What is on the flower?"
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
raw_image = Image.open(requests.get(image_url, stream=True).raw)

# Preprocess the inputs and generate an answer
inputs = processor(text=prompt, images=raw_image, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)

print(processor.decode(output[0], skip_special_tokens=True))
```
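
If this repository stores only the LoRA adapter rather than merged weights, the adapter has to be attached to the base checkpoint before generation. A minimal sketch, assuming an adapter-only repo and a placeholder adapter path, follows.

```python
# Hypothetical sketch: loading a PEFT adapter on top of the base checkpoint.
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from peft import PeftModel

base = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-pt-224")
model = PeftModel.from_pretrained(base, "your_adapter_path")  # placeholder adapter path
processor = PaliGemmaProcessor.from_pretrained("google/paligemma-3b-pt-224")
```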
|
|