|
---
{}
---
|
# Model Card for Fine-Tuned PaliGemma-3B-PT-224
|
|
|
This model is a fine-tuned version of `google/paligemma-3b-pt-224`, trained with the `peft` library on the `Multimodal-Fatima/VQAv2_sample_train` dataset for visual question answering and related vision-language tasks.
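
For reference, the training data can be pulled straight from the Hub. The sketch below is a minimal example for inspecting it; the `train` split name and the field layout are assumptions about this dataset's schema rather than confirmed details.

```python
# Hypothetical sketch: inspecting the fine-tuning dataset.
# The split name "train" and the field names are assumptions about the dataset schema.
from datasets import load_dataset

ds = load_dataset("Multimodal-Fatima/VQAv2_sample_train", split="train")
print(ds)            # row count and column names
print(ds[0].keys())  # e.g. image / question / answer fields, depending on the schema
```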
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is designed for vision-language tasks and was fine-tuned to answer questions about an image given a textual prompt. Training used parameter-efficient fine-tuning with `peft` together with weight quantization to keep memory and compute requirements manageable.
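
The exact quantization setup is not recorded in this card. The sketch below shows one common configuration (4-bit NF4 via `bitsandbytes`) for loading the base model before attaching adapters; treat the specific settings as an assumption, not the recorded training configuration.

```python
# Hypothetical sketch: loading the base model in 4-bit precision with bitsandbytes.
# The specific quantization settings are assumptions, not the recorded training config.
import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
)

base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",
    quantization_config=bnb_config,
    device_map="auto",
)
```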
|
|
|
- **Developed by:** [AmmarAbdelhady](https://ammar-abdelhady-ai.github.io/Ammar-Abdelhady-Portfolio/) |
|
- **Model type:** Vision-Language Model |
|
- **Language(s) (NLP):** English |
|
- **Finetuned from model:** `google/paligemma-3b-pt-224` |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [Vision-Language-Model-Fine-Tuning Notebook](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning/blob/main/fine-tuning-of-paligemma-vision-language-model.ipynb) |
|
- **Demo:** [Vision-Language-Model-Fine-Tuning](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model can be used directly for vision-language tasks, including image captioning and visual question answering. |
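
The snippet below sketches the two prompt styles: a captioning prefix and a free-form question. The `caption en` prefix follows the base PaliGemma convention; whether this fine-tune expects the same prefixes depends on how its training prompts were formatted, so treat that as an assumption. `your_model_path` and `flower.jpg` are placeholders.

```python
# Hypothetical sketch: captioning vs. visual question answering prompts.
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image

model = PaliGemmaForConditionalGeneration.from_pretrained("your_model_path")  # placeholder path
processor = PaliGemmaProcessor.from_pretrained("your_model_path")             # placeholder path
image = Image.open("flower.jpg")  # any local RGB image

for prompt in ["caption en", "What is on the flower?"]:
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=20)
    print(prompt, "->", processor.decode(output[0], skip_special_tokens=True))
```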
|
|
|
### Downstream Use |
|
|
|
The model can be fine-tuned further for specific tasks or integrated into larger systems requiring vision-language capabilities. |
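
As a rough illustration of further fine-tuning, the sketch below attaches a fresh LoRA adapter with `peft`. The rank, alpha, and target modules are illustrative assumptions, not the settings used for this model.

```python
# Hypothetical sketch: attaching a new LoRA adapter for further fine-tuning.
# Rank, alpha, and target modules are illustrative assumptions.
from transformers import PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-pt-224")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights should be trainable
```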
|
|
|
### Out-of-Scope Use |
|
|
|
The model is not suitable for tasks unrelated to vision-language processing, such as purely text-based or purely image-based tasks without multimodal interaction. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The model may inherit biases present in the VQAv2-derived training data, both in the images and in the question and answer text. It is crucial to evaluate and mitigate these biases before relying on the model in downstream applications.
|
|
|
### Recommendations |
|
|
|
Users should be aware of the model's limitations and potential biases. It is recommended to perform thorough evaluations on diverse datasets to understand the model's performance across different scenarios. |
|
|
|
## How to Get Started with the Model |
|
|
|
```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch
from PIL import Image
import requests

# Load the fine-tuned model and its processor
# (replace 'your_model_path' with the actual local path or Hub repo ID)
model = PaliGemmaForConditionalGeneration.from_pretrained("your_model_path")
processor = PaliGemmaProcessor.from_pretrained("your_model_path")

# Prepare a prompt and an example image
prompt = "What is on the flower?"
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
raw_image = Image.open(requests.get(image_url, stream=True).raw)

# Preprocess the inputs and generate an answer
inputs = processor(text=prompt, images=raw_image, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)

print(processor.decode(output[0], skip_special_tokens=True))
```
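
If this repository stores only the LoRA adapter rather than merged weights, the adapter has to be attached to the base checkpoint before generation. A minimal sketch, assuming an adapter-only repo and a placeholder adapter path, follows.

```python
# Hypothetical sketch: loading a PEFT adapter on top of the base checkpoint.
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from peft import PeftModel

base = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-pt-224")
model = PeftModel.from_pretrained(base, "your_adapter_path")  # placeholder adapter path
processor = PaliGemmaProcessor.from_pretrained("google/paligemma-3b-pt-224")
```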
|
|