---
license: apache-2.0
---

# BLIP-2 SnapGarden

BLIP-2 SnapGarden is a version of the BLIP-2 model fine-tuned on the SnapGarden dataset to answer questions about plants.

It can also generate short descriptions of images, making it useful for image-captioning tasks.

## Model Overview

BLIP-2 (Bootstrapping Language-Image Pre-training) is a state-of-the-art model that bridges the gap between vision and language understanding.

By LoRA fine-tuning BLIP-2 on the SnapGarden dataset, this model has learned to generate captions and answers that are contextually relevant and descriptive, making it suitable for applications in image understanding and accessibility tools.
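
The exact training script for this checkpoint is not published in this card; the sketch below only illustrates what a LoRA fine-tune of BLIP-2 with the `peft` library typically looks like. The base checkpoint name, rank, alpha, and target modules are assumptions, not the values used for BLIP-2 SnapGarden.

```python
# pip install peft transformers
# Illustrative sketch only: the base checkpoint and all hyperparameters below
# are assumptions, not the configuration actually used for BLIP-2 SnapGarden.
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration

base = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in the language model (assumed)
    bias="none",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are updated
```
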
## SnapGarden Dataset

The SnapGarden dataset is a curated collection of images focusing on various plant species, gardening activities, and related scenes.

It provides a diverse set of images with corresponding captions, making it ideal for training models in the domain of botany and gardening.
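
The fine-tuning data is hosted on the Hugging Face Hub as `Baran657/SnapGarden_v0.6` (see Model Details below). A minimal sketch of loading it with the `datasets` library follows; the `train` split name is an assumption, so inspect the dataset to see its actual splits and columns:

```python
# pip install datasets
from datasets import load_dataset

# Repository ID taken from the Model Details section below.
# The "train" split name is an assumption about the repository layout.
ds = load_dataset("Baran657/SnapGarden_v0.6", split="train")

print(ds)            # dataset size and column names
print(ds[0].keys())  # fields available for each example
```
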
## Model Details

- Model Name: *BLIP-2 SnapGarden*
- Base Model: *BLIP-2*
- Fine-tuning Dataset: *Baran657/SnapGarden_v0.6*
- Task: *Visual Question Answering (VQA)*

## Usage

To use this model with the Hugging Face `transformers` library:

#### Running the model on CPU

<details>
<summary> Click to expand </summary>

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
</details>

#### Running the model on GPU

##### In full precision

<details>
<summary> Click to expand </summary>

```python
# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
</details>

##### In half precision (`float16`)

<details>
<summary> Click to expand </summary>

```python
# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", torch_dtype=torch.float16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
</details>

##### In 8-bit precision (`int8`)

<details>
<summary> Click to expand </summary>

```python
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
</details>

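
Newer releases of `transformers` prefer an explicit `BitsAndBytesConfig` over the bare `load_in_8bit=True` argument used above. A minimal sketch of the equivalent loading call for the same checkpoint:

```python
# pip install accelerate bitsandbytes
from transformers import Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Baran657/blip_2_snapgarden",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # replaces load_in_8bit=True
    device_map="auto",
)
```
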
## Applications

- Botanical Research: Assisting researchers in identifying and describing house plant species.
- Educational Tools: Providing descriptive content for educational materials in botany.
- Accessibility: Enhancing image descriptions for visually impaired individuals in gardening contexts (see the caption-only sketch below).
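
For caption-style descriptions, as in the accessibility use case above, BLIP-2 models can also be run without a question, in which case generation starts from the image alone and produces a short caption. A minimal sketch reusing the demo image from the Usage section; caption quality for this particular fine-tune has not been verified here:

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# No question is provided, so the model generates a plain caption for the image.
inputs = processor(images=raw_image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
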
## Limitations

While BLIP-2 SnapGarden performs well at generating captions and answers for plant-related images, it may not generalize effectively to images outside the gardening domain.

Users should be cautious when applying this model to unrelated image datasets. In addition, the training of this model can still be optimized; further training is planned towards the end of this week.

## License

This model is distributed under the Apache 2.0 License.

## Acknowledgements

- The original BLIP-2 model, for providing the foundational architecture.
- The creators of the SnapGarden dataset, for their valuable contribution to the field.

For more details and updates, please visit the Hugging Face model page.