---
license: apache-2.0
---

# BLIP-2 SnapGarden

BLIP-2 SnapGarden is a version of the BLIP-2 model fine-tuned on the SnapGarden dataset to answer questions about plants.

It can also generate short descriptions of images, making it useful for image-captioning tasks.

## Model Overview

BLIP-2 (Bootstrapping Language-Image Pre-training) is a state-of-the-art model that bridges the gap between vision and language understanding.

By LoRA fine-tuning BLIP-2 on the SnapGarden dataset, this model has learned to generate captions and answers that are contextually relevant and descriptive, making it suitable for applications in image understanding and accessibility tools.
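
The exact training script for this checkpoint is not published in this card; the sketch below only illustrates what a LoRA fine-tune of BLIP-2 with the `peft` library typically looks like. The base checkpoint name, rank, alpha, and target modules are assumptions, not the values used for BLIP-2 SnapGarden.

```python
# pip install peft transformers
# Illustrative sketch only: the base checkpoint and all hyperparameters below
# are assumptions, not the configuration actually used for BLIP-2 SnapGarden.
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration

base = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in the language model (assumed)
    bias="none",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are updated
```
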
## SnapGarden Dataset

The SnapGarden dataset is a curated collection of images focusing on various plant species, gardening activities, and related scenes.

It provides a diverse set of images with corresponding captions, making it ideal for training models in the domain of botany and gardening.
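
The fine-tuning data is hosted on the Hugging Face Hub as `Baran657/SnapGarden_v0.6` (see Model Details below). A minimal sketch of loading it with the `datasets` library follows; the `train` split name is an assumption, so inspect the dataset to see its actual splits and columns:

```python
# pip install datasets
from datasets import load_dataset

# Repository ID taken from the Model Details section below.
# The "train" split name is an assumption about the repository layout.
ds = load_dataset("Baran657/SnapGarden_v0.6", split="train")

print(ds)            # dataset size and column names
print(ds[0].keys())  # fields available for each example
```
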
## Model Details

- Model Name: *BLIP-2 SnapGarden*
- Base Model: *BLIP-2*
- Fine-tuning Dataset: *Baran657/SnapGarden_v0.6*
- Task: *Visual Question Answering (VQA)*

## Usage

To use this model with the Hugging Face `transformers` library:

#### Running the model on CPU

<details>
<summary> Click to expand </summary>

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
</details>

#### Running the model on GPU

##### In full precision

<details>
<summary> Click to expand </summary>

```python
# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
</details>

##### In half precision (`float16`)

<details>
<summary> Click to expand </summary>

```python
# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", torch_dtype=torch.float16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
</details>

##### In 8-bit precision (`int8`)

<details>
<summary> Click to expand </summary>

```python
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
</details>

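
Newer releases of `transformers` prefer an explicit `BitsAndBytesConfig` over the bare `load_in_8bit=True` argument used above. A minimal sketch of the equivalent loading call for the same checkpoint:

```python
# pip install accelerate bitsandbytes
from transformers import Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Baran657/blip_2_snapgarden",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # replaces load_in_8bit=True
    device_map="auto",
)
```
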
## Applications

- Botanical Research: Assisting researchers in identifying and describing house plant species.
- Educational Tools: Providing descriptive content for educational materials in botany.
- Accessibility: Enhancing image descriptions for visually impaired individuals in gardening contexts (see the caption-only sketch below).
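
For caption-style descriptions, as in the accessibility use case above, BLIP-2 models can also be run without a question, in which case generation starts from the image alone and produces a short caption. A minimal sketch reusing the demo image from the Usage section; caption quality for this particular fine-tune has not been verified here:

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# No question is provided, so the model generates a plain caption for the image.
inputs = processor(images=raw_image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
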
## Limitations

While BLIP-2 SnapGarden performs well at generating captions and answers for plant-related images, it may not generalize effectively to images outside the gardening domain.

Users should be cautious when applying this model to unrelated image datasets. In addition, the training of this model can still be optimized; further training is planned towards the end of this week.

## License

This model is distributed under the Apache 2.0 License.

## Acknowledgements

- The original BLIP-2 model, for providing the foundational architecture.
- The creators of the SnapGarden dataset, for their valuable contribution to the field.

For more details and updates, please visit the Hugging Face model page.