# Florence 2 VQA - Engineering Drawings

## Model Overview

The **Florence 2 VQA** model is fine-tuned for visual question answering (VQA) tasks, specifically for **engineering drawings**. It takes both an **image** (e.g., a technical drawing) and a **textual question** as input, and generates a text-based answer related to the content of the drawing.
---

## How to Use the Model

### **Install Dependencies**

Make sure you have the required libraries installed:

```bash
pip install transformers torch datasets pillow gradio
```
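To confirm the environment is set up, a quick standard-library check can verify each package is importable (note that `pillow` installs under the module name `PIL`):

```python
from importlib.util import find_spec

# pillow is imported as "PIL", hence the different name below.
required = ["transformers", "torch", "datasets", "PIL", "gradio"]
missing = [name for name in required if find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```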
### **Load the Model and Processor**

To load the model and processor for inference, use the following code (the repo ID shown is a placeholder — substitute the actual model repository):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoProcessor
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder: replace with this model's repository ID.
model_id = "microsoft/Florence-2-base-ft"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True
).to(device)
```
### **Define the Prediction Function**

Once the model and processor are loaded, define a prediction function that takes an image path and a question as input:

```python
def predict(image_path, question):
    from PIL import Image

    # Open the image and prepare the model inputs
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)

    # Generate the answer tokens
    outputs = model.generate(**inputs)

    # Decode the output tokens into a human-readable format
    answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
```
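Depending on the prompt format, the decoded string can echo part of the input or carry stray whitespace. A small, illustrative helper (plain string handling, not part of the model API) can tidy the answer before display:

```python
def clean_answer(raw: str, question: str) -> str:
    # Trim whitespace and drop an echoed question prefix, if present.
    answer = raw.strip()
    if answer.startswith(question):
        answer = answer[len(question):].strip()
    return answer

print(clean_answer("  What is shown? A gear assembly.  ", "What is shown?"))
# -> A gear assembly.
```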
### **Example Usage**

Now, test the model with an image and a question:

```python
image_path = "test.png"  # Replace with your image path
question = "Tell me in detail about the image?"

# Call the prediction function
answer = predict(image_path, question)
print("Answer:", answer)
```
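Since `predict` opens the file directly, a small guard against bad paths avoids confusing stack traces. The helper below is an illustrative addition, not part of the model code:

```python
from pathlib import Path

# Common raster formats PIL can open for typical drawings.
SUPPORTED = {".png", ".jpg", ".jpeg", ".bmp", ".tiff"}

def is_valid_image_path(image_path: str) -> bool:
    # True only for an existing file with a supported extension.
    p = Path(image_path)
    return p.is_file() and p.suffix.lower() in SUPPORTED

print(is_valid_image_path("definitely_missing_file.png"))  # -> False
```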
### **Alternative: Use Gradio for an Interactive Web Interface**

If you prefer an interactive interface, you can use Gradio to deploy the model:

```python
import gradio as gr
from PIL import Image

# Define the prediction function for Gradio
def predict(image, question):
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs)
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface
interface = gr.Interface(
    fn=predict,
    inputs=["image", "text"],
    outputs="text",
    title="Florence 2 VQA - Engineering Drawings",
    description="Upload an engineering drawing and ask a related question."
)

# Launch the Gradio interface
interface.launch()
```
---

## Training Details

- **Preprocessing**:
  - Images were resized and normalized.
  - Text data (questions and answers) was tokenized using the Florence tokenizer.
- **Hyperparameters**:
  - **Learning Rate**: `1e-6`
  - **Batch Size**: `2`
  - **Gradient Accumulation Steps**: `4`
  - **Epochs**: `10`

Training was performed using mixed precision for efficiency.
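With these settings, gradients are accumulated over 4 steps before each optimizer update, so the effective batch size is the per-step batch size times the accumulation steps:

```python
# Effective batch size under gradient accumulation.
batch_size = 2
grad_accum_steps = 4
effective_batch_size = batch_size * grad_accum_steps
print(effective_batch_size)  # -> 8
```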
---

## Limitations

- This model is specifically fine-tuned for engineering drawings and may not perform well on general-purpose images or questions.
- Generated answers may be inaccurate and should be verified before use.

---
## Acknowledgments

- **Base Model**: [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)

For further assistance, feel free to reach out or raise an issue in the repository.