# Florence 2 VQA - Engineering Drawings

## Model Overview

The **Florence 2 VQA** model is fine-tuned for visual question answering (VQA) tasks, specifically for **engineering drawings**. It takes both an **image** (e.g., a technical drawing) and a **textual question** as input, and generates a text-based answer related to the content of the drawing.
---

## How to Use the Model

### **Install Dependencies**

Make sure you have the required libraries installed:

```bash
pip install transformers torch datasets pillow gradio
```
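To confirm the environment is set up, a quick standard-library check can verify each package is importable (note that `pillow` installs under the module name `PIL`):

```python
from importlib.util import find_spec

# pillow is imported as "PIL", hence the different name below.
required = ["transformers", "torch", "datasets", "PIL", "gradio"]
missing = [name for name in required if find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```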
### **Load the Model and Processor**

To load the model and processor for inference, use the following code (the repo ID shown is a placeholder — substitute the actual model repository):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoProcessor
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder: replace with this model's repository ID.
model_id = "microsoft/Florence-2-base-ft"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True
).to(device)
```
### **Define the Prediction Function**

Once the model and processor are loaded, define a prediction function that takes an image path and a question as input:

```python
def predict(image_path, question):
    from PIL import Image

    # Open the image and prepare the model inputs
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)

    # Generate the answer tokens
    outputs = model.generate(**inputs)

    # Decode the output tokens into a human-readable format
    answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
```
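Depending on the prompt format, the decoded string can echo part of the input or carry stray whitespace. A small, illustrative helper (plain string handling, not part of the model API) can tidy the answer before display:

```python
def clean_answer(raw: str, question: str) -> str:
    # Trim whitespace and drop an echoed question prefix, if present.
    answer = raw.strip()
    if answer.startswith(question):
        answer = answer[len(question):].strip()
    return answer

print(clean_answer("  What is shown? A gear assembly.  ", "What is shown?"))
# -> A gear assembly.
```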
### **Example Usage**

Now, test the model with an image and a question:

```python
image_path = "test.png"  # Replace with your image path
question = "Tell me in detail about the image?"

# Call the prediction function
answer = predict(image_path, question)
print("Answer:", answer)
```
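Since `predict` opens the file directly, a small guard against bad paths avoids confusing stack traces. The helper below is an illustrative addition, not part of the model code:

```python
from pathlib import Path

# Common raster formats PIL can open for typical drawings.
SUPPORTED = {".png", ".jpg", ".jpeg", ".bmp", ".tiff"}

def is_valid_image_path(image_path: str) -> bool:
    # True only for an existing file with a supported extension.
    p = Path(image_path)
    return p.is_file() and p.suffix.lower() in SUPPORTED

print(is_valid_image_path("definitely_missing_file.png"))  # -> False
```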
### **Alternative: Use Gradio for an Interactive Web Interface**

If you prefer an interactive interface, you can use Gradio to deploy the model:

```python
import gradio as gr
from PIL import Image

# Define the prediction function for Gradio
def predict(image, question):
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs)
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface
interface = gr.Interface(
    fn=predict,
    inputs=["image", "text"],
    outputs="text",
    title="Florence 2 VQA - Engineering Drawings",
    description="Upload an engineering drawing and ask a related question."
)

# Launch the Gradio interface
interface.launch()
```
---

## Training Details

- **Preprocessing**:
  - Images were resized and normalized.
  - Text data (questions and answers) was tokenized using the Florence tokenizer.
- **Hyperparameters**:
  - **Learning Rate**: `1e-6`
  - **Batch Size**: `2`
  - **Gradient Accumulation Steps**: `4`
  - **Epochs**: `10`

Training was performed using mixed precision for efficiency.
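With these settings, gradients are accumulated over 4 steps before each optimizer update, so the effective batch size is the per-step batch size times the accumulation steps:

```python
# Effective batch size under gradient accumulation.
batch_size = 2
grad_accum_steps = 4
effective_batch_size = batch_size * grad_accum_steps
print(effective_batch_size)  # -> 8
```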
---

## Limitations

- This model is specifically fine-tuned for engineering drawings and may not perform well on general-purpose images or questions.
- Generated answers may be inaccurate and should be verified before use.

---
## Acknowledgments

- **Base Model**: [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)

For further assistance, feel free to reach out or raise an issue in the repository.