fauzail committed on
Commit 391c1bc · verified · 1 Parent(s): 9225694

Update README.md

Files changed (1):
  1. README.md +68 -8
  1. README.md +68 -8
README.md CHANGED

@@ -1,7 +1,9 @@
+
+```markdown
 # Florence 2 VQA - Engineering Drawings
 
 ## Model Overview
-The **Florence 2 VQA** model is fine-tuned for visual question answering (VQA) tasks, specifically for **engineering drawings**. It takes both an **image** (e.g., a technical drawing) and a **textual question** as input, and generates a text-based answer related to the content of the drawing.
+The **Florence 2 VQA** model is fine-tuned for visual question answering (VQA) tasks, specifically for **engineering drawings**. It takes both an **image** (e.g., a technical drawing) and a **textual question** as input, and generates a text-based answer related to the content of the drawing.
 
 ---
 
@@ -16,13 +18,16 @@ The **Florence 2 VQA** model is fine-tuned for visual question answering (VQA) t
 ## How to Use the Model
 
 ### **Install Dependencies**
-
 Make sure you have the required libraries installed:
 ```bash
 pip install transformers torch datasets pillow gradio
+```
+
+### **Load the Model and Processor**
 
-#Load the model and processor
+To load the model and processor for inference, use the following code:
 
+```python
 from transformers import AutoConfig, AutoModelForCausalLM
 import torch
 
@@ -40,8 +45,13 @@ model = AutoModelForCausalLM.from_pretrained(
     trust_remote_code=True
 ).to(device)
+```
 
-# Define the Prediction Function
+### **Define the Prediction Function**
+
+Once the model and processor are loaded, define a prediction function that takes an image and question as input:
 
+```python
 def predict(image_path, question):
     from PIL import Image
 
@@ -57,24 +67,36 @@ def predict(image_path, question):
     # Decode the output tokens into a human-readable format
     answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
     return answer
+```
+
+### **Test It for Example**
 
-#Test it for Example
+Now, test the model using an image and a question:
 
-image_path = "test.png"
-question = "tell me in detail about the image?"
+```python
+image_path = "test.png"  # Replace with your image path
+question = "Tell me in detail about the image?"
 
 # Call the prediction function
 answer = predict(image_path, question)
 print("Answer:", answer)
+```
 
-#Alrernative Use Gradio
+### **Alternative: Use Gradio for Interactive Web Interface**
+
+If you prefer an interactive interface, you can use Gradio to deploy the model:
 
+```python
 import gradio as gr
+from PIL import Image
 
+# Define the prediction function for Gradio
 def predict(image, question):
     inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
     outputs = model.generate(**inputs)
     return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
 
+# Create the Gradio interface
 interface = gr.Interface(
     fn=predict,
     inputs=["image", "text"],
@@ -82,4 +104,42 @@ interface = gr.Interface(
     title="Florence 2 VQA - Engineering Drawings",
     description="Upload an engineering drawing and ask a related question."
 )
+
+# Launch the Gradio interface
 interface.launch()
+```
+
+---
+
+## Training Details
+- **Preprocessing**:
+  - Images were resized and normalized.
+  - Text data (questions and answers) was tokenized using the Florence tokenizer.
+- **Hyperparameters**:
+  - **Learning Rate**: `1e-6`
+  - **Batch Size**: `2`
+  - **Gradient Accumulation Steps**: `4`
+  - **Epochs**: `10`
+
+Training was performed using mixed precision for efficiency.
+
+---
+
+## Limitations
+- This model is specifically fine-tuned for engineering drawings and may not perform well on general-purpose images or questions.
+- Answers may be inaccurate.
+
+---
+
+## Acknowledgments
+- **Base Model**: [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
+
+For further assistance, feel free to reach out or raise an issue in the repository.
+```
+
+### Key Sections Added:
+
+1. **Model Loading**: The section explains how to load the model and processor in detail.
+2. **Prediction Function**: Code to define a function that will use the model for inference on a given image and question.
+3. **Gradio Interface**: Provides an example of deploying the model using Gradio for interactive usage.
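The Training Details added in this commit say images were "resized and normalized" but do not show the transform. As a rough, dependency-free illustration of what that step does (a nearest-neighbor resize plus scaling pixel values into [0, 1]; these helper names are hypothetical and this is not the Florence processor's actual pipeline):

```python
def resize_nearest(pixels, out_h, out_w):
    """Nearest-neighbor resize of a 2-D grid of grayscale pixel values."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [[pixels[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

def normalize(pixels):
    """Scale 0-255 pixel values into the [0, 1] range."""
    return [[p / 255.0 for p in row] for row in pixels]

# A tiny 4x4 "drawing" downsampled to 2x2, then normalized.
image = [
    [0,   51,  102, 153],
    [204, 255, 0,   51],
    [102, 153, 204, 255],
    [0,   51,  102, 153],
]
small = resize_nearest(image, 2, 2)
print(normalize(small))  # [[0.0, 0.4], [0.4, 0.8]]
```

In practice the Hugging Face processor loaded in the README handles both steps internally when you call `processor(images=...)`.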
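The hyperparameters list a batch size of 2 with 4 gradient accumulation steps, i.e. an effective batch size of 2 × 4 = 8. A minimal pure-Python sketch of why accumulation reproduces the larger batch (a mock scalar SGD step with made-up gradient values, not the actual training loop):

```python
MICRO_BATCH_SIZE = 2      # per-step batch size from the README
ACCUMULATION_STEPS = 4    # gradient accumulation steps from the README
LEARNING_RATE = 1e-6      # learning rate from the README

def sgd_with_accumulation(weight, per_sample_grads):
    """One optimizer step after accumulating gradients over
    ACCUMULATION_STEPS micro-batches of MICRO_BATCH_SIZE samples each."""
    accumulated = 0.0
    for step in range(ACCUMULATION_STEPS):
        start = step * MICRO_BATCH_SIZE
        micro = per_sample_grads[start:start + MICRO_BATCH_SIZE]
        # Mean gradient of the micro-batch, divided by the number of
        # accumulation steps so the total equals the full-batch mean.
        accumulated += sum(micro) / len(micro) / ACCUMULATION_STEPS
    return weight - LEARNING_RATE * accumulated

grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # 8 = effective batch size
new_w = sgd_with_accumulation(0.5, grads)

# Identical update to a single step on the full batch of 8:
full_batch = 0.5 - LEARNING_RATE * (sum(grads) / len(grads))
```

This is why accumulation lets a memory-constrained GPU train as if the batch were four times larger, at the cost of four forward/backward passes per optimizer step.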