Netrava committed on
Commit 0d13a55 · verified · 1 parent: 51450ce

Upload 4 files

Files changed (4):
  1. README.md +76 -8
  2. app.py +165 -0
  3. app_simplified.py +100 -1
  4. requirements.txt +2 -1
README.md CHANGED
@@ -1,28 +1,57 @@
  ---
- title: OmniParser v2.0 API (Simplified)
+ title: OmniParser v2.0 API
  emoji: 🖼️
  colorFrom: blue
  colorTo: indigo
  sdk: gradio
  sdk_version: 4.0.0
- app_file: app_simplified.py
+ app_file: app.py
  pinned: false
  ---

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

- # OmniParser v2.0 API (Simplified Version)
+ # OmniParser v2.0 API

- This is a simplified version of the OmniParser v2.0 API that simulates the functionality without using the actual models. It's provided as a fallback in case the full version has compatibility issues.
+ This is a public API endpoint for Microsoft's OmniParser v2.0, which can parse UI screenshots and return structured data.

  ## Features

- - Simulates parsing UI screenshots into structured JSON data
+ - Parses UI screenshots into structured JSON data
  - Identifies interactive elements (buttons, menus, icons, etc.)
  - Provides captions describing the functionality of each element
  - Returns visualization of detected elements
  - Accessible via a simple REST API

+ ## Enhancement Opportunities
+
+ The current implementation provides a solid foundation, but there are several opportunities for enhancement:
+
+ ### Data Fusion
+ - **Current**: YOLO for detection and VLM for captioning are used separately
+ - **Enhancement**: Implement a more integrated approach that combines YOLO, VLM, OCR, and SAM
+ - **Benefits**: More accurate detection, better context understanding, and more precise segmentation
+
+ ### OCR Integration
+ - **Current**: OCR is used separately from YOLO detection
+ - **Enhancement**: Use OCR results to refine YOLO detections and merge overlapping text and UI elements
+ - **Benefits**: Better text recognition in UI elements and improved element classification
+
+ ### SAM Integration
+ - **Current**: No segmentation model is used
+ - **Enhancement**: Integrate SAM (Segment Anything Model) for precise segmentation of UI elements
+ - **Benefits**: Better handling of complex UI layouts and irregular-shaped elements
+
+ ### Confidence Scoring
+ - **Current**: Simple confidence scores from individual models
+ - **Enhancement**: Combine confidence scores from multiple models and consider element context
+ - **Benefits**: More reliable confidence scores and better prioritization of elements
+
+ ### Predictive Monitoring
+ - **Current**: No verification of detected elements
+ - **Enhancement**: Verify that detected elements make sense in the UI context
+ - **Benefits**: Identify missing or incorrectly detected elements and improve detection accuracy
+
  ## API Usage

  You can use this API by sending a POST request with a file upload:
@@ -55,8 +84,47 @@ for element in elements:
  visualization_base64 = result["visualization"]
  ```

- ## Note
+ ## Response Format

- This is a simplified version that simulates OmniParser functionality. It does not use the actual OmniParser models. The elements detected are generated randomly and do not represent actual UI elements in the image.
+ The API returns a JSON object with the following structure:

- For the full version that uses the actual OmniParser models, please see the main repository.
+ ```json
+ {
+   "status": "success",
+   "elements": [
+     {
+       "id": 0,
+       "text": "Button 1",
+       "caption": "Click to submit form",
+       "coordinates": [0.1, 0.1, 0.3, 0.2],
+       "is_interactable": true,
+       "confidence": 0.95
+     },
+     {
+       "id": 1,
+       "text": "Menu",
+       "caption": "Navigation menu",
+       "coordinates": [0.4, 0.5, 0.6, 0.6],
+       "is_interactable": true,
+       "confidence": 0.87
+     }
+   ],
+   "visualization": "base64_encoded_image_string"
+ }
+ ```
+
+ ## Deployment
+
+ This API is deployed on Hugging Face Spaces using Gradio. The deployment is free and provides a public URL that you can use in your applications.
+
+ ## Credits
+
+ This API uses Microsoft's OmniParser v2.0, a screen-parsing tool for pure vision-based GUI agents. For more information, visit the [OmniParser GitHub repository](https://github.com/microsoft/OmniParser).
+
+ ## License
+
+ Please note that the OmniParser models have specific licenses:
+ - the icon_detect model is under the AGPL license
+ - the icon_caption model is under the MIT license
+
+ Please refer to the LICENSE file in each model's folder in the original repository.
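The Response Format section above pairs naturally with a small consumer. The sketch below (not part of this commit) shows how a client might filter interactable elements and decode the base64 `visualization` field; the 1000×800 screenshot size, the confidence threshold, and the `extract_interactable` helper name are illustrative assumptions.

```python
import base64
import json

def extract_interactable(response, min_confidence=0.5):
    """Return (id, caption, pixel_box) for confident interactable elements.

    Coordinates in the response are normalized [x1, y1, x2, y2]; here they
    are scaled to pixels for an assumed 1000x800 screenshot.
    """
    width, height = 1000, 800  # assumption: the original screenshot size
    results = []
    for el in response["elements"]:
        if el["is_interactable"] and el["confidence"] >= min_confidence:
            x1, y1, x2, y2 = el["coordinates"]
            pixel_box = (int(x1 * width), int(y1 * height),
                         int(x2 * width), int(y2 * height))
            results.append((el["id"], el["caption"], pixel_box))
    return results

# Sample response mirroring the documented format
sample = json.loads("""
{
  "status": "success",
  "elements": [
    {"id": 0, "text": "Button 1", "caption": "Click to submit form",
     "coordinates": [0.1, 0.1, 0.3, 0.2], "is_interactable": true, "confidence": 0.95},
    {"id": 1, "text": "Menu", "caption": "Navigation menu",
     "coordinates": [0.4, 0.5, 0.6, 0.6], "is_interactable": true, "confidence": 0.87}
  ],
  "visualization": ""
}
""")
sample["visualization"] = base64.b64encode(b"fake-png-bytes").decode()

hits = extract_interactable(sample, min_confidence=0.9)
print(hits)  # [(0, 'Click to submit form', (100, 80, 300, 160))]

# The visualization field decodes back to raw image bytes
assert base64.b64decode(sample["visualization"]) == b"fake-png-bytes"
```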
app.py CHANGED
@@ -154,13 +154,57 @@ print(f"Using device: {device}")

  # Initialize models with correct paths
  try:
+     # YOLO model for object detection
      yolo_model = get_yolo_model(model_path='OmniParser/weights/icon_detect/model.pt')
+
+     # VLM (Vision Language Model) for captioning
      caption_model_processor = get_caption_model_processor(
          model_name="florence2",
          model_name_or_path="OmniParser/weights/icon_caption_florence"
      )
+
      print("Models initialized successfully")
      models_initialized = True
+
+     # ENHANCEMENT OPPORTUNITY: Data Fusion
+     # The current implementation uses YOLO for detection and VLM for captioning separately.
+     # A more integrated approach could:
+     # 1. Use YOLO for initial detection of UI elements
+     # 2. Use VLM to refine the detections and provide more context
+     # 3. Implement a confidence-based merging strategy for overlapping detections
+     # 4. Use SAM (Segment Anything Model) for more precise segmentation of UI elements
+     #
+     # Example implementation:
+     # ```
+     # def enhanced_detection(image, yolo_model, vlm_model, sam_model):
+     #     # Get YOLO detections
+     #     yolo_boxes = yolo_model(image)
+     #
+     #     # Use VLM to analyze the entire image for context
+     #     global_context = vlm_model.analyze_image(image)
+     #
+     #     # For each YOLO box, use VLM to get more detailed information
+     #     refined_detections = []
+     #     for box in yolo_boxes:
+     #         # Crop the region
+     #         region = crop_image(image, box)
+     #
+     #         # Get VLM description
+     #         description = vlm_model.describe_region(region, context=global_context)
+     #
+     #         # Use SAM for precise segmentation
+     #         mask = sam_model.segment(image, box)
+     #
+     #         refined_detections.append({
+     #             "box": box,
+     #             "description": description,
+     #             "mask": mask,
+     #             "confidence": combine_confidence(box.conf, description.conf)
+     #         })
+     #
+     #     return refined_detections
+     # ```
+
  except Exception as e:
      print(f"Error initializing models: {str(e)}")
      # Create dummy models for graceful failure
@@ -270,6 +314,38 @@ def process_image(

      # Run OCR to detect text
      try:
+         # ENHANCEMENT OPPORTUNITY: OCR Integration
+         # The current implementation uses OCR separately from YOLO detection.
+         # A more integrated approach could:
+         # 1. Use OCR results to refine YOLO detections
+         # 2. Merge overlapping text and UI element detections
+         # 3. Use text content to improve element classification
+         #
+         # Example implementation:
+         # ```
+         # def integrated_ocr_detection(image, ocr_results, yolo_detections):
+         #     merged_detections = []
+         #
+         #     # For each YOLO detection
+         #     for yolo_box in yolo_detections:
+         #         # Find overlapping OCR text
+         #         overlapping_text = []
+         #         for text, text_box in ocr_results:
+         #             if calculate_iou(yolo_box, text_box) > threshold:
+         #                 overlapping_text.append(text)
+         #
+         #         # Use text content to refine element classification
+         #         element_type = classify_element_with_text(yolo_box, overlapping_text)
+         #
+         #         merged_detections.append({
+         #             "box": yolo_box,
+         #             "text": " ".join(overlapping_text),
+         #             "type": element_type
+         #         })
+         #
+         #     return merged_detections
+         # ```
+
          ocr_bbox_rslt, is_goal_filtered = check_ocr_box(
              image,
              display_img=False,
@@ -291,6 +367,41 @@ def process_image(

      # Process image with OmniParser
      try:
+         # ENHANCEMENT OPPORTUNITY: SAM Integration
+         # The current implementation doesn't use SAM (Segment Anything Model).
+         # Integrating SAM could:
+         # 1. Provide more precise segmentation of UI elements
+         # 2. Better handle complex UI layouts with overlapping elements
+         # 3. Improve detection of irregular-shaped elements
+         #
+         # Example implementation:
+         # ```
+         # def integrate_sam(image, boxes, sam_model):
+         #     # Initialize SAM predictor
+         #     predictor = SamPredictor(sam_model)
+         #     predictor.set_image(np.array(image))
+         #
+         #     refined_elements = []
+         #     for box in boxes:
+         #         # Convert box to SAM input format
+         #         input_box = np.array([box[0], box[1], box[2], box[3]])
+         #
+         #         # Get SAM mask
+         #         masks, scores, _ = predictor.predict(
+         #             box=input_box,
+         #             multimask_output=False
+         #         )
+         #
+         #         # Use the mask to refine the element boundaries
+         #         refined_elements.append({
+         #             "box": box,
+         #             "mask": masks[0],
+         #             "mask_confidence": scores[0]
+         #         })
+         #
+         #     return refined_elements
+         # ```
+
          dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
              image,
              yolo_model,
@@ -315,6 +426,31 @@
      # Create structured output
      elements = []
      for i, element in enumerate(parsed_content_list):
+         # ENHANCEMENT OPPORTUNITY: Confidence Scoring
+         # The current implementation uses a simple confidence score.
+         # A more sophisticated approach could:
+         # 1. Combine confidence scores from multiple models (YOLO, VLM, OCR)
+         # 2. Consider element context and relationships
+         # 3. Use historical data to improve confidence scoring
+         #
+         # Example implementation:
+         # ```
+         # def calculate_confidence(yolo_conf, vlm_conf, ocr_conf, element_type):
+         #     # Base confidence from YOLO
+         #     base_conf = yolo_conf
+         #
+         #     # Adjust based on VLM confidence
+         #     if vlm_conf > 0.8:
+         #         base_conf = (base_conf + vlm_conf) / 2
+         #
+         #     # Adjust based on element type
+         #     if element_type == "button" and ocr_conf > 0.9:
+         #         base_conf = (base_conf + ocr_conf) / 2
+         #
+         #     # Normalize to 0-1 range
+         #     return min(1.0, base_conf)
+         # ```
+
          elements.append({
              "id": i,
              "text": element.get("text", ""),
@@ -324,6 +460,35 @@
              "confidence": element.get("confidence", 0.0)
          })

+     # ENHANCEMENT OPPORTUNITY: Predictive Monitoring
+     # The current implementation doesn't include predictive monitoring.
+     # Adding this could:
+     # 1. Verify that detected elements make sense in the UI context
+     # 2. Identify missing or incorrectly detected elements
+     # 3. Provide feedback for improving detection accuracy
+     #
+     # Example implementation:
+     # ```
+     # def verify_detections(elements, image, vlm_model):
+     #     # Use VLM to analyze the entire image
+     #     global_description = vlm_model.describe_image(image)
+     #
+     #     # Check if detected elements match the global description
+     #     expected_elements = extract_expected_elements(global_description)
+     #
+     #     # Compare detected vs expected
+     #     missing_elements = [e for e in expected_elements if not any(
+     #         similar_element(e, detected) for detected in elements
+     #     )]
+     #
+     #     # Provide feedback
+     #     return {
+     #         "verified_elements": elements,
+     #         "missing_elements": missing_elements,
+     #         "confidence": calculate_overall_confidence(elements, expected_elements)
+     #     }
+     # ```
+
      # Return structured data and visualization
      return {
          "elements": elements,
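Several of the comment sketches added in app.py reference a `calculate_iou` helper without defining it. A minimal, self-contained version (hypothetical, not part of this commit, including the `merge_ocr_text` helper and the 0.1 overlap threshold) could look like:

```python
def calculate_iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def merge_ocr_text(yolo_box, ocr_results, threshold=0.1):
    """Collect OCR strings whose boxes overlap a detection box."""
    return " ".join(text for text, text_box in ocr_results
                    if calculate_iou(yolo_box, text_box) > threshold)

# A detection box that coincides with one OCR result and misses another
ocr = [("Submit", [0.1, 0.1, 0.3, 0.2]), ("Header", [0.0, 0.9, 1.0, 1.0])]
print(merge_ocr_text([0.1, 0.1, 0.3, 0.2], ocr))  # Submit
```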
app_simplified.py CHANGED
@@ -30,6 +30,47 @@ def process_image(image):
      # Define some mock UI element types
      element_types = ["Button", "Text Field", "Checkbox", "Dropdown", "Menu Item", "Icon", "Link"]

+     # ENHANCEMENT OPPORTUNITY: Data Fusion
+     # In a real implementation, we would integrate multiple models:
+     # 1. YOLO for initial detection of UI elements
+     # 2. OCR for text detection
+     # 3. VLM for captioning and context understanding
+     # 4. SAM for precise segmentation
+     #
+     # Example architecture:
+     # ```
+     # def integrated_detection(image):
+     #     # 1. Run YOLO to detect UI elements
+     #     yolo_boxes = yolo_model(image)
+     #
+     #     # 2. Run OCR to detect text
+     #     ocr_results = ocr_model(image)
+     #
+     #     # 3. Use VLM to understand the overall context
+     #     context = vlm_model.analyze_image(image)
+     #
+     #     # 4. For each detected element, use SAM for precise segmentation
+     #     elements = []
+     #     for box in yolo_boxes:
+     #         # Get SAM mask
+     #         mask = sam_model.segment(image, box)
+     #
+     #         # Find overlapping text from OCR
+     #         element_text = find_overlapping_text(box, ocr_results)
+     #
+     #         # Use VLM to caption the element with context
+     #         caption = vlm_model.caption_region(image, box, context)
+     #
+     #         elements.append({
+     #             "box": box,
+     #             "mask": mask,
+     #             "text": element_text,
+     #             "caption": caption
+     #         })
+     #
+     #     return elements
+     # ```
+
      # Generate some random elements
      elements = []
      num_elements = min(15, int(image.width * image.height / 40000))  # Scale with image size
@@ -57,6 +98,34 @@ def process_image(image):
          text = random.choice(captions[element_type])
          caption = f"{element_type}: {text}"

+         # ENHANCEMENT OPPORTUNITY: Confidence Scoring
+         # In a real implementation, confidence scores would be calculated based on:
+         # 1. Detection confidence from YOLO
+         # 2. Text recognition confidence from OCR
+         # 3. Caption confidence from VLM
+         # 4. Segmentation confidence from SAM
+         #
+         # Example implementation:
+         # ```
+         # def calculate_confidence(detection_conf, ocr_conf, vlm_conf, sam_conf):
+         #     # Weighted average of confidence scores
+         #     weights = {
+         #         "detection": 0.4,
+         #         "ocr": 0.2,
+         #         "vlm": 0.3,
+         #         "sam": 0.1
+         #     }
+         #
+         #     confidence = (
+         #         weights["detection"] * detection_conf +
+         #         weights["ocr"] * ocr_conf +
+         #         weights["vlm"] * vlm_conf +
+         #         weights["sam"] * sam_conf
+         #     )
+         #
+         #     return confidence
+         # ```
+
          # Add to elements list
          elements.append({
              "id": i,
@@ -71,10 +140,40 @@ def process_image(image):
          draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
          draw.text((x1, y1 - 10), f"{i}: {text}", fill="red")

+     # ENHANCEMENT OPPORTUNITY: Predictive Monitoring
+     # In a real implementation, we would verify the detected elements:
+     # 1. Check if the detected elements make sense in the UI context
+     # 2. Verify that interactive elements have appropriate labels
+     # 3. Ensure that the UI structure is coherent
+     #
+     # Example implementation:
+     # ```
+     # def verify_ui_elements(elements, image):
+     #     # Use VLM to analyze the entire UI
+     #     ui_analysis = vlm_model.analyze_ui(image)
+     #
+     #     # Check if detected elements match the expected UI structure
+     #     verified_elements = []
+     #     for element in elements:
+     #         # Verify element type based on appearance and context
+     #         verified_type = verify_element_type(element, ui_analysis)
+     #
+     #         # Verify interactability
+     #         verified_interactable = verify_interactability(element, verified_type)
+     #
+     #         verified_elements.append({
+     #             **element,
+     #             "verified_type": verified_type,
+     #             "verified_interactable": verified_interactable
+     #         })
+     #
+     #     return verified_elements
+     # ```
+
      return {
          "elements": elements,
          "visualization": vis_img,
-         "note": "This is a simplified implementation that simulates OmniParser functionality."
+         "note": "This is a simplified implementation that simulates OmniParser functionality. For a real implementation, consider integrating YOLO, VLM, OCR, and SAM models as described in the code comments."
      }

  # API endpoint function
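The weighted-average scheme described in the Confidence Scoring comments can be made runnable with little effort. This sketch uses the weights from the comment block (0.4/0.2/0.3/0.1 are the commit's illustrative values, not tuned numbers) and additionally clamps the result to [0, 1]:

```python
def calculate_confidence(detection_conf, ocr_conf, vlm_conf, sam_conf,
                         weights=None):
    """Weighted average of per-model confidence scores.

    The default weights mirror the illustrative values in the code
    comments; real values would need to be tuned empirically.
    """
    weights = weights or {"detection": 0.4, "ocr": 0.2, "vlm": 0.3, "sam": 0.1}
    score = (weights["detection"] * detection_conf
             + weights["ocr"] * ocr_conf
             + weights["vlm"] * vlm_conf
             + weights["sam"] * sam_conf)
    # Clamp in case callers pass custom weights that don't sum to 1
    return max(0.0, min(1.0, score))

print(calculate_confidence(1.0, 1.0, 1.0, 1.0))            # 1.0
print(round(calculate_confidence(0.9, 0.5, 0.8, 0.7), 3))  # 0.77
```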
requirements.txt CHANGED
@@ -7,7 +7,8 @@ numpy>=1.24.0
  easyocr>=1.7.0
  # Use a specific version of paddleocr that works with our patch
  paddleocr==2.6.0.3
- paddlepaddle==2.4.2
+ # Use a version of paddlepaddle that is available
+ paddlepaddle>=2.5.0
  opencv-python>=4.7.0
  huggingface_hub>=0.16.0
  peft>=0.4.0