Netrava committed on
Commit 0d13a55 · verified · 1 parent: 51450ce

Upload 4 files

Files changed (4):
  1. README.md +76 -8
  2. app.py +165 -0
  3. app_simplified.py +100 -1
  4. requirements.txt +2 -1
README.md CHANGED
@@ -1,28 +1,57 @@
  ---
- title: OmniParser v2.0 API (Simplified)
+ title: OmniParser v2.0 API
  emoji: 🖼️
  colorFrom: blue
  colorTo: indigo
  sdk: gradio
  sdk_version: 4.0.0
- app_file: app_simplified.py
+ app_file: app.py
  pinned: false
  ---

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

- # OmniParser v2.0 API (Simplified Version)
+ # OmniParser v2.0 API

- This is a simplified version of the OmniParser v2.0 API that simulates the functionality without using the actual models. It's provided as a fallback in case the full version has compatibility issues.
+ This is a public API endpoint for Microsoft's OmniParser v2.0, which can parse UI screenshots and return structured data.

  ## Features

- - Simulates parsing UI screenshots into structured JSON data
+ - Parses UI screenshots into structured JSON data
  - Identifies interactive elements (buttons, menus, icons, etc.)
  - Provides captions describing the functionality of each element
  - Returns visualization of detected elements
  - Accessible via a simple REST API

+ ## Enhancement Opportunities
+
+ The current implementation provides a solid foundation, but there are several opportunities for enhancement:
+
+ ### Data Fusion
+ - **Current**: YOLO for detection and VLM for captioning are used separately
+ - **Enhancement**: Implement a more integrated approach that combines YOLO, VLM, OCR, and SAM
+ - **Benefits**: More accurate detection, better context understanding, and more precise segmentation
+
+ ### OCR Integration
+ - **Current**: OCR is used separately from YOLO detection
+ - **Enhancement**: Use OCR results to refine YOLO detections and merge overlapping text and UI elements
+ - **Benefits**: Better text recognition in UI elements and improved element classification
+
+ ### SAM Integration
+ - **Current**: No segmentation model is used
+ - **Enhancement**: Integrate SAM (Segment Anything Model) for precise segmentation of UI elements
+ - **Benefits**: Better handling of complex UI layouts and irregular-shaped elements
+
+ ### Confidence Scoring
+ - **Current**: Simple confidence scores from individual models
+ - **Enhancement**: Combine confidence scores from multiple models and consider element context
+ - **Benefits**: More reliable confidence scores and better prioritization of elements
+
+ ### Predictive Monitoring
+ - **Current**: No verification of detected elements
+ - **Enhancement**: Verify that detected elements make sense in the UI context
+ - **Benefits**: Identify missing or incorrectly detected elements and improve detection accuracy
+
  ## API Usage

  You can use this API by sending a POST request with a file upload:
@@ -55,8 +84,47 @@ for element in elements:
  visualization_base64 = result["visualization"]
  ```

- ## Note
+ ## Response Format

- This is a simplified version that simulates OmniParser functionality. It does not use the actual OmniParser models. The elements detected are generated randomly and do not represent actual UI elements in the image.
+ The API returns a JSON object with the following structure:

- For the full version that uses the actual OmniParser models, please see the main repository.
+ ```json
+ {
+   "status": "success",
+   "elements": [
+     {
+       "id": 0,
+       "text": "Button 1",
+       "caption": "Click to submit form",
+       "coordinates": [0.1, 0.1, 0.3, 0.2],
+       "is_interactable": true,
+       "confidence": 0.95
+     },
+     {
+       "id": 1,
+       "text": "Menu",
+       "caption": "Navigation menu",
+       "coordinates": [0.4, 0.5, 0.6, 0.6],
+       "is_interactable": true,
+       "confidence": 0.87
+     }
+   ],
+   "visualization": "base64_encoded_image_string"
+ }
+ ```
+
+ ## Deployment
+
+ This API is deployed on Hugging Face Spaces using Gradio. The deployment is free and provides a public URL that you can use in your applications.
+
+ ## Credits
+
+ This API uses Microsoft's OmniParser v2.0, a screen-parsing tool for pure vision-based GUI agents. For more information, visit the [OmniParser GitHub repository](https://github.com/microsoft/OmniParser).
+
+ ## License
+
+ Please note that the OmniParser models have specific licenses:
+ - the icon_detect model is under the AGPL license
+ - the icon_caption model is under the MIT license
+
+ Please refer to the LICENSE file in each model's folder in the original repository.
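The Response Format section above pairs naturally with a small consumer. The sketch below (not part of this commit) shows how a client might filter interactable elements and decode the base64 `visualization` field; the 1000×800 screenshot size, the confidence threshold, and the `extract_interactable` helper name are illustrative assumptions.

```python
import base64
import json

def extract_interactable(response, min_confidence=0.5):
    """Return (id, caption, pixel_box) for confident interactable elements.

    Coordinates in the response are normalized [x1, y1, x2, y2]; here they
    are scaled to pixels for an assumed 1000x800 screenshot.
    """
    width, height = 1000, 800  # assumption: the original screenshot size
    results = []
    for el in response["elements"]:
        if el["is_interactable"] and el["confidence"] >= min_confidence:
            x1, y1, x2, y2 = el["coordinates"]
            pixel_box = (int(x1 * width), int(y1 * height),
                         int(x2 * width), int(y2 * height))
            results.append((el["id"], el["caption"], pixel_box))
    return results

# Sample response mirroring the documented format
sample = json.loads("""
{
  "status": "success",
  "elements": [
    {"id": 0, "text": "Button 1", "caption": "Click to submit form",
     "coordinates": [0.1, 0.1, 0.3, 0.2], "is_interactable": true, "confidence": 0.95},
    {"id": 1, "text": "Menu", "caption": "Navigation menu",
     "coordinates": [0.4, 0.5, 0.6, 0.6], "is_interactable": true, "confidence": 0.87}
  ],
  "visualization": ""
}
""")
sample["visualization"] = base64.b64encode(b"fake-png-bytes").decode()

hits = extract_interactable(sample, min_confidence=0.9)
print(hits)  # [(0, 'Click to submit form', (100, 80, 300, 160))]

# The visualization field decodes back to raw image bytes
assert base64.b64decode(sample["visualization"]) == b"fake-png-bytes"
```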
app.py CHANGED
@@ -154,13 +154,57 @@ print(f"Using device: {device}")

  # Initialize models with correct paths
  try:
+     # YOLO model for object detection
      yolo_model = get_yolo_model(model_path='OmniParser/weights/icon_detect/model.pt')
+
+     # VLM (Vision Language Model) for captioning
      caption_model_processor = get_caption_model_processor(
          model_name="florence2",
          model_name_or_path="OmniParser/weights/icon_caption_florence"
      )
+
      print("Models initialized successfully")
      models_initialized = True
+
+     # ENHANCEMENT OPPORTUNITY: Data Fusion
+     # The current implementation uses YOLO for detection and VLM for captioning separately.
+     # A more integrated approach could:
+     # 1. Use YOLO for initial detection of UI elements
+     # 2. Use VLM to refine the detections and provide more context
+     # 3. Implement a confidence-based merging strategy for overlapping detections
+     # 4. Use SAM (Segment Anything Model) for more precise segmentation of UI elements
+     #
+     # Example implementation:
+     # ```
+     # def enhanced_detection(image, yolo_model, vlm_model, sam_model):
+     #     # Get YOLO detections
+     #     yolo_boxes = yolo_model(image)
+     #
+     #     # Use VLM to analyze the entire image for context
+     #     global_context = vlm_model.analyze_image(image)
+     #
+     #     # For each YOLO box, use VLM to get more detailed information
+     #     refined_detections = []
+     #     for box in yolo_boxes:
+     #         # Crop the region
+     #         region = crop_image(image, box)
+     #
+     #         # Get VLM description
+     #         description = vlm_model.describe_region(region, context=global_context)
+     #
+     #         # Use SAM for precise segmentation
+     #         mask = sam_model.segment(image, box)
+     #
+     #         refined_detections.append({
+     #             "box": box,
+     #             "description": description,
+     #             "mask": mask,
+     #             "confidence": combine_confidence(box.conf, description.conf)
+     #         })
+     #
+     #     return refined_detections
+     # ```
+
  except Exception as e:
      print(f"Error initializing models: {str(e)}")
      # Create dummy models for graceful failure
@@ -270,6 +314,38 @@ def process_image(

      # Run OCR to detect text
      try:
+         # ENHANCEMENT OPPORTUNITY: OCR Integration
+         # The current implementation uses OCR separately from YOLO detection.
+         # A more integrated approach could:
+         # 1. Use OCR results to refine YOLO detections
+         # 2. Merge overlapping text and UI element detections
+         # 3. Use text content to improve element classification
+         #
+         # Example implementation:
+         # ```
+         # def integrated_ocr_detection(image, ocr_results, yolo_detections):
+         #     merged_detections = []
+         #
+         #     # For each YOLO detection
+         #     for yolo_box in yolo_detections:
+         #         # Find overlapping OCR text
+         #         overlapping_text = []
+         #         for text, text_box in ocr_results:
+         #             if calculate_iou(yolo_box, text_box) > threshold:
+         #                 overlapping_text.append(text)
+         #
+         #         # Use text content to refine element classification
+         #         element_type = classify_element_with_text(yolo_box, overlapping_text)
+         #
+         #         merged_detections.append({
+         #             "box": yolo_box,
+         #             "text": " ".join(overlapping_text),
+         #             "type": element_type
+         #         })
+         #
+         #     return merged_detections
+         # ```
+
          ocr_bbox_rslt, is_goal_filtered = check_ocr_box(
              image,
              display_img=False,
@@ -291,6 +367,41 @@ def process_image(

      # Process image with OmniParser
      try:
+         # ENHANCEMENT OPPORTUNITY: SAM Integration
+         # The current implementation doesn't use SAM (Segment Anything Model).
+         # Integrating SAM could:
+         # 1. Provide more precise segmentation of UI elements
+         # 2. Better handle complex UI layouts with overlapping elements
+         # 3. Improve detection of irregular-shaped elements
+         #
+         # Example implementation:
+         # ```
+         # def integrate_sam(image, boxes, sam_model):
+         #     # Initialize SAM predictor
+         #     predictor = SamPredictor(sam_model)
+         #     predictor.set_image(np.array(image))
+         #
+         #     refined_elements = []
+         #     for box in boxes:
+         #         # Convert box to SAM input format
+         #         input_box = np.array([box[0], box[1], box[2], box[3]])
+         #
+         #         # Get SAM mask
+         #         masks, scores, _ = predictor.predict(
+         #             box=input_box,
+         #             multimask_output=False
+         #         )
+         #
+         #         # Use the mask to refine the element boundaries
+         #         refined_elements.append({
+         #             "box": box,
+         #             "mask": masks[0],
+         #             "mask_confidence": scores[0]
+         #         })
+         #
+         #     return refined_elements
+         # ```
+
          dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
              image,
              yolo_model,
@@ -315,6 +426,31 @@
      # Create structured output
      elements = []
      for i, element in enumerate(parsed_content_list):
+         # ENHANCEMENT OPPORTUNITY: Confidence Scoring
+         # The current implementation uses a simple confidence score.
+         # A more sophisticated approach could:
+         # 1. Combine confidence scores from multiple models (YOLO, VLM, OCR)
+         # 2. Consider element context and relationships
+         # 3. Use historical data to improve confidence scoring
+         #
+         # Example implementation:
+         # ```
+         # def calculate_confidence(yolo_conf, vlm_conf, ocr_conf, element_type):
+         #     # Base confidence from YOLO
+         #     base_conf = yolo_conf
+         #
+         #     # Adjust based on VLM confidence
+         #     if vlm_conf > 0.8:
+         #         base_conf = (base_conf + vlm_conf) / 2
+         #
+         #     # Adjust based on element type
+         #     if element_type == "button" and ocr_conf > 0.9:
+         #         base_conf = (base_conf + ocr_conf) / 2
+         #
+         #     # Normalize to 0-1 range
+         #     return min(1.0, base_conf)
+         # ```
+
          elements.append({
              "id": i,
              "text": element.get("text", ""),
@@ -324,6 +460,35 @@
              "confidence": element.get("confidence", 0.0)
          })

+     # ENHANCEMENT OPPORTUNITY: Predictive Monitoring
+     # The current implementation doesn't include predictive monitoring.
+     # Adding this could:
+     # 1. Verify that detected elements make sense in the UI context
+     # 2. Identify missing or incorrectly detected elements
+     # 3. Provide feedback for improving detection accuracy
+     #
+     # Example implementation:
+     # ```
+     # def verify_detections(elements, image, vlm_model):
+     #     # Use VLM to analyze the entire image
+     #     global_description = vlm_model.describe_image(image)
+     #
+     #     # Check if detected elements match the global description
+     #     expected_elements = extract_expected_elements(global_description)
+     #
+     #     # Compare detected vs expected
+     #     missing_elements = [e for e in expected_elements if not any(
+     #         similar_element(e, detected) for detected in elements
+     #     )]
+     #
+     #     # Provide feedback
+     #     return {
+     #         "verified_elements": elements,
+     #         "missing_elements": missing_elements,
+     #         "confidence": calculate_overall_confidence(elements, expected_elements)
+     #     }
+     # ```
+
      # Return structured data and visualization
      return {
          "elements": elements,
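Several of the comment sketches added in app.py reference a `calculate_iou` helper without defining it. A minimal, self-contained version (hypothetical, not part of this commit, including the `merge_ocr_text` helper and the 0.1 overlap threshold) could look like:

```python
def calculate_iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def merge_ocr_text(yolo_box, ocr_results, threshold=0.1):
    """Collect OCR strings whose boxes overlap a detection box."""
    return " ".join(text for text, text_box in ocr_results
                    if calculate_iou(yolo_box, text_box) > threshold)

# A detection box that coincides with one OCR result and misses another
ocr = [("Submit", [0.1, 0.1, 0.3, 0.2]), ("Header", [0.0, 0.9, 1.0, 1.0])]
print(merge_ocr_text([0.1, 0.1, 0.3, 0.2], ocr))  # Submit
```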
app_simplified.py CHANGED
@@ -30,6 +30,47 @@ def process_image(image):
      # Define some mock UI element types
      element_types = ["Button", "Text Field", "Checkbox", "Dropdown", "Menu Item", "Icon", "Link"]

+     # ENHANCEMENT OPPORTUNITY: Data Fusion
+     # In a real implementation, we would integrate multiple models:
+     # 1. YOLO for initial detection of UI elements
+     # 2. OCR for text detection
+     # 3. VLM for captioning and context understanding
+     # 4. SAM for precise segmentation
+     #
+     # Example architecture:
+     # ```
+     # def integrated_detection(image):
+     #     # 1. Run YOLO to detect UI elements
+     #     yolo_boxes = yolo_model(image)
+     #
+     #     # 2. Run OCR to detect text
+     #     ocr_results = ocr_model(image)
+     #
+     #     # 3. Use VLM to understand the overall context
+     #     context = vlm_model.analyze_image(image)
+     #
+     #     # 4. For each detected element, use SAM for precise segmentation
+     #     elements = []
+     #     for box in yolo_boxes:
+     #         # Get SAM mask
+     #         mask = sam_model.segment(image, box)
+     #
+     #         # Find overlapping text from OCR
+     #         element_text = find_overlapping_text(box, ocr_results)
+     #
+     #         # Use VLM to caption the element with context
+     #         caption = vlm_model.caption_region(image, box, context)
+     #
+     #         elements.append({
+     #             "box": box,
+     #             "mask": mask,
+     #             "text": element_text,
+     #             "caption": caption
+     #         })
+     #
+     #     return elements
+     # ```
+
      # Generate some random elements
      elements = []
      num_elements = min(15, int(image.width * image.height / 40000))  # Scale with image size
@@ -57,6 +98,34 @@ def process_image(image):
          text = random.choice(captions[element_type])
          caption = f"{element_type}: {text}"

+         # ENHANCEMENT OPPORTUNITY: Confidence Scoring
+         # In a real implementation, confidence scores would be calculated based on:
+         # 1. Detection confidence from YOLO
+         # 2. Text recognition confidence from OCR
+         # 3. Caption confidence from VLM
+         # 4. Segmentation confidence from SAM
+         #
+         # Example implementation:
+         # ```
+         # def calculate_confidence(detection_conf, ocr_conf, vlm_conf, sam_conf):
+         #     # Weighted average of confidence scores
+         #     weights = {
+         #         "detection": 0.4,
+         #         "ocr": 0.2,
+         #         "vlm": 0.3,
+         #         "sam": 0.1
+         #     }
+         #
+         #     confidence = (
+         #         weights["detection"] * detection_conf +
+         #         weights["ocr"] * ocr_conf +
+         #         weights["vlm"] * vlm_conf +
+         #         weights["sam"] * sam_conf
+         #     )
+         #
+         #     return confidence
+         # ```
+
          # Add to elements list
          elements.append({
              "id": i,
@@ -71,10 +140,40 @@ def process_image(image):
          draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
          draw.text((x1, y1 - 10), f"{i}: {text}", fill="red")

+     # ENHANCEMENT OPPORTUNITY: Predictive Monitoring
+     # In a real implementation, we would verify the detected elements:
+     # 1. Check if the detected elements make sense in the UI context
+     # 2. Verify that interactive elements have appropriate labels
+     # 3. Ensure that the UI structure is coherent
+     #
+     # Example implementation:
+     # ```
+     # def verify_ui_elements(elements, image):
+     #     # Use VLM to analyze the entire UI
+     #     ui_analysis = vlm_model.analyze_ui(image)
+     #
+     #     # Check if detected elements match the expected UI structure
+     #     verified_elements = []
+     #     for element in elements:
+     #         # Verify element type based on appearance and context
+     #         verified_type = verify_element_type(element, ui_analysis)
+     #
+     #         # Verify interactability
+     #         verified_interactable = verify_interactability(element, verified_type)
+     #
+     #         verified_elements.append({
+     #             **element,
+     #             "verified_type": verified_type,
+     #             "verified_interactable": verified_interactable
+     #         })
+     #
+     #     return verified_elements
+     # ```
+
      return {
          "elements": elements,
          "visualization": vis_img,
-         "note": "This is a simplified implementation that simulates OmniParser functionality."
+         "note": "This is a simplified implementation that simulates OmniParser functionality. For a real implementation, consider integrating YOLO, VLM, OCR, and SAM models as described in the code comments."
      }

  # API endpoint function
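The weighted-average scheme described in the Confidence Scoring comments can be made runnable with little effort. This sketch uses the weights from the comment block (0.4/0.2/0.3/0.1 are the commit's illustrative values, not tuned numbers) and additionally clamps the result to [0, 1]:

```python
def calculate_confidence(detection_conf, ocr_conf, vlm_conf, sam_conf,
                         weights=None):
    """Weighted average of per-model confidence scores.

    The default weights mirror the illustrative values in the code
    comments; real values would need to be tuned empirically.
    """
    weights = weights or {"detection": 0.4, "ocr": 0.2, "vlm": 0.3, "sam": 0.1}
    score = (weights["detection"] * detection_conf
             + weights["ocr"] * ocr_conf
             + weights["vlm"] * vlm_conf
             + weights["sam"] * sam_conf)
    # Clamp in case callers pass custom weights that don't sum to 1
    return max(0.0, min(1.0, score))

print(calculate_confidence(1.0, 1.0, 1.0, 1.0))            # 1.0
print(round(calculate_confidence(0.9, 0.5, 0.8, 0.7), 3))  # 0.77
```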
requirements.txt CHANGED
@@ -7,7 +7,8 @@ numpy>=1.24.0
  easyocr>=1.7.0
  # Use a specific version of paddleocr that works with our patch
  paddleocr==2.6.0.3
- paddlepaddle==2.4.2
+ # Use a version of paddlepaddle that is available
+ paddlepaddle>=2.5.0
  opencv-python>=4.7.0
  huggingface_hub>=0.16.0
  peft>=0.4.0