---
title: OmniScience -- Agentic Imaging Analysis
emoji: 🔬🧫
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: openrail
---

# Using Landing AI's Vision Agent to architect an app for brain tumor detection

- A quick overview of the inner workings of LandingAI's Vision Agent and how it breaks down an initial user requirement into candidate components of the application architecture.
- The diagram below captures what I had in mind for a multi-agent system implementation -- but LandingAI's Vision Agent starts this work much earlier, taking a fresh approach to old-school architecture trade-off analysis.
- The design-time flow in the most recent version of the official [Vision Agent](https://va.landing.ai/agent) app has shifted somewhat, but the number of concepts it helped bring together for me was amazing.
- If you want a deeper understanding of the run-time flow of the application, I encourage you to instrument it with Weave; a minimal sketch follows the diagram below, and additional details are in [this GitHub repo](https://github.com/donbr/vision-agent).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/653d62fab16f657d28ce2cf2/KPV1Szj6IkY457n3Hqjl6.png)
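
As a taste of what that instrumentation looks like, here is a minimal sketch that traces a single tool call with Weave. The project name is a hypothetical placeholder, and the `vision_agent.tools` import path is an assumption based on the library at the time of writing; the GitHub repo linked above shows the full, working setup.

```python
# Minimal sketch of Weave tracing around one of the Vision Agent tools documented
# below. "omniscience-demo" is a hypothetical project name; the vision_agent.tools
# import path is an assumption -- see the linked repo for the working version.
import weave
from vision_agent.tools import load_image, owl_v2

weave.init("omniscience-demo")  # hypothetical W&B project name


@weave.op()
def traced_tumor_detection(image_path: str):
    """Trace a single owl_v2 detection call so its inputs/outputs show up in Weave."""
    image = load_image(image_path)
    return owl_v2("brain tumor", image)


# Example usage (placeholder path from the prompt below):
# detections = traced_tumor_detection("/home/user/xxxx.jpg")
```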

# Summary of Vision Agent design-time workflow

MY PROMPT: use detection (bounding box) and segmentation (segmentation mask) techniques to detect brain tumors in the image.

## Creating instructions

- Load the image from the given file path '/home/user/xxxx.jpg'.
- Use the 'owl_v2' tool to detect brain tumors in the image. The prompt should be 'brain tumor'.
- Use the 'grounding_sam' tool to segment brain tumors in the image. The prompt should be 'brain tumor'.
- Overlay the bounding boxes from the detection results on the original image using the 'overlay_bounding_boxes' utility.
- Overlay the segmentation masks from the segmentation results on the original image using the 'overlay_segmentation_masks' utility.
- Save the final image with both bounding boxes and segmentation masks to a specified output path.
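
Taken together, these steps translate fairly directly into code. The sketch below is my own composition of the plan using the tool signatures documented later in this README (assuming they are importable from `vision_agent.tools`); it is not the exact code Vision Agent generated, and the paths are placeholders.

```python
# Rough sketch of the generated plan, assuming the vision_agent.tools signatures
# documented later in this README; not the exact code produced by Vision Agent.
from vision_agent.tools import (
    load_image,
    owl_v2,
    grounding_sam,
    overlay_bounding_boxes,
    overlay_segmentation_masks,
    save_image,
)


def detect_and_segment_brain_tumors(image_path: str, output_path: str):
    # Step 1: load the image as a NumPy array.
    image = load_image(image_path)

    # Step 2: detection -- bounding boxes with normalized coordinates and scores.
    detections = owl_v2("brain tumor", image)

    # Step 3: segmentation -- boxes plus binary masks for each match.
    segments = grounding_sam("brain tumor", image)

    # Steps 4-5: overlay boxes first, then masks, on the same image.
    annotated = overlay_bounding_boxes(image, detections)
    annotated = overlay_segmentation_masks(annotated, segments)

    # Step 6: write the annotated image to disk.
    save_image(annotated, output_path)
    return detections, segments


# Example usage (placeholder paths):
# detect_and_segment_brain_tumors("/home/user/xxxx.jpg", "/home/user/xxxx_annotated.jpg")
```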


## Retrieving Vision Agent tools

- 'load_image' is a utility function that loads an image from the given file path string.
- 'save_image' is a utility function that saves an image to a file path.
- 'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.
- 'florencev2_object_detection' is a tool that can detect common objects in an image without any text prompt or thresholding. It returns a list of detected objects as labels and their location as bounding boxes.
- 'grounding_sam' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas or periods. It returns a list of bounding boxes, label names, mask file names and associated probability scores.
- 'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.
- 'overlay_bounding_boxes' is a utility function that displays bounding boxes on an image.
- 'overlay_heat_map' is a utility function that displays a heat map on an image.
- 'overlay_segmentation_masks' is a utility function that displays segmentation masks.


### Retrieving tools - detailed notes from Vision Agent tool selection

```
load_image(image_path: str) -> numpy.ndarray:
'load_image' is a utility function that loads an image from the given file path string.

    Parameters:
        image_path (str): The path to the image.

    Returns:
        np.ndarray: The image as a NumPy array.

    Example
    -------
        >>> load_image("path/to/image.jpg")


save_image(image: numpy.ndarray, file_path: str) -> None:
'save_image' is a utility function that saves an image to a file path.

    Parameters:
        image (np.ndarray): The image to save.
        file_path (str): The path to save the image file.

    Example
    -------
        >>> save_image(image, "path/to/image.jpg")

owl_v2(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1, iou_threshold: float = 0.1) -> List[Dict[str, Any]]:
'owl_v2' is a tool that can detect and count multiple objects given a text
    prompt such as category names or referring expressions. The categories in text prompt
    are separated by commas. It returns a list of bounding boxes with
    normalized coordinates, label names and associated probability scores.

    Parameters:
        prompt (str): The prompt to ground to the image.
        image (np.ndarray): The image to ground the prompt to.
        box_threshold (float, optional): The threshold for the box detection. Defaults
            to 0.10.
        iou_threshold (float, optional): The threshold for the Intersection over Union
            (IoU). Defaults to 0.10.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
            bounding box of the detected objects with normalized coordinates between 0
            and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
            top-left and xmax and ymax are the coordinates of the bottom-right of the
            bounding box.

    Example
    -------
        >>> owl_v2("car. dinosaur", image)
        [
            {'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
            {'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
        ]

florencev2_object_detection(image: numpy.ndarray) -> List[Dict[str, Any]]:
'florencev2_object_detection' is a tool that can detect common objects in an
    image without any text prompt or thresholding. It returns a list of detected objects
    as labels and their location as bounding boxes.

    Parameters:
        image (np.ndarray): The image used to detect objects.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
            bounding box of the detected objects with normalized coordinates between 0
            and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
            top-left and xmax and ymax are the coordinates of the bottom-right of the
            bounding box. The scores are always 1.0 and cannot be thresholded

    Example
    -------
        >>> florencev2_object_detection(image)
        [
            {'score': 1.0, 'label': 'window', 'bbox': [0.1, 0.11, 0.35, 0.4]},
            {'score': 1.0, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
            {'score': 1.0, 'label': 'person', 'bbox': [0.34, 0.21, 0.85, 0.5]},
        ]

grounding_sam(prompt: str, image: numpy.ndarray, box_threshold: float = 0.2, iou_threshold: float = 0.2) -> List[Dict[str, Any]]:
'grounding_sam' is a tool that can segment multiple objects given a
    text prompt such as category names or referring expressions. The categories in text
    prompt are separated by commas or periods. It returns a list of bounding boxes,
    label names, mask file names and associated probability scores.

    Parameters:
        prompt (str): The prompt to ground to the image.
        image (np.ndarray): The image to ground the prompt to.
        box_threshold (float, optional): The threshold for the box detection. Defaults
            to 0.20.
        iou_threshold (float, optional): The threshold for the Intersection over Union
            (IoU). Defaults to 0.20.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
            bounding box, and mask of the detected objects with normalized coordinates
            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
            and xmax and ymax are the coordinates of the bottom-right of the bounding box.
            The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
            the background.

    Example
    -------
        >>> grounding_sam("car. dinosaur", image)
        [
            {
                'score': 0.99,
                'label': 'dinosaur',
                'bbox': [0.1, 0.11, 0.35, 0.4],
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            },
        ]

detr_segmentation(image: numpy.ndarray) -> List[Dict[str, Any]]:
'detr_segmentation' is a tool that can segment common objects in an
    image without any text prompt. It returns a list of detected objects
    as labels, their regions as masks and their scores.

    Parameters:
        image (np.ndarray): The image used to segment things and objects

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label
            and mask of the detected objects. The mask is binary 2D numpy array where 1
            indicates the object and 0 indicates the background.

    Example
    -------
        >>> detr_segmentation(image)
        [
            {
                'score': 0.45,
                'label': 'window',
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            },
            {
                'score': 0.70,
                'label': 'bird',
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            },
        ]

overlay_bounding_boxes(image: numpy.ndarray, bboxes: List[Dict[str, Any]]) -> numpy.ndarray:
'overlay_bounding_boxes' is a utility function that displays bounding boxes on
    an image.

    Parameters:
        image (np.ndarray): The image to display the bounding boxes on.
        bboxes (List[Dict[str, Any]]): A list of dictionaries containing the bounding
            boxes.

    Returns:
        np.ndarray: The image with the bounding boxes, labels and scores displayed.

    Example
    -------
        >>> image_with_bboxes = overlay_bounding_boxes(
            image, [{'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]}],
        )

overlay_heat_map(image: numpy.ndarray, heat_map: Dict[str, Any], alpha: float = 0.8) -> numpy.ndarray:
'overlay_heat_map' is a utility function that displays a heat map on an image.

    Parameters:
        image (np.ndarray): The image to display the heat map on.
        heat_map (Dict[str, Any]): A dictionary containing the heat map under the key
            'heat_map'.
        alpha (float, optional): The transparency of the overlay. Defaults to 0.8.

    Returns:
        np.ndarray: The image with the heat map displayed.

    Example
    -------
        >>> image_with_heat_map = overlay_heat_map(
            image,
            {
                'heat_map': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 125, 125, 125]], dtype=uint8),
            },
        )

overlay_segmentation_masks(image: numpy.ndarray, masks: List[Dict[str, Any]]) -> numpy.ndarray:
'overlay_segmentation_masks' is a utility function that displays segmentation
    masks.

    Parameters:
        image (np.ndarray): The image to display the masks on.
        masks (List[Dict[str, Any]]): A list of dictionaries containing the masks.

    Returns:
        np.ndarray: The image with the masks displayed.

    Example
    -------
        >>> image_with_masks = overlay_segmentation_masks(
            image,
            [{
                'score': 0.99,
                'label': 'dinosaur',
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            }],
        )
```
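
One thing worth noting from these signatures: both `owl_v2` and `grounding_sam` return probability scores, so before overlaying results you may want to drop low-confidence matches. A small helper along these lines (the 0.5 cutoff is an arbitrary illustration, not a validated threshold):

```python
from typing import Any, Dict, List


def filter_by_score(
    detections: List[Dict[str, Any]], min_score: float = 0.5
) -> List[Dict[str, Any]]:
    """Keep only detections/segments whose 'score' meets the cutoff.

    Works for both owl_v2 and grounding_sam outputs, since both return a list
    of dicts containing a 'score' key. The 0.5 default is an arbitrary example,
    not a tuned or clinically validated threshold.
    """
    return [d for d in detections if d.get("score", 0.0) >= min_score]
```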

## Vision Agent Tools - model summary

- Any mistakes in the following table are mine; it reflects my quick attempt at reverse engineering the target models behind each tool. The snippet after the table shows one way I spot-checked a mapping.

| Model Name          | Hugging Face Model                  | Primary Function               | Use Cases                                                    |
|---------------------|-------------------------------------|-------------------------------|--------------------------------------------------------------|
| OWL-ViT v2          | google/owlv2-base-patch16-ensemble  | Object detection and localization | - Open-world object detection<br>- Locating specific objects based on text prompts |
| Florence-2          | microsoft/Florence-2-base           | Multi-purpose vision tasks      | - Image captioning<br>- Visual question answering<br>- Object detection |
| Depth Anything V2   | LiheYoung/depth-anything-v2-small   | Depth estimation                | - Estimating depth in images<br>- Generating depth maps      |
| CLIP                | openai/clip-vit-base-patch32        | Image-text similarity           | - Zero-shot image classification<br>- Image-text matching    |
| BLIP                | Salesforce/blip-image-captioning-base | Image captioning                | - Generating text descriptions of images                    |
| LOCA                | Custom implementation               | Object counting                 | - Zero-shot object counting<br>- Object counting with visual prompts |
| GIT v2              | microsoft/git-base-vqav2            | Visual question answering and image captioning | - Answering questions about image content<br>- Generating text descriptions of images |
| Grounding DINO      | groundingdino/groundingdino-swint-ogc | Object detection and localization | - Detecting objects based on text prompts                   |
| SAM                 | facebook/sam-vit-huge               | Instance segmentation           | - Text-prompted instance segmentation                       |
| DETR                | facebook/detr-resnet-50             | Object detection                | - General object detection                                  |
| ViT                 | google/vit-base-patch16-224         | Image classification            | - General image classification<br>- NSFW content detection  |
| DPT                 | Intel/dpt-hybrid-midas              | Monocular depth estimation      | - Estimating depth from single images                       |
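
To sanity-check the OWL-ViT v2 row, one option is to call the candidate Hugging Face checkpoint directly. The snippet below is my own check, not part of Vision Agent; it assumes the `transformers` zero-shot object detection pipeline and the model id from the table above.

```python
# Quick sanity check of the OWLv2 mapping in the table above, using the
# transformers zero-shot object detection pipeline. The model id comes from
# my reverse engineering and may not match what Vision Agent actually calls.
from transformers import pipeline
from PIL import Image

detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlv2-base-patch16-ensemble",
)

image = Image.open("/home/user/xxxx.jpg")  # placeholder path from the prompt above
results = detector(image, candidate_labels=["brain tumor"], threshold=0.1)
for r in results:
    # Each result has 'score', 'label', and a pixel-space 'box' dict
    # (unlike owl_v2, which returns normalized coordinates).
    print(r["score"], r["label"], r["box"])
```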