rawhad/grounding-dino-base-screen-ai-v1

Hey, love the idea. I was trying to test the model with the code below:

from transformers import AutoProcessor, GroundingDinoForObjectDetection
import torch
from PIL import Image, ImageDraw
import matplotlib.pyplot as plt

url = "test.png"
img = Image.open(url).convert("RGB")
draw = ImageDraw.Draw(img)
model_id = "rawhad/grounding-dino-base-screen-ai-v1"
text = "" # not sure what to put here, is it required?

image_processor = AutoProcessor.from_pretrained(model_id)
model = GroundingDinoForObjectDetection.from_pretrained(model_id)

inputs = image_processor(images=img, text=text, return_tensors="pt")
outputs = model(**inputs)

target_sizes = torch.tensor([img.size[::-1]])
results = image_processor.image_processor.post_process_object_detection(
    outputs, threshold=0.85, target_sizes=target_sizes
)[0]

print(results)

I'm not sure what should go in text, or whether it is needed at all. I tried an empty string and "ui components", and neither gave me any results.

Thank you for your work!

Hey there @04RR

Actually, I have been trying to train this model but cannot get it to work right, so it is still a WIP. The code you wrote is correct. A few suggestions:

Start with the original GroundingDINO model.
Text is the description of the object you want to detect in the image, e.g. "a cat lying on the sofa".
For post-processing, use processor.post_process_grounded_object_detection, like below:

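# processor is the AutoProcessor and image is the PIL image
# (image_processor and img in the code you posted)
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.28,
    text_threshold=0.0,
    target_sizes=[image.size[::-1]]
)[0]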
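On the text prompt: as far as I know, the original Grounding DINO model cards recommend lowercase queries that each end with a period, which may be part of why an empty string or "ui components" returned nothing. Here is a minimal sketch of a helper that builds such a prompt from a list of descriptions. The name build_prompt and the example labels are made up for illustration and are not part of transformers.

```python
def build_prompt(labels):
    """Join object descriptions into a single lowercase, period-terminated prompt.

    Hypothetical helper for illustration; not part of the transformers API.
    """
    cleaned = []
    for label in labels:
        # Lowercase, trim whitespace, and drop any trailing period
        phrase = label.strip().lower().rstrip(".").strip()
        if phrase:
            cleaned.append(phrase)
    # Terminate every phrase with a period and separate phrases with a space
    return " ".join(phrase + "." for phrase in cleaned)


# Example labels are assumptions, chosen to resemble UI elements
text = build_prompt(["A button", "a text field", "Checkbox."])
print(text)
```

The resulting string can be passed as text to the processor in place of the empty string.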

You can play with the box and text thresholds.

Box threshold: confidence threshold for keeping a predicted bounding box
Text threshold: confidence threshold on the match between a box and the tokens of the text prompt; tokens scoring above it make up the box's label
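To see why the box threshold matters so much, here is a rough sketch of what it does, using made-up scores and plain Python lists in place of the tensors that the real post-processing returns. The function filter_by_score and all of the numbers are hypothetical. The actual filtering happens inside the processor.

```python
def filter_by_score(detections, min_score):
    """Keep only detections whose box score is at least min_score."""
    return [
        (score, label, box)
        for score, label, box in zip(
            detections["scores"], detections["labels"], detections["boxes"]
        )
        if score >= min_score
    ]


# Hypothetical detections standing in for post-processed model output
detections = {
    "scores": [0.91, 0.42, 0.30, 0.12],
    "labels": ["a button", "a text field", "a button", "a checkbox"],
    "boxes": [
        [10, 10, 90, 40],
        [10, 60, 200, 90],
        [120, 10, 200, 40],
        [5, 5, 20, 20],
    ],
}

# Compare a strict threshold with a more permissive one
for threshold in (0.85, 0.28):
    kept = filter_by_score(detections, threshold)
    print(f"box_threshold={threshold}: {len(kept)} detection(s) kept")
```

With these made-up scores, a threshold of 0.85 keeps one box while 0.28 keeps three, so a strict threshold can easily hide every detection from a model that is not confident yet.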

For now this model is still a WIP. Will update when it's trained.

Inference code?