Latest vision & multimodal releases in transformers


In this post you can find short summaries & snippets for the latest vision & multimodal releases in transformers, namely the following models: D-FINE (SOTA object detection), Qwen2.5-Omni, and SAM-HQ.

D-FINE

D-FINE is a SOTA real-time object detector that runs on a T4 (free-tier Colab). The transformers release ships with notebook examples for tracking, inference, and fine-tuning, as well as a demo.
Here's the model-specific install:

pip install git+https://github.com/huggingface/transformers@v4.51.3-D-FINE-preview

Inference:

import torch
import requests

from PIL import Image
from transformers import DFineForObjectDetection, AutoImageProcessor

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine-large-obj365")
model = DFineForObjectDetection.from_pretrained("ustc-community/dfine-large-obj365")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3
)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")

Qwen2.5-Omni

Qwen2.5-Omni is a 7B any-to-any model by Qwen: it takes image, audio, video, and text inputs and produces text and natural speech outputs.
Here's the model-specific install:

pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview

Inference with text and video input. Feel free to swap the video for images in the chat template when inferring with images; a short sketch follows the snippet below.

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=2,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
)

# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)

text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
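
For image input, only the conversation changes. Here's a minimal sketch, assuming the image content format used in the transformers chat templates; the image URL is just a placeholder example, and the audio/video-specific kwargs are dropped:

conversation = [
    {
        "role": "user",
        "content": [
            # any local path or URL works here; this one is only a placeholder example
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
)

text_ids, audio = model.generate(**inputs)
print(processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))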

Model collection: https://huggingface.co/collections/Qwen/qwen25-omni-67de1e5f0f9464dc6314b36e

SAM-HQ

SAM-HQ is an improvement on top of SAM that produces higher-quality masks by using more features from the image encoder, refining masks with an additional MLP, and adding a novel token to the decoder inputs.
Here's the model-specific install:

pip install git+https://github.com/huggingface/transformers@v4.51.3-SAM-HQ-preview

Point inference:

import torch
from PIL import Image
import requests
from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("sushmanth/sam_hq_vit_b").to(device)
processor = SamHQProcessor.from_pretrained("sushmanth/sam_hq_vit_b")

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]]  # 2D location of a window in the image

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
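
To pick and save the highest-scoring mask, here's a minimal sketch building on the variables above. It assumes the same output layout as SAM in transformers (one entry per image, shaped point batch x num masks x height x width) and a single image with a single input point, matching the snippet:

import numpy as np

# scores has shape (batch, point_batch, num_masks); pick the highest-IoU prediction
best_idx = scores[0, 0].argmax().item()

# masks[0] has shape (point_batch, num_masks, H, W) at the original image resolution
best_mask = masks[0][0, best_idx].numpy()

Image.fromarray((best_mask * 255).astype(np.uint8)).save("mask.png")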
