siglip-video_16384

This model was trained from scratch on an unknown dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 16
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 8
total_train_batch_size: 128
total_eval_batch_size: 8
optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.03
num_epochs: 1.0
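
The relationship between these values can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the training code: it assumes gradient accumulation of 1 (consistent with 16 x 8 = 128) and a linear-warmup-then-cosine schedule with half a cosine cycle, which is how the `cosine` scheduler type usually behaves. The total number of optimizer steps is not stated on this card, so `total_steps` is a hypothetical parameter.

```python
import math

LEARNING_RATE = 1e-4   # learning_rate from the list above
PER_DEVICE_BATCH = 16  # train_batch_size from the list above
NUM_DEVICES = 8        # num_devices from the list above
WARMUP_RATIO = 0.03    # lr_scheduler_warmup_ratio from the list above


def total_train_batch_size(per_device, num_devices, grad_accum_steps=1):
    """Effective batch size per optimizer step (grad_accum_steps=1 is an assumption)."""
    return per_device * num_devices * grad_accum_steps


def cosine_lr(step, total_steps, base_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    print(total_train_batch_size(PER_DEVICE_BATCH, NUM_DEVICES))
    # total_steps=1000 is a made-up value used only to show the curve shape.
    for step in (0, 15, 30, 500, 1000):
        print(step, cosine_lr(step, total_steps=1000))
```

With a 3% warmup ratio, the learning rate climbs for the first 3% of steps, reaches its peak, and then decays along a cosine curve to zero at the final step.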

Training results

Framework versions

Transformers 4.47.1
Pytorch 2.5.1+cu124
Datasets 2.18.0
Tokenizers 0.21.0
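
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch and not part of the original card: the helper names `installed_versions` and `mismatches` are hypothetical, and it assumes the packages were installed under their usual pip distribution names (`torch` for Pytorch).

```python
from importlib import metadata

# Versions listed on this card, keyed by pip distribution name.
EXPECTED = {
    "transformers": "4.47.1",
    "torch": "2.5.1+cu124",
    "datasets": "2.18.0",
    "tokenizers": "0.21.0",
}


def installed_versions(packages=EXPECTED):
    """Return {name: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(expected=EXPECTED):
    """Return {name: (expected, installed)} for packages that differ."""
    found = installed_versions(expected)
    return {
        name: (want, found[name])
        for name, want in expected.items()
        if found[name] != want
    }


if __name__ == "__main__":
    for name, (want, have) in mismatches().items():
        print(f"{name}: card lists {want}, found {have}")
```

A mismatch does not necessarily mean the model will fail to load; it only flags a difference from the environment used for training.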

Wandb

https://wandb.ai/dongfu/Mantis/runs/tatohkvl?nw=nwuserdongfu

Code


import av
import numpy as np
import torch
from transformers import AutoProcessor
from mantis.models.siglip_video import SiglipVideoModel

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    if len(indices) == 0:
        # Fall back to decoding only the first frame.
        indices = [0]
        print("No indices to decode; the video might be empty, please check.")
    start_index = indices[0]
    end_index = indices[-1]
    # A set gives constant-time membership checks while iterating over frames.
    wanted = {int(i) for i in indices}
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in wanted:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

# model = SiglipVideoModel.from_pretrained("google/siglip-so400m-patch14-384")
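# Use a GPU when one is available; change the device string to match your setup.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SiglipVideoModel.from_pretrained("Mantis-VL/siglip-video_16384_2fps_128").to(device)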
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")


container = av.open("../mochi.mp4")
# container = av.open("/home/dongfu/WorkSpace/Mantis/data/llava-video/data/0_30_s_youtube_v0_1/videos/liwei_youtube_videos/videos/youtube_video_2024/ytb_F-FpE2GWW84.mp4")
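total_frames = container.streams.video[0].frames
sample_fps = 2
ori_fps = container.streams.video[0].average_rate
# Keep one frame every `step` source frames. Clamp to 1 so that videos with a
# frame rate below 2 * sample_fps do not produce a zero step in np.arange.
step = max(1, int(ori_fps / sample_fps))
indices = np.arange(0, total_frames, step)
frames = read_video_pyav(container, indices)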

text = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
# text = "The video showcases a group of individuals dressed in matching military-style uniforms, consisting of long, light-colored tunics and dark vests, marching in unison. They are carrying large, black, shoulder-mounted weapons, and the background appears to be an open area, possibly a parade ground or a military base, with a clear sky overhead. The text overlay in English reads, 'Talabani won victory over America with an impossible weapon,' suggesting a narrative of triumph using unconventional means. The individuals are seen marching in a coordinated manner, emphasizing discipline and uniformity. As the video progresses, the group continues their synchronized march, maintaining the same background setting. The text overlay, 'Talabani won victory over America with an impossible weapon,' reappears, reinforcing the narrative of triumph. One individual in the foreground is prominently holding a rifle, adding to the display of military prowess. The video emphasizes the themes of discipline, coordination, and military strength."

print(frames.shape)
inputs = processor(text=[text], images=frames, padding="max_length", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
inputs['pixel_values'] = [inputs['pixel_values']]
with torch.no_grad():
    outputs = model(**inputs)
logits_per_video = outputs.logits_per_video
print(logits_per_video)
probs = torch.sigmoid(logits_per_video)  # sigmoid gives an independent match probability per video-text pair
print(f"{probs[0][0]:.1%} probability that the video matches the text: '{text}'")
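
The frame selection in the snippet above (one frame every `ori_fps / sample_fps` source frames) can be isolated into a small helper and tried without a video file. This is a minimal sketch, not part of the Mantis code base: `sample_frame_indices` is a hypothetical name, and it uses only the standard library rather than NumPy.

```python
from fractions import Fraction


def sample_frame_indices(total_frames, ori_fps, sample_fps=2):
    """Return evenly spaced frame indices that approximate `sample_fps`."""
    if total_frames <= 0:
        return []
    # Number of source frames between two sampled frames, never below 1.
    step = max(1, int(Fraction(ori_fps) / sample_fps))
    return list(range(0, total_frames, step))


if __name__ == "__main__":
    # A 3-second clip at 30 fps sampled at 2 fps.
    print(sample_frame_indices(90, 30, 2))
    # An NTSC-style frame rate, in the Fraction form that PyAV reports.
    print(sample_frame_indices(60, Fraction(30000, 1001), 2))
```

Because the step is truncated to an integer, the effective sampling rate can be slightly higher than `sample_fps` when the source frame rate is not an exact multiple of it.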
