How can I obtain a feature representation from this fine-tuned model?
How can I obtain a feature representation from this fine-tuned model, not the feature representation from the pre-trained model?
Hi, may I ask if you have solved this problem? I also want to obtain features from the model. Could you please advise me on how to do it?
Hi,
You can easily get a feature representation of a video by average pooling the final hidden states of all the patch tokens (note that VideoMAE does not add a special CLS token, so mean pooling is the way to go):
```python
from transformers import VideoMAEImageProcessor, VideoMAEModel
import numpy as np
import torch

num_frames = 16
# dummy video: 16 frames of shape (3, 224, 224)
video = list(np.random.randn(num_frames, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
# use the base VideoMAEModel rather than VideoMAEForPreTraining: its output exposes
# last_hidden_state, and no bool_masked_pos is needed when you only want features
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

pixel_values = processor(video, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(pixel_values)

# average pool over all patch tokens -> (batch_size, hidden_size)
feature = outputs.last_hidden_state.mean(dim=1)
```
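To get features from your fine-tuned checkpoint rather than the pre-trained one, the same pooling applies. Here's a minimal sketch, assuming your checkpoint is a `VideoMAEForVideoClassification` saved with `save_pretrained` (the checkpoint path below is a placeholder):

```python
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# placeholder path (or Hub id) of your fine-tuned checkpoint
checkpoint = "path/to/your-finetuned-videomae"

# assumes the preprocessor config was saved alongside the model weights
processor = VideoMAEImageProcessor.from_pretrained(checkpoint)
model = VideoMAEForVideoClassification.from_pretrained(checkpoint)

video = list(np.random.randn(16, 3, 224, 224))
pixel_values = processor(video, return_tensors="pt").pixel_values

with torch.no_grad():
    # the classification model wraps the encoder as `model.videomae`;
    # calling it directly returns the encoder's hidden states
    outputs = model.videomae(pixel_values)

feature = outputs.last_hidden_state.mean(dim=1)  # (batch_size, hidden_size)
```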
@nielsr Sorry to bother you, but here is my situation. I want to feed my own dataset into this video model and obtain either a sentence that describes the video or an intermediate feature vector that reflects the video's content. However, since this model was not trained on my dataset, the results are very poor. Is there a good solution, such as choosing a more robust model? If so, do you have any recommendations or other options? Looking forward to your reply, thank you very much.
What if I want to use data from my own video?
You can use VideoMAEImageProcessor to prepare your own video for the model.
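For example, here's a minimal sketch of reading frames from your own file and preparing them with the processor, assuming decord is installed and using "my_video.mp4" as a placeholder path:

```python
from decord import VideoReader, cpu
from transformers import VideoMAEImageProcessor, VideoMAEModel
import numpy as np
import torch

# placeholder path to your own video file
vr = VideoReader("my_video.mp4", ctx=cpu(0))

# uniformly sample 16 frames across the whole clip
indices = np.linspace(0, len(vr) - 1, num=16).astype(np.int64)
frames = list(vr.get_batch(indices).asnumpy())  # list of (H, W, 3) uint8 arrays

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

# the processor resizes, crops and normalizes the frames
pixel_values = processor(frames, return_tensors="pt").pixel_values

with torch.no_grad():
    feature = model(pixel_values).last_hidden_state.mean(dim=1)
```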
@boomshark Hi, I am looking to do a very similar task! Just wondering if you solved this using VideoMAE or if you switched models?