How can I obtain a feature representation from this fine-tuned model?


How can I obtain a feature representation from this fine-tuned model, rather than the feature representation from the pre-trained model?

Hi, may I ask if you have solved this problem? I also want to obtain features from the model. Could you please advise me on how to do it?

Hi,

You can easily get a feature representation of a video by average pooling the final hidden states of all the patch tokens (note that VideoMAE does not add a special CLS token):

from transformers import VideoMAEImageProcessor, VideoMAEModel
import numpy as np
import torch

num_frames = 16
# dummy video: a list of 16 frames, each of shape (3, 224, 224)
video = list(np.random.randn(num_frames, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

pixel_values = processor(video, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(pixel_values)

# average pool the final hidden states of all patch tokens: one feature vector per video
feature = outputs.last_hidden_state.mean(dim=1)
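
Since the question is about a fine-tuned model: if your checkpoint is a VideoMAEForVideoClassification model, you can get the same kind of features by requesting the encoder hidden states. A minimal sketch, shown with MCG-NJU/videomae-base-finetuned-kinetics as a stand-in; replace it with your own fine-tuned checkpoint:

from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

video = list(np.random.randn(16, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

pixel_values = processor(video, return_tensors="pt").pixel_values

with torch.no_grad():
    # request the encoder hidden states in addition to the classification logits
    outputs = model(pixel_values, output_hidden_states=True)

# hidden_states[-1] has shape (batch, num_patch_tokens, hidden_size);
# average pooling it gives one feature vector per video
feature = outputs.hidden_states[-1].mean(dim=1)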

@nielsr Sorry to bother you. I want to feed my own dataset into this video model and obtain either a sentence that describes each video, or an intermediate feature vector that reflects the video's content. However, since this model was not trained on my dataset, the results are very poor. Is there a good solution, such as choosing a new, more robust model? If so, do you have any recommendations or other options? Looking forward to your reply, thank you very much.

What if I want to extract features from my own video?

You can use VideoMAEImageProcessor to prepare your own video for the model.
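
For example, here is a minimal sketch that uniformly samples 16 frames from a local file with OpenCV and feeds them to the model (the path my_video.mp4 is a placeholder, and OpenCV is just one option; any frame-decoding library works):

import cv2
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

# decode the video and uniformly sample 16 frames
cap = cv2.VideoCapture("my_video.mp4")
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
indices = np.linspace(0, total_frames - 1, num=16, dtype=int)

frames = []
for idx in indices:
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV returns BGR
cap.release()

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

# the processor resizes, crops and normalizes the frames to shape (1, 16, 3, 224, 224)
pixel_values = processor(frames, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(pixel_values)

feature = outputs.last_hidden_state.mean(dim=1)  # one feature vector for the video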

@boomshark Hi, I am looking to do a very similar task! Just wondering if you solved this using VideoMAE or if you switched models?
