	---
	license: mit
	tags:
	- vision
	- image-segmentation
	datasets:
	- YouTubeVIS-2019
	---

	# Video Mask2Former

	Video Mask2Former model trained on YouTubeVIS-2019 instance segmentation (tiny-sized version, Swin backbone). It was introduced in the paper [Mask2Former for Video Instance Segmentation](https://arxiv.org/abs/2112.10764) and first released in [this repository](https://github.com/facebookresearch/Mask2Former/).
	Video Mask2Former extends the original Mask2Former, which was introduced in the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527).

	Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team.

	## Model description

	Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all three tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous state of the art,
	[MaskFormer](https://arxiv.org/abs/2107.06278), in both performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance
	without introducing additional computation, and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.
	In the paper [Mask2Former for Video Instance Segmentation](https://arxiv.org/abs/2112.10764), the authors show that Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline.
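
To make the "set of masks and corresponding labels" paradigm concrete, the sketch below shows one common way such predictions can be merged into a per-pixel label map: each query's class probabilities (with the "no object" entry dropped) are weighted by its mask probability at every pixel, and the highest-scoring class wins. The function names, the list-based data layout and the toy values are illustrative assumptions and are not taken from the model's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def semantic_map(class_logits, mask_logits):
    """Merge per-query class and mask predictions into a per-pixel label map.

    class_logits: [num_queries][num_classes + 1], last entry is "no object"
    mask_logits:  [num_queries][height][width]
    """
    num_queries = len(class_logits)
    num_classes = len(class_logits[0]) - 1
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    # Drop the trailing "no object" probability for every query.
    class_probs = [softmax(query)[:num_classes] for query in class_logits]
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of
            # P(class c | query) * P(pixel belongs to the query's mask).
            scores = [
                sum(class_probs[q][c] * sigmoid(mask_logits[q][y][x]) for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Toy example (assumed values): 2 queries, 2 classes plus "no object", a 1x2 image.
toy_class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
toy_mask_logits = [[[4.0, -4.0]], [[-4.0, 4.0]]]
toy_labels = semantic_map(toy_class_logits, toy_mask_logits)
print(toy_labels)
```

In the toy example, query 0 claims the left pixel for class 0 and query 1 claims the right pixel for class 1, so the map assigns one class to each pixel.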

	![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/mask2former_architecture.png)
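
Point (iii) of the model description, computing the loss on subsampled points rather than whole masks, can be sketched as follows. This is a simplified illustration that draws points uniformly at random and averages a binary cross-entropy over them. The sampling strategy, the function names and the toy masks are assumptions for illustration and may differ from what the original implementation does.

```python
import math
import random


def bce_with_logits(logit, target):
    """Binary cross-entropy for one logit/target pair, in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def point_sampled_mask_loss(mask_logits, target_mask, num_points, rng):
    """Average BCE over `num_points` uniformly sampled pixels instead of every pixel."""
    height, width = len(mask_logits), len(mask_logits[0])
    total = 0.0
    for _ in range(num_points):
        y, x = rng.randrange(height), rng.randrange(width)
        total += bce_with_logits(mask_logits[y][x], target_mask[y][x])
    return total / num_points


# Toy 8x8 ground-truth mask (assumed): left half foreground, right half background.
toy_target = [[1.0 if x < 4 else 0.0 for x in range(8)] for _ in range(8)]
# A confident correct prediction and a confident wrong one.
good_logits = [[10.0 if t == 1.0 else -10.0 for t in row] for row in toy_target]
bad_logits = [[-value for value in row] for row in good_logits]

rng = random.Random(0)
good_loss = point_sampled_mask_loss(good_logits, toy_target, num_points=16, rng=rng)
bad_loss = point_sampled_mask_loss(bad_logits, toy_target, num_points=16, rng=rng)
print(good_loss, bad_loss)
```

Only 16 of the 64 pixels are evaluated per call, which is where the memory and compute savings come from when masks are large.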

	## Intended uses & limitations

	You can use this particular checkpoint for instance segmentation. See the [model hub](https://huggingface.co/models?search=video-mask2former) to look for other fine-tuned versions of this model that may interest you.

	### How to use

	Here is how to use this model:

	```python
	import torch
	import torchvision
	from huggingface_hub import hf_hub_download
	from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation


	# load Video Mask2Former fine-tuned on YouTubeVIS-2019 instance segmentation
	processor = AutoImageProcessor.from_pretrained("facebook/video-mask2former-swin-tiny-youtubevis-2019-instance")
	model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/video-mask2former-swin-tiny-youtubevis-2019-instance")

	# download a demo video and read its frames as a tensor of shape (num_frames, height, width, channels)
	file_path = hf_hub_download(repo_id="shivi/video-demo", filename="cars.mp4", repo_type="dataset")
	video = torchvision.io.read_video(file_path)[0]
	# preprocess each frame, then concatenate the frames along the first dimension
	video_frames = [processor(images=frame, return_tensors="pt").pixel_values for frame in video]
	video_input = torch.cat(video_frames)

	with torch.no_grad():
	    outputs = model(pixel_values=video_input)

	# model predicts class_queries_logits of shape `(batch_size, num_queries, num_labels + 1)`
	# and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
	class_queries_logits = outputs.class_queries_logits
	masks_queries_logits = outputs.masks_queries_logits

	# you can pass them to processor for postprocessing
	# (requires a transformers version whose Mask2Former image processor provides this method)
	result = processor.post_process_video_instance_segmentation(outputs, target_sizes=[tuple(video.shape[1:3])])[0]
	# we refer to the demo notebooks for visualization (see "Resources" section in the Mask2Former docs)
	predicted_video_instance_map = result["segmentation"]
	```
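
As a rough illustration of what post-processing does with `class_queries_logits` and `masks_queries_logits`, the sketch below keeps the queries whose best real class is confident enough and binarizes their masks frame by frame. It works on plain Python lists so it runs without downloading the model. The function name `select_video_instances`, the score threshold, the data layout and the toy values are assumptions for illustration, and the library's own post-processing may select and resize instances differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def select_video_instances(class_logits, mask_logits, score_threshold=0.5):
    """Keep confident queries and binarize their per-frame masks.

    class_logits: [num_queries][num_classes + 1], last entry is "no object"
    mask_logits:  [num_queries][num_frames][height][width]
    """
    instances = []
    for query_id, (query_classes, query_masks) in enumerate(zip(class_logits, mask_logits)):
        probs = softmax(query_classes)[:-1]  # drop the "no object" probability
        label = max(range(len(probs)), key=probs.__getitem__)
        score = probs[label]
        if score < score_threshold:
            continue  # query most likely matched nothing
        # A logit above 0 corresponds to a sigmoid probability above 0.5.
        binary_masks = [[[value > 0.0 for value in row] for row in frame] for frame in query_masks]
        instances.append({"query": query_id, "label": label, "score": score, "masks": binary_masks})
    return instances


# Toy example (assumed values): 2 queries, 2 classes plus "no object", 2 frames of 1x2 pixels.
toy_class_logits = [[0.0, 6.0, 0.0], [0.0, 0.0, 6.0]]
toy_mask_logits = [
    [[[3.0, -3.0]], [[-1.0, 2.0]]],  # query 0: object moves from left to right pixel
    [[[1.0, 1.0]], [[1.0, 1.0]]],    # query 1: discarded because it predicts "no object"
]
toy_instances = select_video_instances(toy_class_logits, toy_mask_logits)
print(toy_instances)
```

Because each surviving query carries one label and one mask per frame, the same query index identifies the same object across the whole clip.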

	For more code examples, see the [documentation](https://huggingface.co/docs/transformers/master/en/model_doc/mask2former).