SAM3 Video

This model was released on 2025-11-19 and added to Hugging Face Transformers on 2025-11-19.


PyTorch · SDPA · FlashAttention

Overview

SAM3 (Segment Anything Model 3) was introduced in SAM 3: Segment Anything with Concepts.

SAM3 Video performs Promptable Concept Segmentation (PCS) on videos. PCS takes text as input (e.g., “yellow school bus”), and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames.

The model combines a detection module (SAM3) with a tracking module (SAM2-style tracker) to enable robust object tracking across video frames using text prompts.

The abstract from the paper is the following:

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.

This model was contributed by yonigozlan and ronghanghu.

Usage example

Video Segmentation and Tracking

Pre-loaded Video Inference

Process a video with all frames already available using text prompts:

>>> from transformers import Sam3VideoModel, Sam3VideoProcessor
>>> from accelerate import Accelerator
>>> import torch

>>> device = Accelerator().device
>>> model = Sam3VideoModel.from_pretrained("facebook/sam3").to(device, dtype=torch.bfloat16)
>>> processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")

>>> # Load video frames
>>> from transformers.video_utils import load_video
>>> video_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/bedroom.mp4"
>>> video_frames, _ = load_video(video_url)

>>> # Initialize video inference session
>>> inference_session = processor.init_video_session(
...     video=video_frames,
...     inference_device=device,
...     processing_device="cpu",
...     video_storage_device="cpu",
...     dtype=torch.bfloat16,
... )

>>> # Add text prompt to detect and track objects
>>> text = "person"
>>> inference_session = processor.add_text_prompt(
...     inference_session=inference_session,
...     text=text,
... )

>>> # Process all frames in the video
>>> outputs_per_frame = {}
>>> for model_outputs in model.propagate_in_video_iterator(
...     inference_session=inference_session, max_frame_num_to_track=50
... ):
...     processed_outputs = processor.postprocess_outputs(inference_session, model_outputs)
...     outputs_per_frame[model_outputs.frame_idx] = processed_outputs

>>> print(f"Processed {len(outputs_per_frame)} frames")
Processed 51 frames

>>> # Access results for a specific frame
>>> frame_0_outputs = outputs_per_frame[0]
>>> print(f"Detected {len(frame_0_outputs['object_ids'])} objects")
>>> print(f"Object IDs: {frame_0_outputs['object_ids'].tolist()}")
>>> print(f"Scores: {frame_0_outputs['scores'].tolist()}")
>>> print(f"Boxes shape (XYXY format, absolute coordinates): {frame_0_outputs['boxes'].shape}")
>>> print(f"Masks shape: {frame_0_outputs['masks'].shape}")

Streaming Video Inference

⚠️ **Note on Streaming Inference Quality**: Streaming inference disables hotstart heuristics that remove unmatched and duplicate objects, as these require access to future frames to make informed decisions. This may result in more false positive detections and duplicate object tracks compared to pre-loaded video inference. For best results, use pre-loaded video inference when all frames are available.

For real-time applications, SAM3 Video supports processing video frames as they arrive:

>>> # Initialize session for streaming
>>> streaming_inference_session = processor.init_video_session(
...     inference_device=device,
...     processing_device="cpu",
...     video_storage_device="cpu",
...     dtype=torch.bfloat16,
... )

>>> # Add text prompt
>>> text = "person"
>>> streaming_inference_session = processor.add_text_prompt(
...     inference_session=streaming_inference_session,
...     text=text,
... )

>>> # Process frames one by one (streaming mode)
>>> streaming_outputs_per_frame = {}
>>> for frame_idx, frame in enumerate(video_frames[:50]):  # Process first 50 frames
...     # First, process the frame using the processor
...     inputs = processor(images=frame, device=device, return_tensors="pt")
...
...     # Process frame using streaming inference - pass the processed pixel_values
...     model_outputs = model(
...         inference_session=streaming_inference_session,
...         frame=inputs.pixel_values[0],  # Provide processed frame - this enables streaming mode
...         reverse=False,
...     )
...
...     # Post-process outputs with original_sizes for proper resolution handling
...     processed_outputs = processor.postprocess_outputs(
...         streaming_inference_session,
...         model_outputs,
...         original_sizes=inputs.original_sizes,  # Required for streaming inference
...     )
...     streaming_outputs_per_frame[frame_idx] = processed_outputs
...
...     if (frame_idx + 1) % 10 == 0:
...         print(f"Processed {frame_idx + 1} frames...")

>>> print(f"✓ Streaming inference complete! Processed {len(streaming_outputs_per_frame)} frames")
✓ Streaming inference complete! Processed 50 frames

>>> # Access results
>>> frame_0_outputs = streaming_outputs_per_frame[0]
>>> print(f"Detected {len(frame_0_outputs['object_ids'])} objects in first frame")
>>> print(f"Boxes are in XYXY format (absolute pixel coordinates): {frame_0_outputs['boxes'].shape}")
>>> print(f"Masks are at original video resolution: {frame_0_outputs['masks'].shape}")

Sam3VideoConfig

class transformers.Sam3VideoConfig

< >

( detector_config = None tracker_config = None initializer_range = 0.02 low_res_mask_size = 288 score_threshold_detection = 0.5 det_nms_thresh = 0.1 assoc_iou_thresh = 0.1 trk_assoc_iou_thresh = 0.5 new_det_thresh = 0.7 recondition_on_trk_masks = True hotstart_delay = 15 hotstart_unmatch_thresh = 8 hotstart_dup_thresh = 8 suppress_unmatched_only_within_hotstart = True init_trk_keep_alive = 30 max_trk_keep_alive = 30 min_trk_keep_alive = -1 suppress_overlapping_based_on_recent_occlusion_threshold = 0.7 decrease_trk_keep_alive_for_empty_masklets = False fill_hole_area = 16 max_num_objects = 10000 recondition_every_nth_frame = 16 high_conf_thresh = 0.8 high_iou_thresh = 0.8 **kwargs )

Parameters

  • detector_config (dict or Sam3Config, optional) — Configuration for the Sam3 detector model. If not provided, default Sam3Config will be used.
  • tracker_config (dict or Sam2VideoConfig, optional) — Configuration for the Sam2Video tracker model. If not provided, default Sam2VideoConfig will be used.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing weight matrices.
  • low_res_mask_size (int, optional, defaults to 288) — Size (height and width) of the low-resolution mask outputs from the tracker before upsampling to video resolution.
  • score_threshold_detection (float, optional, defaults to 0.5) — Probability threshold for detection outputs - only keep detections above this threshold.
  • det_nms_thresh (float, optional, defaults to 0.1) — IoU threshold for detection NMS (Non-Maximum Suppression).
  • assoc_iou_thresh (float, optional, defaults to 0.1) — IoU threshold for detection-to-track matching. A detection is considered “matched” to a tracklet if it overlaps with the tracklet above this threshold. Often a loose threshold like 0.1.
  • trk_assoc_iou_thresh (float, optional, defaults to 0.5) — IoU threshold for detection-to-track matching, used to determine whether a masklet is “unmatched” by any detections. Often a stricter threshold like 0.5.
  • new_det_thresh (float, optional, defaults to 0.7) — Probability threshold for a detection to be added as a new object.
  • recondition_on_trk_masks (bool, optional, defaults to True) — Whether to use tracked masks (True) or detection masks (False) for reconditioning. Use True when tracked masks are higher quality and detector serves as validation signal to strengthen memory and prevent drift.
  • hotstart_delay (int, optional, defaults to 15) — Number of frames to buffer outputs during hotstart. We hold off the outputs for hotstart_delay frames and remove tracklets based on hotstart heuristics.
  • hotstart_unmatch_thresh (int, optional, defaults to 8) — Number of unmatched frames required to remove a tracklet during hotstart period.
  • hotstart_dup_thresh (int, optional, defaults to 8) — Number of overlapping frames required to remove a duplicate tracklet during hotstart period.
  • suppress_unmatched_only_within_hotstart (bool, optional, defaults to True) — Whether to suppress masks only within hotstart period. If False, we can suppress masks even if they start before hotstart period.
  • init_trk_keep_alive (int, optional, defaults to 30) — Initial keep-alive counter for new tracks.
  • max_trk_keep_alive (int, optional, defaults to 30) — Maximum keep-alive counter value. Tracks with matched detections get their counter increased up to this value.
  • min_trk_keep_alive (int, optional, defaults to -1) — Minimum keep-alive counter value. Tracks with unmatched detections get their counter decreased to this value.
  • suppress_overlapping_based_on_recent_occlusion_threshold (float, optional, defaults to 0.7) — Threshold for suppressing overlapping objects based on recent occlusion. Overlapping masks with IoU above this threshold are suppressed based on which was most recently occluded.
  • decrease_trk_keep_alive_for_empty_masklets (bool, optional, defaults to False) — Whether to decrease keep-alive counter for masklets with zero area in SAM2 prediction.
  • fill_hole_area (int, optional, defaults to 16) — Minimum area (in pixels) for filling holes in masks and removing small sprinkles.
  • max_num_objects (int, optional, defaults to 10000) — Maximum number of objects to track. Default 10000 effectively turns off this limit.
  • recondition_every_nth_frame (int, optional, defaults to 16) — Frequency of mask reconditioning (in frames). Set to 0 to disable reconditioning.
  • high_conf_thresh (float, optional, defaults to 0.8) — High confidence threshold for reconditioning. Only detections above this threshold can recondition tracklets.
  • high_iou_thresh (float, optional, defaults to 0.8) — High IoU threshold for reconditioning. Only detections with IoU above this threshold can recondition tracklets.

Configuration class for Sam3VideoModel. This combines configurations for the detector (Sam3) and tracker (Sam2Video) components, along with detection-tracking fusion hyperparameters.

Instantiating a configuration with the defaults will yield a configuration similar to that of the SAM 3 facebook/sam3 architecture.

This model integrates detection and tracking with various fusion heuristics including NMS, association, hotstart, reconditioning, and occlusion handling.

Example:

>>> from transformers import Sam3VideoConfig, Sam3VideoModel

>>> # Initializing a SAM3 Video configuration with default detector and tracker
>>> configuration = Sam3VideoConfig()

>>> # Initializing a model from the configuration
>>> model = Sam3VideoModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
>>> detector_config = configuration.detector_config
>>> tracker_config = configuration.tracker_config
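
The detection-tracking fusion hyperparameters listed above can also be overridden at construction time. For example (the values below are illustrative, not tuned recommendations):

>>> # Stricter detection/new-object thresholds and periodic reconditioning disabled
>>> configuration = Sam3VideoConfig(
...     score_threshold_detection=0.6,
...     new_det_thresh=0.8,
...     recondition_every_nth_frame=0,  # 0 disables reconditioning
... )
>>> model = Sam3VideoModel(configuration)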

Sam3VideoProcessor

class transformers.Sam3VideoProcessor

< >

( image_processor video_processor tokenizer target_size: typing.Optional[int] = None **kwargs )

Parameters

  • image_processor (Sam2ImageProcessorFast) — An instance of Sam2ImageProcessorFast.
  • video_processor (Sam2VideoVideoProcessor) — An instance of Sam2VideoVideoProcessor.
  • tokenizer ([PreTrainedTokenizer, PreTrainedTokenizerFast]) — An instance of [PreTrainedTokenizer, PreTrainedTokenizerFast]. The tokenizer is a required input.
  • target_size (int, optional) — The target size (target_size, target_size) to which the image will be resized.

Constructs a SAM3 video processor which wraps a SAM2 image processor, a SAM2 video processor, and a tokenizer into a single processor.

Sam3VideoProcessor offers all the functionalities of Sam2ImageProcessorFast and Sam2VideoVideoProcessor. See the docstring of __call__() for more information.

__call__

< >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None original_sizes: typing.Union[list[list[float]], torch.Tensor, NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None **kwargs ) A BatchEncoding with the following fields

Parameters

  • images (ImageInput, optional) — The image(s) to process.
  • segmentation_maps (ImageInput, optional) — The segmentation maps to process (optional, for image processor).
  • original_sizes (list[list[float]], torch.Tensor, optional) — The original sizes of the images. Only used when images is not provided.
  • return_tensors (str or TensorType, optional) — The type of tensors to return.
  • **kwargs — Additional keyword arguments to pass to the image processor.

Returns

A BatchEncoding with the following fields

  • pixel_values (torch.Tensor): The processed image(s).
  • original_sizes (list[list[float]]): The original sizes of the images.
  • labels (torch.Tensor, optional): The processed segmentation maps (if provided).

This method uses Sam3VideoImageProcessorFast.__call__ to prepare image(s) for the model.
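
A minimal call on a single frame could look as follows (reusing the processor and video frames from the usage example above); pixel_values and original_sizes are the fields consumed by streaming inference:

>>> inputs = processor(images=video_frames[0], return_tensors="pt")
>>> print(inputs.pixel_values.shape)  # processed frame(s), resized for the model
>>> print(inputs.original_sizes)  # original (height, width) of each input image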

postprocess_outputs

< >

( inference_session model_outputs original_sizes: typing.Union[list[list[float]], torch.Tensor, NoneType] = None ) dict

Parameters

  • inference_session (Sam3VideoInferenceSession) — The inference session object.
  • model_outputs (Sam3VideoSegmentationOutput) — The raw model output from Sam3VideoModel.forward().
  • original_sizes (list[list[float]] or torch.Tensor, optional) — Optional original frame sizes [height, width]. Required for streaming inference when video_height/video_width are not set in the session.

Returns

dict

A dictionary containing the following keys:

  • object_ids (torch.Tensor of shape (num_objects,)): Object IDs for each detected object.
  • scores (torch.Tensor of shape (num_objects,)): Detection scores for each object.
  • boxes (torch.Tensor of shape (num_objects, 4)): Bounding boxes in XYXY format (top_left_x, top_left_y, bottom_right_x, bottom_right_y).
  • masks (torch.Tensor of shape (num_objects, height, width)): Binary segmentation masks for each object at the original video resolution.

Post-process model outputs to get final masks, boxes, and scores.
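
The returned tensors share the same first dimension (num_objects), so the dictionary can be filtered further on the application side, for instance keeping only high-confidence objects (the threshold below is an arbitrary application-side choice):

>>> processed = outputs_per_frame[0]
>>> keep = processed["scores"] > 0.8  # arbitrary confidence cutoff
>>> high_conf_ids = processed["object_ids"][keep]
>>> high_conf_boxes = processed["boxes"][keep]
>>> high_conf_masks = processed["masks"][keep]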

init_video_session

< >

( video: typing.Union[list['PIL.Image.Image'], numpy.ndarray, ForwardRef('torch.Tensor'), list[numpy.ndarray], list['torch.Tensor'], list[list['PIL.Image.Image']], list[list[numpy.ndarray]], list[list['torch.Tensor']], transformers.video_utils.URL, list[transformers.video_utils.URL], list[list[transformers.video_utils.URL]], transformers.video_utils.Path, list[transformers.video_utils.Path], list[list[transformers.video_utils.Path]], NoneType] = None inference_device: typing.Union[str, ForwardRef('torch.device')] = 'cpu' inference_state_device: typing.Union[str, ForwardRef('torch.device'), NoneType] = None processing_device: typing.Union[str, ForwardRef('torch.device'), NoneType] = None video_storage_device: typing.Union[str, ForwardRef('torch.device'), NoneType] = None max_vision_features_cache_size: int = 1 dtype: dtype = torch.float32 )

Parameters

  • video (VideoInput, optional) — The video to process. No need to provide when streaming.
  • inference_device (str or torch.device, optional, defaults to “cpu”) — The device to use for inference.
  • inference_state_device (str or torch.device, optional) — The device to store the inference state on.
  • processing_device (str or torch.device, optional) — The device to use for video processing.
  • video_storage_device (str or torch.device, optional) — The device to store the processed video frames on.
  • max_vision_features_cache_size (int, optional, defaults to 1) — The maximum number of vision features to cache.
  • dtype (torch.dtype, optional, defaults to torch.float32) — The torch dtype to use for the whole session.

Initializes a video session for inference. If a video is provided (async inference), the video will be processed and stored on the video_storage_device.
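
The device arguments let the inference state and stored frames live on different devices than the model. For example, when accelerator memory allows, everything can be kept on the same device to avoid host-device transfers (a sketch; the trade-off is memory use versus transfer overhead):

>>> inference_session = processor.init_video_session(
...     video=video_frames,
...     inference_device=device,
...     inference_state_device=device,  # keep tracking state on the accelerator
...     video_storage_device=device,  # keep processed frames on the accelerator
...     dtype=torch.bfloat16,
... )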

add_text_prompt

< >

( inference_session text )

Add a text prompt to the inference session.

Sam3VideoInferenceSession

class transformers.Sam3VideoInferenceSession

< >

( video: typing.Optional[torch.FloatTensor] = None video_height: typing.Optional[int] = None video_width: typing.Optional[int] = None inference_device: typing.Union[torch.device, str] = 'cpu' inference_state_device: typing.Union[torch.device, str] = 'cpu' video_storage_device: typing.Union[torch.device, str] = 'cpu' dtype: typing.Union[torch.dtype, str] = 'float32' max_vision_features_cache_size: int = 1 )

Parameters

  • video (torch.FloatTensor, optional) — The video to process. No need to provide when streaming.
  • video_height (int, optional) — The height of the video.
  • video_width (int, optional) — The width of the video.
  • inference_device (torch.device, optional, defaults to "cpu") — The device to use for inference.
  • inference_state_device (torch.device, optional, defaults to "cpu") — The device to store the inference state on.
  • video_storage_device (torch.device, optional, defaults to "cpu") — The device to store the video on.
  • dtype (torch.dtype, optional, defaults to "float32") — The dtype to use for the video.
  • max_vision_features_cache_size (int, optional, defaults to 1) — The maximum number of vision features to cache.

Manages video inference session parameters, state and cache.

add_mask_inputs

< >

( obj_idx: int frame_idx: int inputs: Tensor )

Add mask inputs with automatic device placement.

add_new_frame

< >

( pixel_values: Tensor frame_idx: typing.Optional[int] = None )

Add new frame with automatic device placement.

get_frame

< >

( frame_idx: int )

Get frame from video.

get_obj_num

< >

( )

Get the total number of unique object ids received so far in this session.

get_output

< >

( obj_idx: int frame_idx: int output_key: str is_conditioning_frame: bool = True )

Parameters

  • obj_idx (int) — The index of the object.
  • frame_idx (int) — The index of the frame.
  • output_key (str) — The key of the output.
  • is_conditioning_frame (bool) — Whether the output is for a conditioning frame.

Get output with smart device management.

obj_id_to_idx

< >

( obj_id: int )

Map object ID to index, creating new entry if needed.

obj_idx_to_id

< >

( obj_idx: int )

Map model-side object index to client-side object id.

remove_mask_inputs

< >

( obj_idx: int frame_idx: int )

Remove mask inputs.

remove_object

< >

( obj_id: int strict: bool = False )

Parameters

  • obj_id (int) — The object ID to remove.

Remove an object from the inference session. This removes the object from all frames of the video.
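
For example, a track that is no longer needed can be dropped mid-session (the object ID below is illustrative):

>>> print(inference_session.get_obj_num())  # number of unique object ids seen so far
>>> inference_session.remove_object(obj_id=1)  # drop this track from all frames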

reset_inference_session

< >

( )

Reset tracking data and cache.

reset_state

< >

( )

Reset the inference session state.

reset_tracking_data

< >

( )

Reset tracking data but keep cache.

store_output

< >

( obj_idx: int frame_idx: int output_key: typing.Optional[str] = None output_value: typing.Union[torch.Tensor, dict, NoneType] = None is_conditioning_frame: bool = True )

Parameters

  • obj_idx (int) — The index of the object.
  • frame_idx (int) — The index of the frame.
  • output_key (Optional[str]) — The key of the output. If None, the output is stored as a dictionary.
  • output_value (Optional[Union[torch.Tensor, dict]]) — The value of the output.
  • is_conditioning_frame (bool) — Whether the output is for a conditioning frame.

Store output with smart device management. If output_key is None, the output is stored as a dictionary.

Sam3VideoSegmentationOutput

class transformers.Sam3VideoSegmentationOutput

< >

( object_ids: typing.Optional[list[int]] = None obj_id_to_mask: typing.Optional[dict[int, torch.FloatTensor]] = None obj_id_to_score: typing.Optional[dict[int, float]] = None obj_id_to_tracker_score: typing.Optional[dict[int, float]] = None removed_obj_ids: typing.Optional[set[int]] = None suppressed_obj_ids: typing.Optional[set[int]] = None frame_idx: typing.Optional[int] = None )

Parameters

  • object_ids (list[int], optional) — List of object IDs being tracked in the current frame.
  • obj_id_to_mask (dict[int, torch.FloatTensor], optional) — Dictionary mapping object IDs to their predicted low-resolution masks. Each mask has shape (1, H_low, W_low).
  • obj_id_to_score (dict[int, float], optional) — Dictionary mapping object IDs to their detection scores.
  • obj_id_to_tracker_score (dict[int, float], optional) — Dictionary mapping object IDs to their tracker scores for the current frame.
  • removed_obj_ids (set[int], optional) — Set of object IDs that have been removed (e.g., via hotstart heuristics).
  • suppressed_obj_ids (set[int], optional) — Set of object IDs that have been suppressed in the current frame.
  • frame_idx (int, optional) — The frame index of the video.

Base class for the Sam3Video model’s output.
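
Inside the propagation loop from the usage example, this raw output can be inspected before post-processing, for instance to see which tracks were suppressed or removed on each frame (a sketch using only the fields documented above):

>>> for model_outputs in model.propagate_in_video_iterator(inference_session=inference_session):
...     print(
...         f"frame {model_outputs.frame_idx}: {len(model_outputs.object_ids)} objects, "
...         f"suppressed={sorted(model_outputs.suppressed_obj_ids)}, "
...         f"removed={sorted(model_outputs.removed_obj_ids)}"
...     )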

Sam3VideoModel

class transformers.Sam3VideoModel

< >

( config: Sam3VideoConfig )

Parameters

  • config (Sam3VideoConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Sam3 Video Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< >

( inference_session: Sam3VideoInferenceSession frame_idx: typing.Optional[int] = None frame: typing.Optional[torch.Tensor] = None reverse: bool = False )

Parameters

  • inference_session (~models.sam3_video.modeling_sam3_video.Sam3VideoInferenceSession) — The video inference session object.
  • frame_idx (int, optional) — The index of the frame on which to run inference. No need to provide when inferring on a new streamed frame.
  • frame (torch.Tensor, optional) — The frame to process. Provide when streaming.
  • reverse (bool, optional, defaults to False) — Whether to propagate in reverse.

Propagate the tracked objects through a single video frame. Pass frame_idx to run on a frame already stored in the session, or frame to run on a new streamed frame (streaming mode).

propagate_in_video_iterator

< >

( inference_session: Sam3VideoInferenceSession start_frame_idx = 0 max_frame_num_to_track = None reverse = False ) Sam3VideoSegmentationOutput

Yields

Sam3VideoSegmentationOutput

Propagate the prompts to get grounding results for the entire video. This method is a generator and yields inference outputs for all frames in the range specified by start_frame_idx, max_frame_num_to_track, and reverse.
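
Propagation can also start from a later frame or run backwards through the video, for example (a sketch; the starting frame index is arbitrary):

>>> # Propagate backwards from frame 30 towards the start of the video
>>> for model_outputs in model.propagate_in_video_iterator(
...     inference_session=inference_session, start_frame_idx=30, reverse=True
... ):
...     processed = processor.postprocess_outputs(inference_session, model_outputs)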
