SAM 2: Segment Anything in Images and Videos
Paper • arXiv 2408.00714
How to use Reza2kn/sam3.1-nvfp4-detector-no-language with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("feature-extraction", model="Reza2kn/sam3.1-nvfp4-detector-no-language")
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Reza2kn/sam3.1-nvfp4-detector-no-language")
model = AutoModelForMultimodalLM.from_pretrained("Reza2kn/sam3.1-nvfp4-detector-no-language")
```

NVFP4 quantized variant of facebook/sam3.1. Quantizes the detector backbone's vision trunk while keeping the language backbone in FP32.
| Property | Value |
|---|---|
| Method | NVFP4 (custom_nvfp4_e2m1_e4m3_scales) |
| Block size | 16 |
| Scale rule | static_6 |
| Quantized scope | detector.* (vision trunk only) |
| Language backbone | FP32 (kept raw for prompt accuracy) |
| Total parameters | 874 M |
| Quantized parameters | 472 M (54.0%) |
| FP32-kept parameters | 402 M (46.0%) |
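To make the table concrete, here is a minimal NumPy sketch of block-wise FP4 (E2M1) fake quantization with a block size of 16. It is an illustration, not the code used to produce this checkpoint. It assumes that `static_6` means each block's scale maps the block's largest magnitude to 6, the largest E2M1 value. It also keeps scales in float32 rather than rounding them to E4M3. The function names are hypothetical.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK_SIZE = 16  # block size from the table above


def quantize_blocks(weights):
    """Fake-quantize a 1-D array whose length is a multiple of BLOCK_SIZE.

    Returns (codes, scales): signed E2M1 values per element, one scale per block.
    Assumption: scale = block_amax / 6. Real NVFP4 also rounds scales to E4M3,
    which this sketch skips.
    """
    blocks = np.asarray(weights, dtype=np.float32).reshape(-1, BLOCK_SIZE)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # All-zero blocks get a scale of 1 to avoid dividing by zero.
    scales = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0).astype(np.float32)
    scaled = blocks / scales
    # Snap each magnitude to the nearest representable grid point.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * E2M1_GRID[idx]
    return codes, scales


def dequantize_blocks(codes, scales):
    """Reconstruct approximate weights from codes and per-block scales."""
    return (codes * scales).reshape(-1)
```

Because the widest gap in the E2M1 grid is between 4 and 6, the worst-case rounding error per element under this scheme is one sixth of the block's largest magnitude.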
Validated against source SAM 3.1 on mukbang / food video frames with Sapiens-2 human-part exclusion masking:
| Threshold | Mean foreground IoU | Pixel agreement |
|---|---|---|
| conf = 0.35 | 0.988 | 0.99897 |
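The two metrics in the table can be computed per frame roughly as follows. This is a sketch under assumptions, not the benchmark's actual evaluation code: masks are taken to be boolean arrays of equal shape, the human-part exclusion mask is assumed to mark pixels to ignore, and the function names are hypothetical.

```python
import numpy as np


def _kept_pixels(ref, test, exclude=None):
    """Return both masks as flat boolean arrays with excluded pixels dropped."""
    ref = np.asarray(ref, dtype=bool)
    test = np.asarray(test, dtype=bool)
    if exclude is None:
        keep = np.ones(ref.shape, dtype=bool)
    else:
        keep = ~np.asarray(exclude, dtype=bool)
    return ref[keep], test[keep]


def foreground_iou(ref, test, exclude=None):
    """Intersection over union of the foreground, ignoring excluded pixels."""
    ref, test = _kept_pixels(ref, test, exclude)
    union = np.logical_or(ref, test).sum()
    if union == 0:
        return 1.0  # both masks are empty, so they agree perfectly
    return float(np.logical_and(ref, test).sum() / union)


def pixel_agreement(ref, test, exclude=None):
    """Fraction of non-excluded pixels on which the two masks agree."""
    ref, test = _kept_pixels(ref, test, exclude)
    return float((ref == test).mean())
```

Averaging `foreground_iou` over all validation frames would give a mean foreground IoU of the kind reported above.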
Some NVFP4 masks are visually sharper than the source FP32 output due to quantization-induced de-noising.
| File | Description |
|---|---|
| nvfp4_model.safetensors | NVFP4-packed detector tensors |
| sam3.1_multiplex.pt | FP32 non-quantized tensors (language backbone, heads) |
| config.json | SAM 3.1 model config |
| quantization_config.json | NVFP4 packing metadata |
| quant_error_report.json | Per-tensor L2 error report |
| tokenizer*.json / vocab.json / merges.txt | Language-backbone tokenizer |
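As an illustration of what a per-tensor L2 error report can contain, the sketch below compares original tensors with their dequantized counterparts. The actual schema of quant_error_report.json is not documented here, so the function name and the field names `l2_error` and `relative_l2_error` are hypothetical.

```python
import json

import numpy as np


def l2_error_report(original, dequantized):
    """Compute absolute and relative L2 error for each named tensor.

    Both arguments map tensor names to array-likes of matching shape.
    """
    report = {}
    for name, ref in original.items():
        ref = np.asarray(ref, dtype=np.float64)
        approx = np.asarray(dequantized[name], dtype=np.float64)
        abs_err = float(np.linalg.norm(ref - approx))
        ref_norm = float(np.linalg.norm(ref))
        report[name] = {
            "l2_error": abs_err,
            # Guard against all-zero reference tensors.
            "relative_l2_error": abs_err / ref_norm if ref_norm > 0 else 0.0,
        }
    return report


if __name__ == "__main__":
    demo = l2_error_report({"w": [3.0, 4.0]}, {"w": [3.0, 0.0]})
    print(json.dumps(demo, indent=2))
```

The relative error is usually the more useful of the two, since it is comparable across tensors of different size and magnitude.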
```python
# Load the quantized model using the nomnomlabel loader
from nomnomlabel.quant_loader import load_sam3_nvfp4

model = load_sam3_nvfp4("Reza2kn/sam3.1-nvfp4-detector-no-language")

# Segment food (with language prompt)
from nomnomlabel.sam3_food_classifier import SAM3FoodClassifier

classifier = SAM3FoodClassifier(model_id="Reza2kn/sam3.1-nvfp4-detector-no-language")
# `image` is the input frame, loaded beforehand by the caller
segments = classifier.segment_and_classify_food(image, conf_threshold=0.35)
for seg in segments:
    print(f"Food: {seg.food_type} (conf={seg.food_conf:.2f})")
    print(f"  Mask area: {seg.area} pixels")
    print(f"  BBox: {seg.bbox}")
```
Built as part of the NomNomLabel Sapiens2 Benchmark for food instance segmentation in long-form mukbang / eating content. The benchmark uses:
SAM 2: Segment Anything in Images and Videos
Quantized by Reza2kn using torchao NVFP4.
Base model: facebook/sam3.1