SAM 3.1 NVFP4 — Detector (No Language), Static-6

NVFP4-quantized variant of facebook/sam3.1. Only the detector backbone's vision trunk is quantized; the language backbone stays in FP32.

Quantization Details

Property	Value
Method	NVFP4 (`custom_nvfp4_e2m1_e4m3_scales`)
Block size	16
Scale rule	`static_6`
Quantized scope	`detector.*` (vision trunk only)
Language backbone	FP32 (left unquantized to preserve prompt accuracy)
Total parameters	874 M
Quantized parameters	472 M (54.0%)
FP32-kept parameters	402 M (46.0%)
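
As a rough illustration of the settings in the table, the sketch below fake-quantizes values in blocks of 16 onto the E2M1 magnitude grid. It assumes that `static_6` means each block's scale maps its largest magnitude to 6.0 (the largest E2M1 value); this card does not define the rule, so treat that reading as a guess. The function names are illustrative only, and rounding of the block scale to E4M3 is omitted for brevity.

```python
import math

# Magnitudes representable in FP4 E2M1 (the sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def fake_quantize_block(block):
    """Quantize and dequantize one block with a shared scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Assumed meaning of `static_6`: the block maximum lands on 6.0.
    scale = amax / 6.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest grid point.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def fake_quantize(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of the block size")
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(fake_quantize_block(values[start:start + BLOCK_SIZE]))
    return out
```

Comparing `fake_quantize(w)` with `w` gives a feel for the per-block rounding error that a per-tensor error report would summarize.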

Fidelity

Validated against source SAM 3.1 on mukbang / food video frames with Sapiens-2 human-part exclusion masking:

Threshold	Mean foreground IoU	Pixel agreement
conf = 0.35	0.988	0.99897
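
The two metrics in the table can be sketched as follows, assuming each mask is a flat sequence of 0/1 pixel labels of equal length. How the reported numbers treat excluded human-part pixels and empty frames is not stated here, so the empty-union convention below is an assumption and the function names are illustrative.

```python
def foreground_iou(ref_mask, test_mask):
    """Intersection over union of the foreground (non-zero) pixels."""
    inter = sum(1 for r, t in zip(ref_mask, test_mask) if r and t)
    union = sum(1 for r, t in zip(ref_mask, test_mask) if r or t)
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def pixel_agreement(ref_mask, test_mask):
    """Fraction of pixels on which both masks give the same label."""
    same = sum(1 for r, t in zip(ref_mask, test_mask) if bool(r) == bool(t))
    return same / len(ref_mask)


def mean_foreground_iou(mask_pairs):
    """Average the per-frame IoU over (reference, test) mask pairs."""
    scores = [foreground_iou(ref, test) for ref, test in mask_pairs]
    return sum(scores) / len(scores)
```

Pixel agreement is dominated by background pixels, which is one reason it can sit closer to 1.0 than the foreground IoU on the same frames.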

Some NVFP4 masks appear visually sharper than the source FP32 output, possibly a de-noising side effect of quantization.

Files

File	Description
`nvfp4_model.safetensors`	NVFP4-packed detector tensors
`sam3.1_multiplex.pt`	FP32 non-quantized tensors (language backbone, heads)
`config.json`	SAM 3.1 model config
`quantization_config.json`	NVFP4 packing metadata
`quant_error_report.json`	Per-tensor L2 error report
`tokenizer*.json / vocab.json / merges.txt`	Language-backbone tokenizer
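
To check what a `.safetensors` file such as `nvfp4_model.safetensors` contains without loading any weights, its header can be read with the standard library alone. The safetensors layout is an 8-byte little-endian header length followed by a JSON header; the tensor names and dtypes stored in this particular file are not documented here, so the sketch makes no assumption about them.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} for a file."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Optional free-form metadata is not a tensor entry.
    header.pop("__metadata__", None)
    return header


# Example (the path is wherever the file was downloaded to):
# for name, info in read_safetensors_header("nvfp4_model.safetensors").items():
#     print(name, info["dtype"], info["shape"])
```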

Usage

# Load the quantized model using the nomnomlabel loader
from nomnomlabel.quant_loader import load_sam3_nvfp4

model = load_sam3_nvfp4("Reza2kn/sam3.1-nvfp4-detector-no-language")

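# Segment food (with language prompt)
from PIL import Image
from nomnomlabel.sam3_food_classifier import SAM3FoodClassifier

# Hypothetical input frame; assumes the classifier accepts a PIL image.
image = Image.open("frame.jpg").convert("RGB")

classifier = SAM3FoodClassifier(model_id="Reza2kn/sam3.1-nvfp4-detector-no-language")
segments = classifier.segment_and_classify_food(image, conf_threshold=0.35)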

for seg in segments:
    print(f"Food: {seg.food_type} (conf={seg.food_conf:.2f})")
    print(f"  Mask area: {seg.area} pixels")
    print(f"  BBox: {seg.bbox}")

Training / Benchmark Context

Built as part of the NomNomLabel Sapiens2 Benchmark for food instance segmentation in long-form mukbang / eating content. The benchmark uses:

SAM 3.1 NVFP4 (this model) for food segmentation
Sapiens-2 INT4-G128 for human-part segmentation and exclusion masking
YOLO11L FoodSeg103 as a reference food detector
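
Exclusion masking, as used in the pipeline above, can be pictured as removing every pixel labeled as a human part from a food mask. The sketch below assumes both masks are flat 0/1 sequences of the same length; the function name is hypothetical, and the benchmark's actual implementation may differ (for example, by dilating the human mask first).

```python
def apply_exclusion(food_mask, human_mask):
    """Keep food pixels only where no human part was detected."""
    if len(food_mask) != len(human_mask):
        raise ValueError("masks must have the same number of pixels")
    return [1 if f and not h else 0 for f, h in zip(food_mask, human_mask)]
```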

Paper

SAM 2: Segment Anything in Images and Videos (arXiv:2408.00714)

Quantized by Reza2kn using torchao NVFP4.
