SAM 3.1 NVFP4 - Detector (No Language), Static-6

NVFP4 quantized variant of facebook/sam3.1. Quantizes the detector backbone's vision trunk while keeping the language backbone in FP32.

Quantization Details

| Property | Value |
| --- | --- |
| Method | NVFP4 (custom_nvfp4_e2m1_e4m3_scales) |
| Block size | 16 |
| Scale rule | static_6 |
| Quantized scope | detector.* (vision trunk only) |
| Language backbone | FP32 (kept raw for prompt accuracy) |
| Total parameters | 874 M |
| Quantized parameters | 472 M (54.0%) |
| FP32-kept parameters | 402 M (46.0%) |
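
To make the table concrete, here is a minimal NumPy sketch of block-wise FP4 (E2M1) fake quantization with a block size of 16. It is an illustration under stated assumptions, not the packing code used for this checkpoint: `static_6` is assumed to mean that each block's largest magnitude is mapped to 6.0 (the largest E2M1 value), the E4M3 rounding of the per-block scales is omitted, and the function name `fake_quantize_nvfp4` is hypothetical.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK_SIZE = 16  # block size listed in the table above


def fake_quantize_nvfp4(weights, block_size=BLOCK_SIZE):
    """Quantize to the E2M1 grid per block, then dequantize back to float32.

    The number of elements must be divisible by block_size.
    """
    weights = np.asarray(weights, dtype=np.float32)
    flat = weights.reshape(-1, block_size)

    # Assumed "static_6" rule: scale = block max magnitude / 6.0.
    # All-zero blocks get a scale of 1.0 to avoid dividing by zero.
    amax = np.abs(flat).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / 6.0, 1.0).astype(np.float32)

    # Round every scaled magnitude to the nearest grid point.
    scaled = flat / scale
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)

    # Restore the sign and undo the scaling.
    dequant = np.sign(scaled) * E2M1_GRID[idx] * scale
    return dequant.reshape(weights.shape)
```

Under this rule the widest gap in the grid (between 4.0 and 6.0) bounds the per-element error at one sixth of the block's maximum magnitude, which is why small blocks help: a single outlier only degrades the 16 values that share its scale.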

Fidelity

Validated against source SAM 3.1 on mukbang / food video frames with Sapiens-2 human-part exclusion masking:

| Threshold | Mean foreground IoU | Pixel agreement |
| --- | --- | --- |
| conf = 0.35 | 0.988 | 0.99897 |
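
The card does not spell out how these two metrics are computed, so the following is a sketch of the usual definitions for a pair of binary masks (reference FP32 output vs. NVFP4 output). The function names are hypothetical, and averaging over frames to obtain the "mean" is an assumption.

```python
import numpy as np


def foreground_iou(ref_mask, test_mask):
    """Intersection over union of the foreground (True) pixels."""
    ref = np.asarray(ref_mask, dtype=bool)
    test = np.asarray(test_mask, dtype=bool)
    union = np.logical_or(ref, test).sum()
    if union == 0:
        # Both masks are empty: treat them as a perfect match.
        return 1.0
    return float(np.logical_and(ref, test).sum() / union)


def pixel_agreement(ref_mask, test_mask):
    """Fraction of pixels on which the two masks agree."""
    ref = np.asarray(ref_mask, dtype=bool)
    test = np.asarray(test_mask, dtype=bool)
    return float((ref == test).mean())


def mean_over_frames(metric, ref_masks, test_masks):
    """Average a per-frame metric over paired lists of masks."""
    return float(np.mean([metric(r, t) for r, t in zip(ref_masks, test_masks)]))
```

Pixel agreement counts background pixels as matches, so it is typically higher than foreground IoU when the foreground is a small part of the frame.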

Some NVFP4 masks appear visually sharper than the source FP32 output, possibly a de-noising side effect of quantization; this is a qualitative observation, not a measured result.

Files

| File | Description |
| --- | --- |
| nvfp4_model.safetensors | NVFP4-packed detector tensors |
| sam3.1_multiplex.pt | FP32 non-quantized tensors (language backbone, heads) |
| config.json | SAM 3.1 model config |
| quantization_config.json | NVFP4 packing metadata |
| quant_error_report.json | Per-tensor L2 error report |
| tokenizer*.json / vocab.json / merges.txt | Language-backbone tokenizer |
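
To inspect what `nvfp4_model.safetensors` contains without loading any tensors, the header can be read with the standard library alone. This sketch relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the tensor names and dtypes in this particular file are not documented here, and the helper names are hypothetical.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the tensor entries from a .safetensors header as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Optional free-form metadata is not a tensor entry.
    header.pop("__metadata__", None)
    return header


def summarize_bytes_by_dtype(header):
    """Sum the payload size in bytes for each dtype in the header."""
    totals = {}
    for info in header.values():
        begin, end = info["data_offsets"]
        totals[info["dtype"]] = totals.get(info["dtype"], 0) + (end - begin)
    return totals
```

For example, `summarize_bytes_by_dtype(read_safetensors_header("nvfp4_model.safetensors"))` gives a quick view of how much of the file is packed weight data versus scale tensors, assuming the two are stored under different dtypes.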

Usage

# Load the quantized model using the nomnomlabel loader
from nomnomlabel.quant_loader import load_sam3_nvfp4

model = load_sam3_nvfp4("Reza2kn/sam3.1-nvfp4-detector-no-language")

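# Segment food (with language prompt)
from PIL import Image
from nomnomlabel.sam3_food_classifier import SAM3FoodClassifier

# Assumption: the classifier accepts a PIL RGB image; "frame.jpg" is a placeholder path.
image = Image.open("frame.jpg").convert("RGB")

classifier = SAM3FoodClassifier(model_id="Reza2kn/sam3.1-nvfp4-detector-no-language")
segments = classifier.segment_and_classify_food(image, conf_threshold=0.35)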

for seg in segments:
    print(f"Food: {seg.food_type} (conf={seg.food_conf:.2f})")
    print(f"  Mask area: {seg.area} pixels")
    print(f"  BBox: {seg.bbox}")

Training / Benchmark Context

Built as part of the NomNomLabel Sapiens2 Benchmark for food instance segmentation in long-form mukbang / eating content. The benchmark uses:

  • SAM 3.1 NVFP4 (this model) for food segmentation
  • Sapiens-2 INT4-G128 for human-part segmentation and exclusion masking
  • YOLO11L FoodSeg103 as a reference food detector
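
The exclusion-masking step mentioned above can be sketched as a boolean operation on two masks of the same frame. How the benchmark actually combines the Sapiens-2 and SAM outputs is not specified in this card, so this is one plausible reading, and `apply_exclusion` is a hypothetical helper.

```python
import numpy as np


def apply_exclusion(food_mask, human_mask):
    """Remove pixels covered by the human-part mask from the food mask."""
    food = np.asarray(food_mask, dtype=bool)
    human = np.asarray(human_mask, dtype=bool)
    if food.shape != human.shape:
        raise ValueError("food_mask and human_mask must have the same shape")
    # Keep food pixels only where no human part was detected.
    return food & ~human
```

In eating footage, hands and faces frequently overlap the food, so removing human-part pixels before comparing masks keeps those regions from dominating the comparison.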

Paper

SAM 2: Segment Anything in Images and Videos


Quantized by Reza2kn using torchao NVFP4.
