LocateAnything-3B-NVFP4A16
NVFP4 quantization of nvidia/LocateAnything-3B — a visual-grounding VLM (Qwen2.5-3B-Instruct LLM + Eagle/MoonViT vision encoder) for referring, detection, pointing and layout/OCR localization.
Variant: NVFP4 weight-only (W4A16) — 4-bit float weights, group size 16, per-group FP8 (e4m3) scales + per-tensor FP32 global scales; activations stay BF16
Disk size: ~3.5 GB weights (vs ~7.2 GB in BF16, ~2.05× smaller)
Quantized by: sahilchachra
Tooling: llm-compressor model_free_ptq (data-free, streaming PTQ — no calibration data)
Note on what is quantized: only the language tower's linear weights are taken to NVFP4, namely self_attn.{q,k,v,o}_proj and mlp.{gate,up,down}_proj in each of the 36 Qwen2 decoder layers, i.e. 252 modules. The vision encoder (vision_model.*), the vision→LLM connector (mlp1.*), the token embeddings, lm_head and all norms stay at the source dtype (BF16). The headline variant name reflects the LM-tower quantization, while the on-disk size averages the NVFP4 and BF16 parts of the model.
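The module count and the storage cost of the quantized part can be checked with a few lines of arithmetic. This is a minimal sketch: the `PREFIX` string is a hypothetical placeholder (the real parameter prefix depends on the upstream modeling code), and the per-tensor FP32 global scale is ignored because it adds only a few bytes per module.

```python
NUM_LAYERS = 36
PROJECTIONS = (
    [f"self_attn.{p}_proj" for p in ("q", "k", "v", "o")]
    + [f"mlp.{p}_proj" for p in ("gate", "up", "down")]
)
# Hypothetical prefix, for illustration only; check the checkpoint for the real one.
PREFIX = "language_model.model.layers"

quantized_modules = [
    f"{PREFIX}.{layer}.{proj}" for layer in range(NUM_LAYERS) for proj in PROJECTIONS
]
print(len(quantized_modules))  # 36 layers x 7 projections = 252

# Each quantized weight costs 4 bits plus its share of one 8-bit scale per 16 weights.
BITS_PER_WEIGHT = 4 + 8 / 16
print(BITS_PER_WEIGHT)       # 4.5
print(16 / BITS_PER_WEIGHT)  # compression of the quantized tensors alone vs BF16
```

The quantized tensors alone shrink by roughly 3.6×; the whole checkpoint shrinks less (~2.05×) because the vision tower, connector, embeddings and norms are still stored in BF16.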
Verification (smoke test)
This is a custom-code architecture (trust_remote_code); the checkpoint was verified at the weight level:
- Structure: 252 LM-tower modules carry weight_packed (uint8 FP4) + weight_scale (FP8 e4m3, per group of 16) + weight_global_scale (FP32); the vision tower, connector, embeddings and lm_head remain BF16 and byte-identical to the source.
- NVFP4 decompression round-trip: dequantizing the packed LM weights reproduces the originals with ~9% per-element mean relative error, the expected fidelity of 4-bit FP4 (matmul error-averaging makes the model-level impact much smaller; activations stay BF16).
- Format: nvfp4-pack-quantized (compressed-tensors), standard per-module layout.
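To make the round-trip check concrete, here is a rough pure-Python sketch of NVFP4 quantization followed by dequantization. It is not the code used to produce or verify this checkpoint. It assumes the common convention that a group's effective step is weight_scale / weight_global_scale, with the global scale chosen so the largest per-group scale fits in FP8 e4m3. The error metric is an aggregate ratio (sum of absolute errors over sum of absolute values), which is related to, but not the same as, the per-element figure quoted above.

```python
import math
import random

FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in FP4 e2m1
FP4_MAX, FP8_MAX, GROUP = 6.0, 448.0, 16


def round_to_e4m3(x):
    """Round a non-negative float to the nearest FP8 e4m3 value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    exp = max(math.frexp(x)[1] - 1, -6)  # unbiased exponent, clamped at the subnormal range
    step = 2.0 ** (exp - 3)              # 3 mantissa bits
    return min(round(x / step) * step, FP8_MAX)


def nvfp4_round_trip(weights):
    """Quantize a flat list of floats to NVFP4 and dequantize it again."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights)
    global_scale = FP8_MAX * FP4_MAX / amax  # per-tensor scale (assumed convention)
    restored = []
    for start in range(0, len(weights), GROUP):
        group = weights[start:start + GROUP]
        group_amax = max(abs(w) for w in group)
        scale = round_to_e4m3(global_scale * group_amax / FP4_MAX)  # per-group FP8 scale
        step = scale / global_scale  # effective dequantization step for this group
        for w in group:
            if step == 0.0:
                restored.append(0.0)
                continue
            magnitude = min(FP4_GRID, key=lambda g: abs(abs(w) / step - g))
            restored.append(math.copysign(magnitude * step, w))
    return restored


def aggregate_relative_error(original, restored):
    """Sum of absolute errors divided by the sum of absolute original values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / sum(abs(a) for a in original)


# Synthetic Gaussian weights stand in for a real weight tensor.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
restored = nvfp4_round_trip(weights)
print(f"aggregate relative error: {aggregate_relative_error(weights, restored):.3f}")
```

Values that already sit on the scaled FP4 grid survive the round trip exactly; everything else is snapped to the nearest of eight magnitudes per group, which is why a per-element error of several percent is expected at 4 bits.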
Quantization was performed on an NVIDIA Thor (Blackwell, native NVFP4; 14-core ARM aarch64, 122 GB unified memory, JetPack/L4T R38.4, CUDA 13.0).
For inference, use the base model's runtime: it ships custom modeling code and (per the upstream repo) expects a pinned transformers version for compatibility. NVFP4 weights need a runtime with compressed-tensors support on a Blackwell GPU. Not formally benchmarked for quality.
Usage
Load with trust_remote_code (the architecture is defined in the repo's bundled modeling files):
from transformers import AutoModel, AutoProcessor

# trust_remote_code is required: the architecture lives in the repo's modeling files.
model = AutoModel.from_pretrained(
    "sahilchachra/LocateAnything-3B-NVFP4A16",
    trust_remote_code=True,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(
    "sahilchachra/LocateAnything-3B-NVFP4A16", trust_remote_code=True
)
See nvidia/LocateAnything-3B for the full prompt format and grounding examples (output boxes are normalized to 0..1000).
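Since boxes come back on a 0..1000 grid, they have to be rescaled to the image size before drawing or cropping. The helper below is a minimal sketch; it assumes each box is an (x1, y1, x2, y2) tuple, which should be confirmed against the upstream prompt-format documentation, and the name `to_pixels` is only illustrative.

```python
def to_pixels(box, width, height):
    """Map a box normalized to 0..1000 onto an image of the given pixel size.

    Assumes box = (x1, y1, x2, y2); verify the ordering against the upstream docs.
    """
    x1, y1, x2, y2 = box
    return (
        x1 * width / 1000,
        y1 * height / 1000,
        x2 * width / 1000,
        y2 * height / 1000,
    )


print(to_pixels((250, 500, 750, 1000), width=800, height=600))
# (200.0, 300.0, 600.0, 600.0)
```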
Notes
- Weight-only NVFP4 (W4A16): LM-tower weights are 4-bit, activations and all other modules remain BF16.
- All upstream custom code, processor, tokenizer and LICENSE are carried over unchanged.
- Smoke-tested at the weight level (structure + decompression); not a quality benchmark.
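A structure check of this kind can be run on tensor names alone. The sketch below groups names by module and reports which modules carry all three NVFP4 tensors; the example names are synthetic, and reading the real names from the checkpoint (for instance with the safetensors library) is left out as an assumption about your environment.

```python
from collections import defaultdict

QUANT_SUFFIXES = {"weight_packed", "weight_scale", "weight_global_scale"}


def find_quantized_modules(tensor_names):
    """Return the modules that carry all three NVFP4 tensors, sorted by name."""
    leaves_by_module = defaultdict(set)
    for name in tensor_names:
        module, _, leaf = name.rpartition(".")
        leaves_by_module[module].add(leaf)
    return sorted(m for m, leaves in leaves_by_module.items() if QUANT_SUFFIXES <= leaves)


# Synthetic names for illustration; real ones come from the checkpoint's tensor index.
names = [f"layers.0.mlp.up_proj.{suffix}" for suffix in sorted(QUANT_SUFFIXES)]
names.append("lm_head.weight")
print(find_quantized_modules(names))  # ['layers.0.mlp.up_proj']
```

On the real checkpoint, the returned list would be expected to have 252 entries, all inside the language tower.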
Original model
See nvidia/LocateAnything-3B for architecture, intended use, capabilities and the NVIDIA license (inherited).