LocateAnything-3B-NVFP4A16

NVFP4 quantization of nvidia/LocateAnything-3B — a visual-grounding VLM (Qwen2.5-3B-Instruct LLM + Eagle/MoonViT vision encoder) for referring, detection, pointing and layout/OCR localization.

Variant: NVFP4 weight-only (W4A16), with 4-bit float weights, group size 16, per-group FP8 (e4m3) scales plus per-tensor FP32 global scales; activations stay BF16
Disk size: ~3.5 GB weights (vs ~7.2 GB BF16, ~2.05× smaller)
Quantized by: sahilchachra
Tooling: llm-compressor model_free_ptq (data-free, streaming PTQ; no calibration data)
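As a rough check on the sizes above, the storage cost per quantized weight can be worked out from the format description. This is a back-of-the-envelope sketch, not a measurement: it ignores the per-tensor FP32 global scale (one value per tensor, so negligible) and any file-format overhead.

```python
# Storage cost of one NVFP4 weight, from the format described above.
WEIGHT_BITS = 4    # FP4 payload per weight
SCALE_BITS = 8     # one FP8 (e4m3) scale shared by each group
GROUP_SIZE = 16    # weights per scale group
BF16_BITS = 16     # source precision

# Each weight carries its own 4 bits plus a 1/16 share of the group scale.
bits_per_weight = WEIGHT_BITS + SCALE_BITS / GROUP_SIZE
compression_vs_bf16 = BF16_BITS / bits_per_weight

print(f"bits per quantized weight: {bits_per_weight}")              # 4.5
print(f"compression vs BF16 (quantized layers only): {compression_vs_bf16:.2f}x")  # 3.56x
```

The quantized layers alone shrink by about 3.56×; the whole-checkpoint figure of ~2.05× is lower because the vision encoder, connector, embeddings, lm_head and norms remain BF16.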

Note on what is quantized: only the language tower's linear weights are taken to NVFP4, namely the 36 Qwen2 decoder layers (self_attn.{q,k,v,o}_proj and mlp.{gate,up,down}_proj), i.e. 36 × 7 = 252 modules. The vision encoder (vision_model.*), the vision→LLM connector (mlp1.*), the token embeddings, lm_head and all norms stay at the source dtype (BF16). The headline variant name reflects the LM-tower quantization, while the on-disk size reflects the mix of NVFP4 and BF16 parts of the model.
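The module count can be reproduced by enumerating the projection names per decoder layer. In this sketch the `language_model.model.layers` prefix is a hypothetical placeholder; the actual prefix depends on the repo's bundled modeling code and is not stated here.

```python
NUM_DECODER_LAYERS = 36
ATTENTION_PROJECTIONS = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_PROJECTIONS = ["gate_proj", "up_proj", "down_proj"]


def quantized_module_names(prefix="language_model.model.layers"):
    """List the linear modules taken to NVFP4 (prefix is an assumed placeholder)."""
    names = []
    for layer in range(NUM_DECODER_LAYERS):
        for proj in ATTENTION_PROJECTIONS:
            names.append(f"{prefix}.{layer}.self_attn.{proj}")
        for proj in MLP_PROJECTIONS:
            names.append(f"{prefix}.{layer}.mlp.{proj}")
    return names


names = quantized_module_names()
print(len(names))  # 252 = 36 layers x (4 attention + 3 MLP projections)
```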

Verification (smoke test)

This is a custom-code architecture (trust_remote_code); the checkpoint was verified at the weight level:

Structure: 252 LM-tower modules carry weight_packed (uint8 FP4) + weight_scale (FP8 e4m3, per group of 16) + weight_global_scale (FP32); the vision tower, connector, embeddings and lm_head remain BF16 and byte-identical to the source.
NVFP4 decompression round-trip: dequantizing the packed LM weights reproduces the originals with ~9% per-element mean relative error, the expected fidelity of FP4 (error averaging across each matmul should make the model-level impact much smaller, though that was not measured; activations stay BF16).
Format: nvfp4-pack-quantized (compressed-tensors), standard per-module layout.
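The round-trip check described above can be illustrated on toy data. The following is a minimal sketch, not the script used for this checkpoint. It makes three assumptions: the weights are synthetic Gaussian values, the per-group scale is kept as a plain float rather than rounded to FP8 (and the global scale is omitted), and "mean relative error" is taken to mean mean |error| divided by mean |weight|, which is one plausible reading of the metric.

```python
import random

# Magnitudes representable in FP4 E2M1 (the sign is stored separately).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def round_to_fp4(x):
    """Round x to the nearest FP4 E2M1 value, keeping its sign."""
    magnitude = min(FP4_E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return magnitude if x >= 0 else -magnitude


def fake_quantize_group(group):
    """Quantize then dequantize one group with a shared scale."""
    amax = max(abs(v) for v in group)
    if amax == 0.0:
        return list(group)
    scale = amax / FP4_E2M1_GRID[-1]  # map the group's largest value to 6.0
    return [round_to_fp4(v / scale) * scale for v in group]


def mean_relative_error(weights, group_size=GROUP_SIZE):
    """mean |w - w_hat| / mean |w| over all weights (assumed metric)."""
    restored = []
    for start in range(0, len(weights), group_size):
        restored.extend(fake_quantize_group(weights[start:start + group_size]))
    abs_err = sum(abs(w - r) for w, r in zip(weights, restored))
    abs_ref = sum(abs(w) for w in weights)
    return abs_err / abs_ref


random.seed(0)
toy_weights = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE * 2048)]
rel_err = mean_relative_error(toy_weights)
print(f"mean |error| / mean |weight| = {rel_err:.3f}")
```

Running the same metric against real tensors would require dequantizing weight_packed with weight_scale and weight_global_scale through compressed-tensors rather than this toy quantizer.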

Quantization was performed on an NVIDIA Thor (Blackwell, native NVFP4; 14-core ARM aarch64, 122 GB unified memory, JetPack/L4T R38.4, CUDA 13.0).

For inference, use the base model's runtime: it ships custom modeling code and (per the upstream repo) expects a pinned transformers version for compatibility. NVFP4 weights need a runtime with compressed-tensors support on a Blackwell GPU. This checkpoint has not been formally benchmarked for quality.

Usage

Load with trust_remote_code (the architecture is defined in the repo's bundled modeling files):

from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "sahilchachra/LocateAnything-3B-NVFP4A16",
    trust_remote_code=True,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(
    "sahilchachra/LocateAnything-3B-NVFP4A16", trust_remote_code=True
)

See nvidia/LocateAnything-3B for the full prompt format and grounding examples (output boxes are normalized to 0..1000).
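Because boxes come back on a 0..1000 scale, they need rescaling before being drawn on an image. The helper below is a small sketch; `denormalize_box` is a hypothetical name, and the (x1, y1, x2, y2) corner ordering is an assumption that should be checked against the upstream prompt-format documentation.

```python
def denormalize_box(box, image_width, image_height, norm=1000):
    """Convert a box on the 0..norm scale to pixel coordinates.

    Assumes box = (x1, y1, x2, y2); verify the ordering upstream.
    """
    x1, y1, x2, y2 = box
    return (
        x1 * image_width / norm,
        y1 * image_height / norm,
        x2 * image_width / norm,
        y2 * image_height / norm,
    )


# Example: a box covering the left half of a 640x480 image.
print(denormalize_box((0, 0, 500, 1000), 640, 480))  # (0.0, 0.0, 320.0, 480.0)
```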

Notes

Weight-only NVFP4 (W4A16): LM-tower weights are 4-bit; activations and all other modules remain BF16.
All upstream custom code, processor, tokenizer and LICENSE are carried over unchanged.
Smoke-tested at the weight level (structure + decompression); not a quality benchmark.
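The structural half of such a smoke test can be sketched as a pass over the checkpoint's tensor names, checking that each quantized module carries all three expected tensors. This is an illustrative sketch rather than the script that was run: `find_quantized_modules` is a hypothetical helper, the example names use a placeholder prefix, and how the names are obtained (for instance from the safetensors files) is left out.

```python
from collections import defaultdict

# Tensors each NVFP4 module is expected to carry, per the Verification section.
REQUIRED_SUFFIXES = {"weight_packed", "weight_scale", "weight_global_scale"}


def find_quantized_modules(tensor_names):
    """Split modules into those with all three NVFP4 tensors and those missing some."""
    found = defaultdict(set)
    for name in tensor_names:
        module, _, suffix = name.rpartition(".")
        if suffix in REQUIRED_SUFFIXES:
            found[module].add(suffix)
    complete = sorted(m for m, s in found.items() if s == REQUIRED_SUFFIXES)
    incomplete = sorted(m for m, s in found.items() if s != REQUIRED_SUFFIXES)
    return complete, incomplete


# Toy input with a placeholder prefix; a real run would expect 252 complete modules.
example_names = [
    "layers.0.self_attn.q_proj.weight_packed",
    "layers.0.self_attn.q_proj.weight_scale",
    "layers.0.self_attn.q_proj.weight_global_scale",
    "layers.0.input_layernorm.weight",        # unquantized, ignored
    "layers.0.mlp.up_proj.weight_packed",     # missing its scales
]
complete, incomplete = find_quantized_modules(example_names)
print(complete)
print(incomplete)
```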