rskill-locateanything-3b-nf4

OpenRAL rSkill: NVIDIA LocateAnything-3B packaged as an NF4 bitsandbytes PyTorch detector. It localizes queried objects, text, GUI elements, and points in RGB images and emits perception results only; it drives no actuators. The wrapped upstream weights are NVIDIA non-commercial research/evaluation weights.

Quick Start

OPENRAL_ACCEPT_NONCOMMERCIAL=1 ral skill install hf://OpenRAL/rskill-locateanything-3b-nf4

import os

os.environ["OPENRAL_ACCEPT_NONCOMMERCIAL"] = "1"
os.environ["OPENRAL_ALLOW_REMOTE_CODE"] = "1"

from openral_core.schemas import RSkillManifest

manifest = RSkillManifest.from_yaml("rskills/locateanything-3b-nf4/rskill.yaml")
assert manifest.kind == "detector"
assert manifest.quantization.extra["scheme"] == "nf4"

The upstream model uses Transformers custom code (trust_remote_code=True). OpenRAL should execute it only after the operator has accepted the remote-code risk for this specific package.
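The acceptance step described above could be enforced with a small pre-flight check. This is a minimal sketch, not OpenRAL's implementation: the two environment variable names come from the Quick Start, while `missing_gates` and `require_gates` are hypothetical helpers introduced here for illustration.

```python
import os
from typing import Mapping

# Variable names are taken from the Quick Start; the helpers below are
# hypothetical and not part of OpenRAL.
REQUIRED_GATES = ("OPENRAL_ACCEPT_NONCOMMERCIAL", "OPENRAL_ALLOW_REMOTE_CODE")


def missing_gates(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the gate variables that are not set to "1"."""
    return [name for name in REQUIRED_GATES if env.get(name) != "1"]


def require_gates(env: Mapping[str, str] = os.environ) -> None:
    """Raise if any gate is missing; call this before loading remote code."""
    missing = missing_gates(env)
    if missing:
        raise PermissionError("operator has not accepted: " + ", ".join(missing))
```

Calling `require_gates()` before any `from_pretrained(..., trust_remote_code=True)` call keeps the refusal explicit and names the gate that is still unset.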

What It Does

LocateAnything is an open-vocabulary visual-grounding model. Given an RGB image and a natural-language query, it can return structured coordinate tokens for object detection, phrase grounding, dense detection, scene text localization, GUI element grounding, and point localization. The model card describes a hybrid mode that combines parallel box decoding with autoregressive fallback for format irregularity or spatial ambiguity.

This rSkill declares kind: detector because it is a pure perception producer: it consumes camera frames and text/object queries, emits localization metadata, and never drives ros2_control joints.

Runtime Status

The package is intentionally marked runtime: pytorch with NF4 metadata because OpenRAL's current ADR-0037 camera-tee detector runner is ONNX/TensorRT-oriented, while upstream LocateAnything is a Transformers custom-code model. The HF rSkill repo therefore contains the quantized PyTorch weights and the upstream custom-code sidecars needed by AutoModel.from_pretrained(..., trust_remote_code=True).

A future OpenRAL adapter still needs to bridge LocateAnything's text output (<box> x1, y1, x2, y2 </box> and point tokens) into ObjectsMetadata before this package can serve as a drop-in replacement for the RT-DETR ONNX detector rSkills. Until then, it is a detector artifact and manifest, not a validated camera-tee runtime path.
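The parsing half of such an adapter might look like the sketch below. It assumes the box format is exactly the `<box> x1, y1, x2, y2 </box>` pattern quoted above with plain decimal numbers. The `Box` dataclass and `parse_boxes` are hypothetical stand-ins, since the ObjectsMetadata schema is not shown here, and point tokens are left out because their format is not specified in this README.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for one ObjectsMetadata entry."""
    x1: float
    y1: float
    x2: float
    y2: float


# Assumption: coordinates are plain decimal numbers separated by commas.
_NUM = r"(-?\d+(?:\.\d+)?)"
_BOX_RE = re.compile(r"<box>\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*</box>")


def parse_boxes(text: str) -> list[Box]:
    """Extract well-formed boxes and skip malformed or inverted ones."""
    boxes = []
    for match in _BOX_RE.finditer(text):
        x1, y1, x2, y2 = (float(group) for group in match.groups())
        if x2 >= x1 and y2 >= y1:
            boxes.append(Box(x1, y1, x2, y2))
    return boxes
```

Silently dropping malformed spans is a design choice for a sketch; a real adapter would probably count or log them so that format drift in the model output is visible.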

Upstream Model And Training

| Field | Value |
| --- | --- |
| Source repo | nvidia/LocateAnything-3B |
| Source revision | 7a81d810571dc5f244b2f0b6868128f24b1cbd85 |
| Paper | arxiv:2605.27365 |
| Architecture | LocateAnythingForConditionalGeneration; MoonViT vision tower plus Qwen2.5-3B text backbone |
| Runtime | Transformers custom code, BF16 upstream; this rSkill ships NF4 packed weights |
| Training scale | 12M unique images, about 140M natural-language queries, and 785M bounding boxes per upstream card |
| License | NVIDIA License, non-commercial research/evaluation use |

The upstream card lists supported tasks including general object detection, dense object detection, referring-expression grounding, scene text detection, layout/OCR grounding, GUI grounding, and point localization.

Supported Robots And Embodiments

This detector is embodiment-agnostic. The only declared hardware requirement is an RGB camera stream of at least 640 x 480. All in-tree OpenRAL embodiment tags are listed in rskill.yaml so robots with a compatible RGB sensor can install the package once a PyTorch detector adapter is available.

Sensors And Observation Contract

| Direction | Key | Modality | Shape / format | Notes |
| --- | --- | --- | --- | --- |
| in | any RGB camera | RGB image | min 640 x 480 | vla_feature_key is intentionally omitted |
| in | detector query | text | natural language | object names, referring phrases, OCR/layout queries, GUI targets, point targets |
| preprocessing | LocateAnythingProcessor | image + text | dynamic visual tokens | upstream processor uses 14 px patches, 2 x 2 merge kernel, and 25,600 token input limit |
| out | localization tokens | text | boxes or points | adapter must parse coordinate tokens into OpenRAL ObjectsMetadata |

The model emits no action chunks and has no proprioception contract.
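For a rough sense of the visual token budget, the preprocessing figures listed above (14 px patches, 2 x 2 merge kernel, 25,600 token limit) can be turned into a back-of-the-envelope estimate. This sketch rests on two assumptions that the README does not confirm: that each image side is padded up to a multiple of 28 px, and that the limit applies to the patch count before merging. `visual_token_estimate` is a hypothetical helper, not part of LocateAnythingProcessor.

```python
PATCH_PX = 14          # patch size listed above
MERGE = 2              # 2 x 2 merge kernel listed above
TOKEN_LIMIT = 25_600   # input token limit listed above


def visual_token_estimate(width: int, height: int) -> dict:
    """Estimate patch and merged-token counts for one image."""
    # Assumption: each side is padded up to a multiple of PATCH_PX * MERGE.
    cell = PATCH_PX * MERGE
    padded_w = (width + cell - 1) // cell * cell
    padded_h = (height + cell - 1) // cell * cell
    patches = (padded_w // PATCH_PX) * (padded_h // PATCH_PX)
    return {
        "padded": (padded_w, padded_h),
        "patches": patches,
        "merged_tokens": patches // (MERGE * MERGE),
        # Assumption: the limit is checked against pre-merge patches.
        "within_limit": patches <= TOKEN_LIMIT,
    }
```

Under these assumptions the minimum 640 x 480 frame pads to 644 x 504, giving 1,656 patches and 414 merged tokens, well inside the limit. The upstream processor remains the authority on actual resizing behavior.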

Manifest Summary

| Field | Value |
| --- | --- |
| name | OpenRAL/rskill-locateanything-3b-nf4 |
| version | 0.1.0 |
| license | nvidia_non_commercial |
| role / kind | s1 / detector |
| runtime | pytorch |
| quantization.dtype | int4 |
| quantization.extra.scheme | nf4 |
| weights_uri | hf://OpenRAL/rskill-locateanything-3b-nf4 |
| source_repo | hf://nvidia/LocateAnything-3B@7a81d810571dc5f244b2f0b6868128f24b1cbd85 |
| latency_budget.per_chunk_ms | 1000 ms |
| actions | detect |

The published HF repo includes model.safetensors, quantization_metadata.json, upstream tokenizer/processor/config sidecars, and the upstream custom-code files required by Transformers.
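The hf:// URIs in the manifest carry a repo id and, for source_repo, a pinned revision after the @ sign. A consumer could split them as in the sketch below. `HfRef` and `parse_hf_uri` are hypothetical names, and the parsing rule is inferred only from the two URIs shown in the manifest summary, not from an OpenRAL specification.

```python
from typing import NamedTuple, Optional


class HfRef(NamedTuple):
    repo_id: str
    revision: Optional[str]


def parse_hf_uri(uri: str) -> HfRef:
    """Split 'hf://org/name[@revision]' into a repo id and optional revision."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    repo_id, _, revision = uri[len(prefix):].partition("@")
    # Assumption: repo ids always have the form 'org/name'.
    if repo_id.count("/") != 1 or not all(repo_id.split("/")):
        raise ValueError(f"expected 'org/name' in {uri!r}")
    return HfRef(repo_id, revision or None)
```

The resulting pair maps onto the `revision` argument that Transformers' `from_pretrained` accepts, which keeps the executed custom code pinned to the audited commit.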

Quantization

NF4 packing follows the OpenRAL bitsandbytes rule used by tools/quantize_rskill.py: large Linear modules with at least 4,000,000 weight elements are rewritten to bnb.nn.Linear4bit with BF16 compute, while smaller heads stay in BF16. The manifest records this as dtype: int4 plus extra.scheme: nf4 because the schema enum represents storage dtype rather than bitsandbytes' named quantization scheme.
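The size rule above can be illustrated without loading any weights. The sketch below only encodes the stated threshold and a rough storage estimate. It is not the logic of tools/quantize_rskill.py, the helper names are hypothetical, the layer shapes in the usage note are illustrative, and the NF4 estimate ignores the per-block quantization constants that bitsandbytes stores alongside the packed weights.

```python
MIN_NF4_ELEMENTS = 4_000_000  # threshold stated above


def should_pack_nf4(in_features: int, out_features: int) -> bool:
    """True when a Linear weight is large enough to be rewritten to 4-bit."""
    return in_features * out_features >= MIN_NF4_ELEMENTS


def approx_weight_bytes(in_features: int, out_features: int) -> int:
    """Rough weight storage: 4 bits per element if packed, else BF16."""
    n = in_features * out_features
    if should_pack_nf4(in_features, out_features):
        return (n + 1) // 2  # two 4-bit values per byte, constants excluded
    return n * 2             # BF16 uses two bytes per element
```

With illustrative shapes, a 2048 x 2048 projection (4,194,304 elements) would be packed to about 2 MiB, while a 2048 x 256 projection (524,288 elements) would stay in BF16 at 1 MiB.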

Reproduce from this worktree:

cd /home/allopart/workspace/openral-feat-more-rskills
OPENRAL_TRUSTED_REMOTE_CODE_ORGS=nvidia \
  /home/allopart/workspace/openral/.venv/bin/python tools/quantize_rskill.py \
  --source nvidia/LocateAnything-3B \
  --target OpenRAL/rskill-locateanything-3b-nf4 \
  --loader transformers \
  --trust-remote-code \
  --scheme nf4 \
  --device cuda \
  --skip-upload \
  --keep-temp

Use the private-only tools/rskill_publisher.py path, or an equivalent HfApi.create_repo(..., private=True) plus upload_folder, for publication. The quantizer's generic upload helper is not used for this package because this rSkill must remain private in the OpenRAL organization.
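The "equivalent" path mentioned above might be sketched as follows. This is not tools/rskill_publisher.py. `publish_private` is a hypothetical wrapper around the `huggingface_hub` calls named in the text, and the `api` parameter exists only so the function can be exercised with a stub instead of a network call.

```python
def publish_private(folder: str, repo_id: str, api=None) -> None:
    """Create the repo as private (if missing), then upload the folder."""
    if api is None:
        # Imported lazily so the sketch can be read and tested offline.
        from huggingface_hub import HfApi
        api = HfApi()
    # private=True only applies when the repo is created by this call;
    # an existing repo's visibility should be checked separately.
    api.create_repo(repo_id=repo_id, repo_type="model", private=True, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="model")
```

Because `exist_ok=True` does not change the visibility of a repo that already exists, a stricter publisher would verify that the target is private before uploading.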

License

The rSkill package metadata and README are OpenRAL project files. The wrapped LocateAnything weights and upstream custom-code sidecars are governed by the NVIDIA License from nvidia/LocateAnything-3B; Section 3.3 limits use to non-commercial research or evaluation except for NVIDIA and its affiliates. OpenRAL should require explicit non-commercial acceptance before install or activation.
