rskill-locateanything-3b-nf4
OpenRAL rSkill - NVIDIA LocateAnything-3B packaged as an NF4 bitsandbytes PyTorch detector rSkill. It localizes queried objects, text, GUI elements, and points from RGB images and emits perception results only. No actuators. The wrapped upstream weights are NVIDIA non-commercial research/evaluation weights.
Quick Start
OPENRAL_ACCEPT_NONCOMMERCIAL=1 ral skill install hf://OpenRAL/rskill-locateanything-3b-nf4
import os
os.environ["OPENRAL_ACCEPT_NONCOMMERCIAL"] = "1"
os.environ["OPENRAL_ALLOW_REMOTE_CODE"] = "1"
from openral_core.schemas import RSkillManifest
manifest = RSkillManifest.from_yaml("rskills/locateanything-3b-nf4/rskill.yaml")
assert manifest.kind == "detector"
assert manifest.quantization.extra["scheme"] == "nf4"
The upstream model uses Transformers custom code (trust_remote_code=True).
OpenRAL should execute it only after the operator has accepted the remote-code
risk for this specific package.
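The opt-in described above can be sketched as a small gate that runs before any install or load step. This is a minimal illustration, not OpenRAL's actual implementation: the two variable names come from the Quick Start, but the helper names `missing_consent` and `check_operator_consent`, and the rule that only the literal value "1" counts as consent, are assumptions.

```python
import os
from typing import List, Mapping, Optional

# Variable names are taken from the Quick Start above.
# Treating only the literal "1" as consent is an assumption of this sketch.
REQUIRED_FLAGS = {
    "OPENRAL_ACCEPT_NONCOMMERCIAL": "non-commercial license terms",
    "OPENRAL_ALLOW_REMOTE_CODE": "execution of upstream Transformers custom code",
}


def missing_consent(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required opt-in flags that are not set to "1" (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_FLAGS if env.get(name) != "1"]


def check_operator_consent(env: Optional[Mapping[str, str]] = None) -> None:
    """Raise PermissionError unless every required flag has been accepted."""
    missing = missing_consent(env)
    if missing:
        details = ", ".join(f"{name} ({REQUIRED_FLAGS[name]})" for name in missing)
        raise PermissionError(f"operator consent missing: {details}")
```

Calling `check_operator_consent()` with no argument reads the process environment. Passing a mapping makes the gate easy to exercise in tests without mutating `os.environ`.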
What It Does
LocateAnything is an open-vocabulary visual-grounding model. Given an RGB image and a natural-language query, it can return structured coordinate tokens for object detection, phrase grounding, dense detection, scene text localization, GUI element grounding, and point localization. The model card describes a hybrid mode that combines parallel box decoding with autoregressive fallback for format irregularity or spatial ambiguity.
This rSkill declares kind: detector because it is a pure perception producer:
it consumes camera frames and text/object queries, emits localization metadata,
and never drives ros2_control joints.
Runtime Status
The package is intentionally marked runtime: pytorch with NF4 metadata.
OpenRAL's current ADR-0037 camera-tee detector runner is ONNX/TensorRT-oriented,
while upstream LocateAnything is a Transformers custom-code model, so the HF
rSkill repo contains the quantized PyTorch weights and the upstream custom-code
sidecars needed by AutoModel.from_pretrained(..., trust_remote_code=True).
A future OpenRAL adapter still needs to bridge LocateAnything's text output
(<box> x1, y1, x2, y2 </box> and point tokens) into ObjectsMetadata before
this can be used as a drop-in replacement for the RT-DETR ONNX detector rSkills.
Until then, this package is a detector artifact and manifest, not a validated
camera-tee runtime path.
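The parsing half of that adapter could look roughly like the sketch below. It assumes the box format is exactly the `<box> x1, y1, x2, y2 </box>` pattern quoted above, with plain decimal numbers. The `Box` dataclass and `parse_boxes` function are hypothetical stand-ins, not the real ObjectsMetadata schema. Point tokens are not handled because their textual format is not given here, and the coordinate frame (pixels or normalized units) is left untouched because it is not stated either.

```python
import re
from dataclasses import dataclass
from typing import List

# One signed decimal number, captured.
_NUM = r"(-?\d+(?:\.\d+)?)"
# Assumed pattern: <box> x1, y1, x2, y2 </box> with optional whitespace.
_BOX_RE = re.compile(r"<box>\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*</box>")


@dataclass(frozen=True)
class Box:
    """Hypothetical container; the real ObjectsMetadata layout may differ."""
    x1: float
    y1: float
    x2: float
    y2: float


def parse_boxes(text: str) -> List[Box]:
    """Extract well-formed, non-degenerate boxes from model output text."""
    boxes: List[Box] = []
    for match in _BOX_RE.finditer(text):
        x1, y1, x2, y2 = (float(group) for group in match.groups())
        # Drop inverted or zero-area boxes rather than guessing a repair.
        if x2 <= x1 or y2 <= y1:
            continue
        boxes.append(Box(x1, y1, x2, y2))
    return boxes
```

Malformed spans, such as a box with only three numbers, are skipped rather than repaired. A production adapter would probably want to log those cases, since the upstream card mentions format irregularity as a known failure mode.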
Upstream Model And Training
| Field | Value |
|---|---|
| Source repo | nvidia/LocateAnything-3B |
| Source revision | 7a81d810571dc5f244b2f0b6868128f24b1cbd85 |
| Paper | arxiv:2605.27365 |
| Architecture | LocateAnythingForConditionalGeneration; MoonViT vision tower plus Qwen2.5-3B text backbone |
| Runtime | Transformers custom code, BF16 upstream; this rSkill ships NF4 packed weights |
| Training scale | 12M unique images, about 140M natural-language queries, and 785M bounding boxes per upstream card |
| License | NVIDIA License, non-commercial research/evaluation use |
The upstream card lists supported tasks including general object detection, dense object detection, referring-expression grounding, scene text detection, layout/OCR grounding, GUI grounding, and point localization.
Supported Robots And Embodiments
This detector is embodiment-agnostic. The only declared hardware requirement is
an RGB camera stream of at least 640 x 480. All in-tree OpenRAL embodiment tags
are listed in rskill.yaml so robots with a compatible RGB sensor can install
the package once a PyTorch detector adapter is available.
Sensors And Observation Contract
| Direction | Key | Modality | Shape / format | Notes |
|---|---|---|---|---|
| in | any RGB camera | RGB image | min 640 x 480 | vla_feature_key is intentionally omitted |
| in | detector query | text | natural language | object names, referring phrases, OCR/layout queries, GUI targets, point targets |
| preprocessing | LocateAnythingProcessor | image + text | dynamic visual tokens | upstream processor uses 14 px patches, 2 x 2 merge kernel, and 25,600 token input limit |
| out | localization tokens | text | boxes or points | adapter must parse coordinate tokens into OpenRAL ObjectsMetadata |
The model emits no action chunks and has no proprioception contract.
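To get a feel for the preprocessing numbers in the table, the sketch below estimates how many visual tokens an image of a given size would produce. It is a back-of-the-envelope aid, not the upstream processor's logic. It assumes each side is rounded up to a whole number of 28 px merge groups (14 px patches times a 2 x 2 merge). Whether the 25,600 limit counts patches before merging or tokens after merging is not stated here, so both are reported. The function name `estimate_visual_tokens` is hypothetical.

```python
PATCH_PX = 14         # patch edge in pixels, from the table above
MERGE = 2             # 2 x 2 merge kernel, from the table above
TOKEN_LIMIT = 25_600  # input token limit, from the table above


def estimate_visual_tokens(width: int, height: int) -> dict:
    """Estimate patch and merged-token counts for one RGB frame.

    Assumption: each side is padded up to a multiple of PATCH_PX * MERGE.
    The upstream processor may instead crop or rescale.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    group_px = PATCH_PX * MERGE
    groups_w = (width + group_px - 1) // group_px   # integer ceiling
    groups_h = (height + group_px - 1) // group_px
    merged = groups_w * groups_h
    patches = merged * MERGE * MERGE
    return {
        "patches": patches,
        "merged_tokens": merged,
        "patches_within_limit": patches <= TOKEN_LIMIT,
        "merged_within_limit": merged <= TOKEN_LIMIT,
    }
```

Under these assumptions a 640 x 480 frame is far below the limit on either reading, while a 4096 x 2160 frame exceeds it only if the limit counts pre-merge patches. That ambiguity is worth resolving against the upstream processor before sizing camera streams.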
Manifest Summary
| Field | Value |
|---|---|
| name | OpenRAL/rskill-locateanything-3b-nf4 |
| version | 0.1.0 |
| license | nvidia_non_commercial |
| role / kind | s1 / detector |
| runtime | pytorch |
| quantization.dtype | int4 |
| quantization.extra.scheme | nf4 |
| weights_uri | hf://OpenRAL/rskill-locateanything-3b-nf4 |
| source_repo | hf://nvidia/LocateAnything-3B@7a81d810571dc5f244b2f0b6868128f24b1cbd85 |
| latency_budget.per_chunk_ms | 1000 ms |
| actions | detect |
The published HF repo includes model.safetensors, quantization_metadata.json,
upstream tokenizer/processor/config sidecars, and the upstream custom-code files
required by Transformers.
Quantization
NF4 packing follows the OpenRAL bitsandbytes rule used by
tools/quantize_rskill.py: large Linear modules with at least 4,000,000
weight elements are rewritten to bnb.nn.Linear4bit with BF16 compute, while
smaller heads stay in BF16. The manifest records this as dtype: int4 plus
extra.scheme: nf4 because the schema enum represents storage dtype rather than
bitsandbytes' named quantization scheme.
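The size rule can be illustrated without loading the model. The sketch below partitions Linear layers by weight-element count using the 4,000,000 threshold quoted above. It is not the code in tools/quantize_rskill.py. The function name `split_by_nf4_rule` is hypothetical, and the layer names and shapes in the usage example are illustrative values, not read from the checkpoint.

```python
from typing import Dict, List, Tuple

MIN_NF4_ELEMENTS = 4_000_000  # threshold quoted above ("at least")


def split_by_nf4_rule(
    linear_shapes: Dict[str, Tuple[int, int]],
) -> Tuple[List[str], List[str]]:
    """Partition Linear layers into NF4 candidates and BF16 keepers.

    linear_shapes maps a module name to (out_features, in_features).
    Only weight elements are counted; biases are ignored.
    """
    nf4: List[str] = []
    bf16: List[str] = []
    for name, (out_features, in_features) in linear_shapes.items():
        if out_features * in_features >= MIN_NF4_ELEMENTS:
            nf4.append(name)
        else:
            bf16.append(name)
    return nf4, bf16


# Illustrative shapes only (hypothetical names, not taken from the checkpoint).
example = {
    "mlp.up_proj": (11008, 2048),  # 22,544,384 elements
    "attn.q_proj": (2048, 2048),   # 4,194,304 elements
    "attn.k_proj": (256, 2048),    # 524,288 elements
}
nf4_layers, bf16_layers = split_by_nf4_rule(example)
```

With PyTorch available, the shape map could be gathered by iterating `model.named_modules()`, keeping `torch.nn.Linear` instances, and reading each `weight.shape`. The actual replacement with `bnb.nn.Linear4bit` is left to the quantizer tool.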
Reproduce from this worktree:
cd /home/allopart/workspace/openral-feat-more-rskills
OPENRAL_TRUSTED_REMOTE_CODE_ORGS=nvidia \
/home/allopart/workspace/openral/.venv/bin/python tools/quantize_rskill.py \
--source nvidia/LocateAnything-3B \
--target OpenRAL/rskill-locateanything-3b-nf4 \
--loader transformers \
--trust-remote-code \
--scheme nf4 \
--device cuda \
--skip-upload \
--keep-temp
Use the private-only tools/rskill_publisher.py path, or an equivalent
HfApi.create_repo(..., private=True) plus upload_folder, for publication.
The quantizer's generic upload helper is not used for this package because this
rSkill must remain private in the OpenRAL organization.
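The "equivalent" path mentioned above could be sketched as follows, assuming the `huggingface_hub` package is installed and a token with write access to the organization is configured. The wrapper `publish_private` is a hypothetical helper, not tools/rskill_publisher.py. The `api` parameter exists only so the function can be exercised with a stand-in client.

```python
from typing import Any, Optional


def publish_private(folder_path: str, repo_id: str, api: Optional[Any] = None) -> str:
    """Create (or reuse) a private model repo and upload a local folder to it."""
    if api is None:
        # Imported lazily so the sketch can be tested without huggingface_hub.
        from huggingface_hub import HfApi

        api = HfApi()
    # private=True applies when the repo is created by this call.
    api.create_repo(repo_id=repo_id, repo_type="model", private=True, exist_ok=True)
    api.upload_folder(folder_path=folder_path, repo_id=repo_id, repo_type="model")
    return repo_id
```

Note that `exist_ok=True` only suppresses the error for a repo that already exists; it does not by itself change that repo's visibility. If the repo may already exist, check that it is private before uploading.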
License
The rSkill package metadata and README are OpenRAL project files. The wrapped
LocateAnything weights and upstream custom-code sidecars are governed by the
NVIDIA License from nvidia/LocateAnything-3B; Section 3.3 limits use to
non-commercial research or evaluation except for NVIDIA and its affiliates.
OpenRAL should require explicit non-commercial acceptance before install or
activation.