Laguna XS.2 β€” IQ2_XS GGUF (edge / Jetson Orin NX 16GB)

An aggressively quantized IQ2_XS GGUF of Laguna XS.2, produced for the Poolside / Prime Intellect Laguna hackathon so the model fits and runs onboard a Jetson Orin NX 16GB as a code-as-policy generator for a quadruped robot.

⚠️ Summary: at 2.36 BPW this quant is degraded. It is not a drop-in replacement for higher-precision Laguna. Its value is narrow and specific: it is the smallest Laguna we could get to run on a 16 GB edge device and still emit structurally correct code-as-policy when paired with a validate-and-repair harness (see "Intended use" and "Evaluation").

Demo

The robot is given the command "please come closer to me and show me a heart." The onboard IQ2_XS model generates a Python policy(obs, robot), which is sandboxed, validated, and executed: the Unitree Go2 walks toward the AprilTag, stops at the configured distance, and performs the heart gesture.

Provenance

This is a derivative quantization, not a from-scratch conversion.

Lucebox/Laguna-XS.2-GGUF  (BF16 GGUF, 63 GB)        <- base / starting point
        + laguna-xs2.imatrix  (180 MB importance matrix)
        | llama.cpp quantize-only Laguna patch
        v
laguna-xs2-IQ2_XS.gguf    (9.3 GB, 2.36 BPW)        <- this model

Licensed Apache-2.0, matching the base model Lucebox/Laguna-XS.2-GGUF (which derives from poolside/Laguna-XS.2, also Apache-2.0). Apache-2.0 permits redistribution of derivatives; attribution to the base model is preserved here per its terms.

Files

File Size Notes
laguna-xs2-IQ2_XS.gguf 9.3 GB IQ2_XS, approx. 9420.53 MiB, 2.36 BPW

Quantization recipe

Built with a llama.cpp checkout patched only enough to quantize Laguna (it cannot run Laguna inference β€” use the Lucebox runtime below for that):

cd llama.cpp
./build-cuda13/bin/llama-quantize \
  --imatrix models/laguna-xs2.imatrix \
  models/laguna-xs2-bf16.gguf \
  models/laguna-xs2-IQ2_XS.gguf \
  IQ2_XS

Inference runtime

Inference is not done through stock llama.cpp. It uses Lucebox's Laguna runtime (dflash_server), built per-target.

RTX 5090 (x86_64, sm_120, CUDA 13):

./server/build-5090/dflash_server laguna-xs2-IQ2_XS.gguf \
  --host 127.0.0.1 --port 8000 \
  --max-ctx 4096 --default-max-tokens 256 \
  --hard-limit-reply-budget 0 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --model-name laguna-xs2-iq2

Jetson Orin NX 16GB (aarch64, sm_87, CUDA 12.6) β€” build:

cmake -B server/build-orin -S server \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=87 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc \
  -DDFLASH27B_ENABLE_BSA=OFF
cmake --build server/build-orin --target dflash_server -j4

Orin β€” run:

./server/build-orin/dflash_server laguna-xs2-IQ2_XS.gguf \
  --host 0.0.0.0 --port 8000 \
  --max-ctx 2048 --default-max-tokens 128 \
  --hard-limit-reply-budget 0 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --model-name laguna-xs2-iq2

It serves an OpenAI-compatible /v1/chat/completions endpoint.

Measured footprint

Platform Server memory Decode speed Settings
RTX 5090 approx. 10,284 MiB VRAM β€” --max-ctx 4096, q4_0 KV
Orin NX 16GB approx. 12.7 GiB resident (11.8 of 15.6 GB system RAM used) approx. 14.4 tok/s --max-ctx 2048, q4_0 KV, approx. 52 Β°C, 7.6 W

Fits 16 GB with margin (about 3.5 GB free).

Intended use

Unitree Go2 with Jetson Orin NX

Onboard code-as-policy generation for a robot: given a natural-language command, the model writes a small Python policy(obs, robot) function that is then AST-sandboxed, validated, and executed by a runtime that owns the robot SDK. See the companion code repo (robot policy bridge + Unitree Go2 demo).

Evaluation

Task: from a command like "please come closer to me and show me a heart", emit a valid policy(obs, robot) that approaches an AprilTag and triggers the heart gesture.

  • Structure / intent: with a tight system prompt + 2 few-shot examples, the model reliably produces the right shape and intent (approach + correct stop reason).
  • Raw single-shot validity on Orin: poor β€” roughly 1 in 4 greedy attempts parses and passes the sandbox; the rest are corrupted (garbled tokens, unterminated strings). This is expected at 2.36 BPW with q4_0 KV cache.
  • With the harness: a validate-and-repair loop (re-prompt with the parser error, up to 4 attempts, rising temperature) recovers to a valid policy in most runs; a deterministic fallback policy guarantees the system never stalls.
  • End-to-end: verified on a real Unitree Go2 (see the demo video above) β€” the robot approached the tag, stopped at the configured distance, and performed the heart gesture, both when the model succeeded and when the fallback engaged.

Takeaway: treat this quant as a component that needs a validation/repair wrapper, not as a standalone reliable code generator.

Limitations

  • 2.36 BPW degradation: frequent token corruption, no long-context reliability.
  • Narrow validated task surface (a few intents over a fixed observation schema).
  • Requires the Lucebox runtime; not compatible with stock llama.cpp inference.
Downloads last month
78
GGUF
Model size
33B params
Architecture
laguna
Hardware compatibility
Log In to add your hardware

2-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for poolside-laguna-hackathon/laguna-xs2-IQ2_XS

Quantized
(1)
this model