Laguna XS.2 - IQ2_XS GGUF (edge / Jetson Orin NX 16GB)
An aggressively quantized IQ2_XS GGUF of Laguna XS.2, produced for the Poolside / Prime Intellect Laguna hackathon so the model fits and runs onboard a Jetson Orin NX 16GB as a code-as-policy generator for a quadruped robot.
⚠️ Summary: at 2.36 BPW this quant is degraded. It is not a drop-in replacement for higher-precision Laguna. Its value is narrow and specific: it is the smallest Laguna we could get to run on a 16 GB edge device and still emit structurally correct code-as-policy when paired with a validate-and-repair harness (see "Intended use" and "Evaluation").
Demo
The robot is given the command "please come closer to me and show me a heart." The
onboard IQ2_XS model generates a Python policy(obs, robot), which is sandboxed,
validated, and executed: the Unitree Go2 walks toward the AprilTag, stops at the
configured distance, and performs the heart gesture.
Provenance
This is a derivative quantization, not a from-scratch conversion.
Lucebox/Laguna-XS.2-GGUF (BF16 GGUF, 63 GB) <- base / starting point
+ laguna-xs2.imatrix (180 MB importance matrix)
| llama.cpp quantize-only Laguna patch
v
laguna-xs2-IQ2_XS.gguf (9.3 GB, 2.36 BPW) <- this model
Licensed Apache-2.0, matching the base model Lucebox/Laguna-XS.2-GGUF (which derives from poolside/Laguna-XS.2, also Apache-2.0). Apache-2.0 permits redistribution of derivatives; attribution to the base model is preserved here per its terms.
Files
| File | Size | Notes |
|---|---|---|
| laguna-xs2-IQ2_XS.gguf | 9.3 GB | IQ2_XS, approx. 9420.53 MiB, 2.36 BPW |
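As a quick sanity check on the sizes quoted in this card, the shrink factor implied by the bit widths should roughly match the one implied by the file sizes. The sketch below uses only figures stated here (63 GB BF16 source, 9.3 GB quantized file, 2.36 BPW) plus the fact that BF16 stores 16 bits per weight; the function name is illustrative, and the result ignores metadata and any tensors kept at higher precision.

```python
BF16_BITS_PER_WEIGHT = 16.0   # BF16 is a 16-bit format
IQ2_XS_BPW = 2.36             # bits per weight reported for this quant
BF16_SIZE_GB = 63.0           # size of the BF16 GGUF listed under Provenance
QUANT_SIZE_GB = 9.3           # size of laguna-xs2-IQ2_XS.gguf


def compression_ratios():
    """Return (ratio implied by bit widths, ratio implied by file sizes)."""
    from_bits = BF16_BITS_PER_WEIGHT / IQ2_XS_BPW
    from_sizes = BF16_SIZE_GB / QUANT_SIZE_GB
    return from_bits, from_sizes


from_bits, from_sizes = compression_ratios()
print(f"by bit width: {from_bits:.2f}x, by file size: {from_sizes:.2f}x")
```

Both ratios come out near 6.8x, so the reported BPW and file sizes are consistent with each other.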
Quantization recipe
Built with a llama.cpp checkout patched only enough to quantize Laguna (it cannot run Laguna inference; use the Lucebox runtime below for that):
cd llama.cpp
./build-cuda13/bin/llama-quantize \
--imatrix models/laguna-xs2.imatrix \
models/laguna-xs2-bf16.gguf \
models/laguna-xs2-IQ2_XS.gguf \
IQ2_XS
Inference runtime
Inference is not done through stock llama.cpp. It uses Lucebox's Laguna runtime
(dflash_server), built per-target.
RTX 5090 (x86_64, sm_120, CUDA 13):
./server/build-5090/dflash_server laguna-xs2-IQ2_XS.gguf \
--host 127.0.0.1 --port 8000 \
--max-ctx 4096 --default-max-tokens 256 \
--hard-limit-reply-budget 0 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--model-name laguna-xs2-iq2
Jetson Orin NX 16GB (aarch64, sm_87, CUDA 12.6), build:
cmake -B server/build-orin -S server \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=87 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc \
-DDFLASH27B_ENABLE_BSA=OFF
cmake --build server/build-orin --target dflash_server -j4
Jetson Orin NX 16GB, run:
./server/build-orin/dflash_server laguna-xs2-IQ2_XS.gguf \
--host 0.0.0.0 --port 8000 \
--max-ctx 2048 --default-max-tokens 128 \
--hard-limit-reply-budget 0 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--model-name laguna-xs2-iq2
It serves an OpenAI-compatible /v1/chat/completions endpoint.
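A minimal sketch of calling that endpoint from Python with only the standard library. It assumes the server was started as shown above (port 8000, `--model-name laguna-xs2-iq2`) and that the runtime accepts the usual OpenAI chat-completions fields such as `max_tokens` and `temperature`; this card does not confirm which fields `dflash_server` honors. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumed from the run commands above: port 8000 and --model-name laguna-xs2-iq2.
BASE_URL = "http://127.0.0.1:8000"
MODEL_NAME = "laguna-xs2-iq2"


def build_chat_request(prompt, system=None, max_tokens=128, temperature=0.0):
    """Build an OpenAI-style chat-completions payload (hypothetical helper)."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {
        "model": MODEL_NAME,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(prompt, system=None, timeout=60.0, **kwargs):
    """POST the payload to the server and return the first choice's text."""
    payload = build_chat_request(prompt, system=system, **kwargs)
    request = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("please come closer to me and show me a heart")` would return the raw model text, which still has to go through validation before anything is executed.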
Measured footprint
| Platform | Server memory | Decode speed | Settings |
|---|---|---|---|
| RTX 5090 | approx. 10,284 MiB VRAM | - | --max-ctx 4096, q4_0 KV |
| Orin NX 16GB | approx. 12.7 GiB resident (11.8 of 15.6 GB system RAM used) | approx. 14.4 tok/s | --max-ctx 2048, q4_0 KV, approx. 52 °C, 7.6 W |
Fits in 16 GB with margin (about 3.5 GB free).
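To put the Orin numbers in practical terms, the short calculation below derives decode time for a full reply and energy per generated token from the table. It assumes the 7.6 W figure was measured during decoding and ignores prompt-processing time, neither of which this card states explicitly.

```python
DECODE_TOK_PER_S = 14.4  # Orin NX decode speed from the table
POWER_W = 7.6            # power figure from the table (assumed to be during decode)
MAX_TOKENS = 128         # --default-max-tokens in the Orin run command


def decode_seconds(n_tokens, tok_per_s=DECODE_TOK_PER_S):
    """Seconds spent decoding n_tokens at a steady rate."""
    return n_tokens / tok_per_s


def joules_per_token(power_w=POWER_W, tok_per_s=DECODE_TOK_PER_S):
    """Energy per generated token: watts divided by tokens per second."""
    return power_w / tok_per_s


print(f"{decode_seconds(MAX_TOKENS):.1f} s for {MAX_TOKENS} tokens")  # 8.9 s
print(f"{joules_per_token():.2f} J per token")                        # 0.53 J
```

A full 128-token reply therefore takes about 9 s of decoding, and a worst case of four full-length attempts in the repair loop described under "Evaluation" would take roughly 36 s.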
Intended use
Onboard code-as-policy generation for a robot: given a natural-language command,
the model writes a small Python policy(obs, robot) function that is then
AST-sandboxed, validated, and executed by a runtime that owns the robot SDK. See the
companion code repo (robot policy bridge + Unitree Go2 demo).
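The AST check described above could look roughly like the sketch below. It is not the companion repo's implementation: the allowlist, the forbidden node types, and the function name `validate_policy` are assumptions chosen for illustration, and a real sandbox would also restrict what `robot` exposes at execution time.

```python
import ast

# Assumption: a small allowlist of builtins; the real sandbox rules are not given here.
ALLOWED_CALLS = {"abs", "min", "max", "float", "int", "len", "range"}
FORBIDDEN_NODES = (
    ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal, ast.With,
    ast.AsyncWith, ast.Try, ast.Lambda, ast.ClassDef, ast.AsyncFunctionDef,
)


def validate_policy(source):
    """Return a list of problems; an empty list means the source passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems = []
    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    if len(tree.body) != 1 or len(funcs) != 1:
        problems.append("module must contain exactly one top-level function")
    else:
        fn = funcs[0]
        if fn.name != "policy" or [a.arg for a in fn.args.args] != ["obs", "robot"]:
            problems.append("expected signature policy(obs, robot)")

    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN_NODES):
            problems.append(f"forbidden construct: {type(node).__name__}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            problems.append(f"private attribute access: {node.attr}")
        elif isinstance(node, ast.Name) and node.id.startswith("__"):
            problems.append(f"dunder name: {node.id}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Method calls such as robot.walk(...) are attribute calls and are
            # left for the runtime that owns the robot SDK to police.
            if node.func.id not in ALLOWED_CALLS:
                problems.append(f"call to non-allowlisted function: {node.func.id}")
    return problems
```

Returning a list of problems instead of raising makes it easy to feed the first message back to the model on a retry.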
Evaluation
Task: from a command like "please come closer to me and show me a heart", emit a
valid policy(obs, robot) that approaches an AprilTag and triggers the heart gesture.
- Structure / intent: with a tight system prompt + 2 few-shot examples, the model reliably produces the right shape and intent (approach + correct stop reason).
- Raw single-shot validity on Orin: poor. Roughly 1 in 4 greedy attempts parses and passes the sandbox; the rest are corrupted (garbled tokens, unterminated strings). This is expected at 2.36 BPW with q4_0 KV cache.
- With the harness: a validate-and-repair loop (re-prompt with the parser error, up to 4 attempts, rising temperature) recovers to a valid policy in most runs; a deterministic fallback policy guarantees the system never stalls.
- End-to-end: verified on a real Unitree Go2 (see the demo video above). The robot approached the tag, stopped at the configured distance, and performed the heart gesture, both when the model succeeded and when the fallback engaged.
Takeaway: treat this quant as a component that needs a validation/repair wrapper, not as a standalone reliable code generator.
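The validate-and-repair loop described above can be sketched as follows. This is an illustration under stated assumptions, not the harness used in the demo: the temperature schedule (`base_temp`, `temp_step`), the repair prompt wording, and the fallback policy body (including `robot.stop()`) are hypothetical, and validation here is reduced to a parse check where the real harness also applies its sandbox rules.

```python
import ast

# Hypothetical deterministic fallback; the real one is not shown in this card.
FALLBACK_POLICY = "def policy(obs, robot):\n    robot.stop()\n"


def parse_error(source):
    """Return a parser error message, or None if the source parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"
    return None


def generate_policy(generate, command, max_attempts=4, base_temp=0.0, temp_step=0.2):
    """Try up to max_attempts generations, feeding the parser error back each time.

    `generate(prompt, temperature)` is any callable that returns source text.
    Returns (source, attempts_used, used_fallback).
    """
    prompt = command
    for attempt in range(max_attempts):
        temperature = base_temp + attempt * temp_step  # rises on each retry
        source = generate(prompt, temperature)
        error = parse_error(source)
        if error is None:
            return source, attempt + 1, False
        # Re-prompt with the parser error so the model can repair its output.
        prompt = (
            f"{command}\n\nYour previous answer failed to parse: {error}\n"
            "Return only a corrected policy(obs, robot) function."
        )
    # Every attempt failed: fall back so the system never stalls.
    return FALLBACK_POLICY, max_attempts, True
```

Passing the generator in as a callable keeps the loop independent of the serving runtime, so it can be exercised with a stub before being pointed at the model.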
Limitations
- 2.36 BPW degradation: frequent token corruption, no long-context reliability.
- Narrow validated task surface (a few intents over a fixed observation schema).
- Requires the Lucebox runtime; not compatible with stock llama.cpp inference.
