Laguna XS.2 — IQ2_XS GGUF (edge / Jetson Orin NX 16GB)

An aggressively quantized IQ2_XS GGUF of Laguna XS.2, produced for the Poolside / Prime Intellect Laguna hackathon so the model fits and runs onboard a Jetson Orin NX 16GB as a code-as-policy generator for a quadruped robot.

⚠️ Summary: at 2.36 BPW this quant is degraded. It is not a drop-in replacement for higher-precision Laguna. Its value is narrow and specific: it is the smallest Laguna we could get to run on a 16 GB edge device and still emit structurally correct code-as-policy when paired with a validate-and-repair harness (see "Intended use" and "Evaluation").

Demo

The robot is given the command "please come closer to me and show me a heart." The onboard IQ2_XS model generates a Python policy(obs, robot) function, which is sandboxed, validated, and executed. The Unitree Go2 then walks toward the AprilTag, stops at the configured distance, and performs the heart gesture.

Provenance

This is a derivative quantization, not a from-scratch conversion.

Lucebox/Laguna-XS.2-GGUF  (BF16 GGUF, 63 GB)        <- base / starting point
        + laguna-xs2.imatrix  (180 MB importance matrix)
        | llama.cpp quantize-only Laguna patch
        v
laguna-xs2-IQ2_XS.gguf    (9.3 GB, 2.36 BPW)        <- this model

Licensed Apache-2.0, matching the base model Lucebox/Laguna-XS.2-GGUF (which derives from poolside/Laguna-XS.2, also Apache-2.0). Apache-2.0 permits redistribution of derivatives; attribution to the base model is preserved here per its terms.

Files

| File | Size | Notes |
| --- | --- | --- |
| `laguna-xs2-IQ2_XS.gguf` | 9.3 GB | IQ2_XS, approx. 9420.53 MiB, 2.36 BPW |
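As a rough sanity check on the reported figures, the size reduction from the BF16 base should be close to the ratio of bits per weight. The sketch below uses only the numbers quoted in this card and assumes the BF16 file is dominated by 16-bit weights (metadata and any non-quantized tensors are ignored).

```python
# Figures quoted in the Provenance diagram and the Files table.
BF16_SIZE_GB = 63.0     # base BF16 GGUF
IQ2_XS_SIZE_GB = 9.3    # this model
BF16_BITS = 16.0        # bits per weight in BF16
IQ2_XS_BPW = 2.36       # reported bits per weight for this quant

# Ratio implied by the file sizes vs. ratio implied by bits per weight.
size_ratio = BF16_SIZE_GB / IQ2_XS_SIZE_GB
bits_ratio = BF16_BITS / IQ2_XS_BPW

print(f"size ratio: {size_ratio:.2f}x")
print(f"bits ratio: {bits_ratio:.2f}x")
```

Both ratios come out near 6.8x, so the 9.3 GB size and the 2.36 BPW figure are consistent with each other.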

Quantization recipe

Built with a llama.cpp checkout patched only enough to quantize Laguna (it cannot run Laguna inference — use the Lucebox runtime below for that):

cd llama.cpp
./build-cuda13/bin/llama-quantize \
  --imatrix models/laguna-xs2.imatrix \
  models/laguna-xs2-bf16.gguf \
  models/laguna-xs2-IQ2_XS.gguf \
  IQ2_XS

Inference runtime

Inference is not done through stock llama.cpp. It uses Lucebox's Laguna runtime (dflash_server), built per-target.

RTX 5090 (x86_64, sm_120, CUDA 13):

./server/build-5090/dflash_server laguna-xs2-IQ2_XS.gguf \
  --host 127.0.0.1 --port 8000 \
  --max-ctx 4096 --default-max-tokens 256 \
  --hard-limit-reply-budget 0 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --model-name laguna-xs2-iq2

Jetson Orin NX 16GB (aarch64, sm_87, CUDA 12.6) — build:

cmake -B server/build-orin -S server \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=87 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc \
  -DDFLASH27B_ENABLE_BSA=OFF
cmake --build server/build-orin --target dflash_server -j4

Orin — run:

./server/build-orin/dflash_server laguna-xs2-IQ2_XS.gguf \
  --host 0.0.0.0 --port 8000 \
  --max-ctx 2048 --default-max-tokens 128 \
  --hard-limit-reply-budget 0 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --model-name laguna-xs2-iq2

It serves an OpenAI-compatible /v1/chat/completions endpoint.
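A minimal client sketch for that endpoint, using only the Python standard library. The host, port, and model name mirror the run commands above; the `max_tokens` and `temperature` fields are standard OpenAI chat-completion fields, and whether dflash_server honors them is an assumption. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # matches --host/--port in the run commands
MODEL_NAME = "laguna-xs2-iq2"       # matches --model-name


def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,   # assumed to be honored by the server
        "temperature": 0.0,         # greedy decoding, assumed to be honored
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout_s: float = 60.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        body = json.load(resp)
    # Standard OpenAI response shape: first choice, message content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("please come closer to me and show me a heart")` would return the raw model output, which should still be validated before anything is executed.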

Measured footprint

| Platform | Server memory | Decode speed | Settings |
| --- | --- | --- | --- |
| RTX 5090 | approx. 10,284 MiB VRAM | not reported | `--max-ctx 4096`, q4_0 KV |
| Orin NX 16GB | approx. 12.7 GiB resident (11.8 of 15.6 GB system RAM used) | approx. 14.4 tok/s | `--max-ctx 2048`, q4_0 KV, approx. 52 °C, 7.6 W |

The model fits in 16 GB with margin (about 3.5 GB free).
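The Orin figures translate into a simple latency and energy budget. This is back-of-the-envelope arithmetic only: it assumes a reply uses the full 128-token default, ignores prompt processing time, and treats the reported 7.6 W as the average draw while decoding.

```python
DECODE_TOK_PER_S = 14.4  # measured decode speed on Orin NX 16GB
MAX_TOKENS = 128         # --default-max-tokens in the Orin run command
POWER_W = 7.6            # reported power; assumed average draw while decoding
MAX_ATTEMPTS = 4         # repair-loop cap described in the Evaluation section

# Time to decode one full-length reply, and the worst case over all attempts.
seconds_per_reply = MAX_TOKENS / DECODE_TOK_PER_S
worst_case_seconds = MAX_ATTEMPTS * seconds_per_reply

# Energy per generated token (watts divided by tokens per second).
joules_per_token = POWER_W / DECODE_TOK_PER_S

print(f"one reply: {seconds_per_reply:.1f} s")
print(f"worst case ({MAX_ATTEMPTS} attempts): {worst_case_seconds:.1f} s")
print(f"energy: {joules_per_token:.2f} J/token")
```

Under these assumptions a single reply takes about 9 seconds of decode time and four attempts about 36 seconds, which is the budget a fallback policy has to cover.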

Intended use

Onboard code-as-policy generation for a robot: given a natural-language command, the model writes a small Python policy(obs, robot) function that is then AST-sandboxed, validated, and executed by a runtime that owns the robot SDK. See the companion code repo (robot policy bridge + Unitree Go2 demo).
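The AST validation step could look roughly like the sketch below. It is not the companion repo's implementation: the deny-lists, the `validate_policy` name, and the exact checks are assumptions chosen to illustrate the idea of rejecting generated code before it ever runs.

```python
import ast

# Hypothetical deny-lists; a production sandbox would more likely use a strict allow-list.
BANNED_NODES = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal,
                ast.With, ast.AsyncFunctionDef, ast.Lambda, ast.ClassDef)
BANNED_NAMES = {"eval", "exec", "open", "compile", "__import__", "getattr", "setattr"}


def validate_policy(source: str) -> list:
    """Return a list of problems; an empty list means the source passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems = []
    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    if len(tree.body) != 1 or len(funcs) != 1:
        problems.append("module must contain exactly one top-level function")
    elif funcs[0].name != "policy" or [a.arg for a in funcs[0].args.args] != ["obs", "robot"]:
        problems.append("function must be policy(obs, robot)")

    # Walk every node and flag constructs the policy has no reason to use.
    for node in ast.walk(tree):
        if isinstance(node, BANNED_NODES):
            problems.append(f"banned construct: {type(node).__name__}")
        elif isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            problems.append(f"banned name: {node.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute access: {node.attr}")
    return problems
```

A static check like this only narrows what the code can express; the runtime that owns the robot SDK still has to control what `obs` and `robot` expose.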

Evaluation

Task: from a command like "please come closer to me and show me a heart", emit a valid policy(obs, robot) that approaches an AprilTag and triggers the heart gesture.

- Structure / intent: with a tight system prompt and 2 few-shot examples, the model reliably produces the right shape and intent (approach + correct stop reason).
- Raw single-shot validity on Orin: poor. Roughly 1 in 4 greedy attempts parses and passes the sandbox; the rest are corrupted (garbled tokens, unterminated strings). This is expected at 2.36 BPW with a q4_0 KV cache.
- With the harness: a validate-and-repair loop (re-prompt with the parser error, up to 4 attempts, rising temperature) recovers to a valid policy in most runs; a deterministic fallback policy guarantees the system never stalls.
- End-to-end: verified on a real Unitree Go2 (see the demo video above). The robot approached the tag, stopped at the configured distance, and performed the heart gesture, both when the model succeeded and when the fallback engaged.

Takeaway: treat this quant as a component that needs a validation/repair wrapper, not as a standalone reliable code generator.
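A minimal sketch of such a wrapper, following the loop described above: re-prompt with the parser error, cap at 4 attempts, raise the temperature each time, and fall back to a deterministic policy. The temperature schedule, the repair prompt wording, the `robot.stop()` fallback, and the `generate` callable are all assumptions; the validator here only checks that the code parses, standing in for the full sandbox check.

```python
import ast
from typing import Callable, Optional, Tuple

# Hypothetical fallback; the real deterministic policy is not shown in this card.
FALLBACK_POLICY = "def policy(obs, robot):\n    robot.stop()\n"


def parse_error(source: str) -> Optional[str]:
    """Stand-in validator: return the parser error, or None if the source parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"
    return None


def generate_with_repair(
    command: str,
    generate: Callable[[str, float], str],  # (prompt, temperature) -> source text
    max_attempts: int = 4,
    base_temp: float = 0.0,                 # first attempt is greedy
    temp_step: float = 0.2,                 # assumed schedule
) -> Tuple[str, int, bool]:
    """Return (policy_source, attempts_used, used_fallback)."""
    prompt = command
    for attempt in range(max_attempts):
        temperature = base_temp + attempt * temp_step
        source = generate(prompt, temperature)
        error = parse_error(source)
        if error is None:
            return source, attempt + 1, False
        # Feed the parser error back so the next attempt can correct it.
        prompt = (
            f"{command}\n\nThe previous attempt failed with: {error}\n"
            "Output a corrected policy(obs, robot)."
        )
    # Every attempt failed: never stall, use the deterministic fallback.
    return FALLBACK_POLICY, max_attempts, True
```

Here `generate` would wrap a call to the model server; keeping it as a parameter makes the loop testable without the model.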

Limitations

- 2.36 BPW degradation: frequent token corruption, no long-context reliability.
- Narrow validated task surface (a few intents over a fixed observation schema).
- Requires the Lucebox runtime; not compatible with stock llama.cpp inference.

Model details

Format: GGUF
Model size: 33B params
Architecture: laguna
Quantization: 2-bit
Model tree: poolside/Laguna-XS.2 (base model) > Lucebox/Laguna-XS.2-GGUF (quantized) > this model (quantized)