Laguna-M.1 GGUF

GGUF quantizations of poolside/Laguna-M.1. Converted and tested with a patched fork of llama.cpp that implements LLM_ARCH_LAGUNA from scratch.

Requires the custom fork. Upstream llama.cpp does not support this architecture. Fork: https://github.com/linuxid10t/llama.cpp-add-laguna

The same fork also converts and runs the Laguna-XS.2 sibling (33B-A3B, mixed SWA + global attention, per-head attention gate, half-rotary global layers); see its own repo.

Laguna-M.1 is the all-full-attention member of the family: every layer attends over the full context, uses a per-element attention gate, and applies full rotary on every layer. It has no sliding-window layers at all.

Files

| File | Quant | Size | Notes |
| --- | --- | --- | --- |
| Laguna-M.1-f16.gguf | f16 | ~420 GB | Unquantized (half precision), reference |
| Laguna-M.1-Q4_K_M.gguf | Q4_K_M | ~125 GB | Recommended for most users |
| Laguna-M.1-IQ4_XS.gguf | IQ4_XS | ~115 GB | Smallest practical quant |

(Sizes are approximate. This is a ~226B-parameter MoE, so even Q4_K_M is ~125 GB: running it requires either a multi-GPU box with enough VRAM or a machine with plentiful RAM and the patience for mmap'd CPU inference.)

Usage

git clone https://github.com/linuxid10t/llama.cpp-add-laguna
cd llama.cpp-add-laguna && cmake -B build && cmake --build build -j$(nproc)

./build/bin/llama-cli \
  -m Laguna-M.1-Q4_K_M.gguf \
  --ctx-size 262144 \
  --temp 0 \
  -p "The capital of France is"
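
The --ctx-size 262144 setting above has a memory cost beyond the weights, because every layer of M.1 caches the full context. A rough estimate can be sketched from the values in the Architecture section (70 layers, 8 KV heads, head dim 128). The f16 cache element size is an assumption, the function name is illustrative, and compute buffers are not counted:

```python
# Rough KV-cache size estimate for a full-attention GQA model.
# Layer/head values come from this card; the 2-byte (f16) cache element
# size is an assumption, and compute buffers are ignored.

def kv_cache_bytes(n_ctx: int, n_layer: int = 70, n_kv_head: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold K and V for n_ctx tokens across all layers."""
    per_token = 2 * n_layer * n_kv_head * head_dim * bytes_per_elem  # K + V
    return n_ctx * per_token

GIB = 1024 ** 3
for n_ctx in (32768, 262144):
    print(f"{n_ctx:>7} tokens: {kv_cache_bytes(n_ctx) / GIB:.2f} GiB")
```

Under these assumptions the cache needs about 8.75 GiB at 32,768 tokens and 70 GiB at the full 262,144, so a smaller --ctx-size is the easier starting point on memory-constrained machines.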

Chat / thinking mode: --jinja required

Laguna-M.1 ships the laguna_glm_thinking_v4 chat template. The fork's built-in auto-detector recognizes only the v5 marker used by Laguna-XS.2, so it does not match M.1 and the CLI reports "custom template not supported". Pass --jinja to use the template embedded in the GGUF (the converter resolves and writes it directly):

# Thinking on (default): model prefills <think> and generates a reasoning trace
./build/bin/llama-cli -m Laguna-M.1-Q4_K_M.gguf -cnv --jinja --ctx-size 32768

# Thinking off: direct answer, no reasoning trace
./build/bin/llama-cli -m Laguna-M.1-Q4_K_M.gguf -cnv --jinja --ctx-size 32768 --reasoning off

The thinking-mode prefix, EOT token (</assistant>, token 24), and stop-word stripping are all handled automatically once --jinja is supplied.

Architecture

Per config.json (Laguna-M.1 has no sliding-window layers: sliding_window = 0, layer_types all full_attention):

| Property | Value |
| --- | --- |
| Parameters | ~226B total, ~22B active per token |
| Layers | 70 (3 dense + 67 sparse) |
| Attention heads | 64 (uniform, no per-layer head counts) |
| KV heads | 8 (GQA) |
| Head dim | 128 |
| Q/K norm | RMSNorm per head |
| Attention gate | Per-element softplus gate on SDPA output, applied on all layers |
| Sliding window | None (full attention on every layer) |
| Experts | 256 routed (top-16) + 1 shared |
| Dense layers | Layers 0–2 (mlp_only_layers = [0, 1, 2]) |
| Dense FFN intermediate | 16,384 |
| Expert FFN intermediate | 1,024 (routed and shared) |
| MoE router | Sigmoid + e_score_correction_bias (added at selection, not routing) |
| Routing scale | moe_routed_scaling_factor = 1.0, L1-normalized weights (norm_topk_prob) |
| RoPE | YaRN: base 500K, factor 64, original_max 4096, β_fast 64, β_slow 1, attention_factor 1.0 |
| Rotary | Full rotary on every layer (partial_rotary_factor = 1.0) |
| Context length | 262,144 tokens |
| Vocab | 100,352 |
| Norm eps | 1e-6 |

Implementation Notes

Attention gate (per-element): Every layer has a self_attn.g_proj that projects the hidden state (4096) to num_heads × head_dim = 64 × 128 = 8192 values: one gate per attention output element, not one per head. The projection passes through softplus and is multiplied element-wise into the SDPA output before o_proj. The converter detects the mode from the actual g_proj tensor shape (per-element ⟺ out_features == n_head × head_dim) and writes the attention.gate_per_head key accordingly (here false). This matters because the gating config is written inconsistently across the family (a mode string on M.1, a bool on XS.2), so the tensor shape is the only reliable source. laguna.cpp declares the gate tensor as {4096, 8192} and applies a plain ggml_mul. (Laguna-XS.2 instead uses per-head gating: g_proj outputs num_heads scalars, broadcast across head_dim.)
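
The shape-based detection and the two gating modes can be illustrated with a minimal pure-Python sketch. The function names are hypothetical and do not come from the converter or from laguna.cpp, and the sketch assumes both modes use the same softplus activation (the card only states this for M.1):

```python
import math

def gate_is_per_head(out_features: int, n_head: int, head_dim: int) -> bool:
    """Infer the gate mode from the g_proj output width (hypothetical helper)."""
    if out_features == n_head * head_dim:
        return False  # per-element: one gate per attention output value (M.1)
    if out_features == n_head:
        return True   # per-head: one scalar per head (XS.2)
    raise ValueError(f"unexpected g_proj width {out_features}")

def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def apply_gate(attn_out, gate_logits, head_dim: int, per_head: bool):
    """Multiply a flat SDPA output (n_head * head_dim values) by its gate."""
    gated = []
    for i, a in enumerate(attn_out):
        # Per-head mode broadcasts one logit across the head's head_dim values.
        logit = gate_logits[i // head_dim] if per_head else gate_logits[i]
        gated.append(a * softplus(logit))
    return gated

print(gate_is_per_head(8192, 64, 128))  # M.1 shape from this card
```

With the M.1 shape (8192 outputs for 64 heads of dim 128) the helper returns False, which corresponds to attention.gate_per_head = false as described above.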

Full rotary, single RoPE: With partial_rotary_factor = 1.0, every layer rotates the entire head dimension (128). The converter writes rope.dimension_count = 128; because there are no sliding layers, rope.dimension_count_swa is omitted. There is only one rope config (full_attention), used on all 70 layers.

No SWA path: sliding_window = 0 and every layer is full_attention, so is_swa_any() is false and the model sets swa_type = NONE. The graph builder takes the plain build_attn_inp_kv() / build_attn() path on every layer rather than the iswa (interleaved sliding-window) path used by XS.2. (This is required by llama_model::create_memory, which asserts swa_type != NONE iff there are SWA layers; leaving it STANDARD with zero SWA layers crashes at context creation.)
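
The invariant described above can be written down as a small Python sketch. This is an illustration of the rule, not the fork's C++ code: the enum, the function names, and the "sliding_attention" label for SWA layers are all assumptions made for the example.

```python
from enum import Enum

class SwaType(Enum):
    NONE = 0
    STANDARD = 1

# Assumed label for sliding-window layers; this card only shows "full_attention".
SWA_LABEL = "sliding_attention"

def pick_swa_type(sliding_window: int, layer_types: list) -> SwaType:
    """Choose NONE unless the config actually contains sliding-window layers."""
    has_swa = sliding_window > 0 and SWA_LABEL in layer_types
    return SwaType.STANDARD if has_swa else SwaType.NONE

def memory_invariant_holds(swa_type: SwaType, layer_types: list) -> bool:
    """swa_type != NONE must be true exactly when SWA layers exist."""
    return (swa_type != SwaType.NONE) == (SWA_LABEL in layer_types)

m1_layers = ["full_attention"] * 70
print(pick_swa_type(0, m1_layers))                         # SwaType.NONE
print(memory_invariant_holds(SwaType.STANDARD, m1_layers)) # False
```

The second call models the failure case from the text: STANDARD with zero SWA layers violates the invariant.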

MoE routing: Top-16 of 256 experts per token, plus one shared expert on every sparse layer. Router logits pass through sigmoid (not softmax), biased by a per-expert e_score_correction_bias that is added only during top-k selection, not during weight computation. Routing weights are L1-normalized (norm_topk_prob = true) before scaling by moe_routed_scaling_factor (= 1.0 on M.1; = 2.5 on XS.2). Layers 0–2 are dense FFN (intermediate_size = 16384); layers 3–69 are sparse.
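
The routing rule above can be sketched in a few lines of pure Python. This is a toy illustration for a single token, not the fork's implementation: the function names are hypothetical, and it assumes the correction bias is added to the post-sigmoid scores, as the ordering in the description suggests.

```python
import math

def sigmoid(x: float) -> float:
    # Stable form that avoids overflow for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def route(logits, bias, top_k=16, scale=1.0, norm_topk_prob=True):
    """Return (expert indices, routing weights) for one token."""
    scores = [sigmoid(x) for x in logits]
    # The bias influences only which experts are selected...
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # ...while the weights come from the unbiased sigmoid scores.
    weights = [scores[i] for i in chosen]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]  # L1 normalization
    return chosen, [w * scale for w in weights]

# Toy example: 3 experts, top-2. Expert 2 is selected only because of its bias,
# yet its weight still reflects its low unbiased score.
print(route([2.0, 0.0, -1.0], [0.0, 0.0, 5.0], top_k=2))
```

For M.1 the call would use 256 logits with top_k=16 and scale=1.0; the shared expert runs unconditionally and is not part of this selection.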

Tensor layout: Same split-per-expert checkpoint layout and MoE structure as Laguna-XS.2. The converter buffers the per-expert gate_proj / up_proj / down_proj tensors and stacks them into merged 3D tensors, with a single shared_expert and e_score_correction_bias located under experts (not gate).
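
The buffer-then-stack step can be pictured with the following sketch, which uses nested lists in place of real tensors. The checkpoint name pattern, the merged output name, and the class itself are hypothetical and chosen only for illustration; the actual converter may organize this differently.

```python
import re
from collections import defaultdict

# Hypothetical checkpoint naming; the real tensor names may differ.
EXPERT_RE = re.compile(
    r"layers\.(\d+)\.mlp\.experts\.(\d+)\.(gate_proj|up_proj|down_proj)\.weight$"
)

class ExpertStacker:
    """Buffer per-expert 2D tensors and emit one stacked 3D tensor per group."""

    def __init__(self, n_expert: int):
        self.n_expert = n_expert
        self.pending = defaultdict(dict)  # (layer, proj) -> {expert_id: tensor}

    def add(self, name: str, tensor):
        """Return (merged_name, stacked) once a group is complete, else None."""
        m = EXPERT_RE.search(name)
        if m is None:
            return None  # not a per-expert tensor; handled elsewhere
        layer, expert, proj = int(m.group(1)), int(m.group(2)), m.group(3)
        group = self.pending[(layer, proj)]
        group[expert] = tensor
        if len(group) < self.n_expert:
            return None  # still waiting for the remaining experts
        # Stack in expert order so index e of the result is expert e.
        stacked = [group[e] for e in range(self.n_expert)]
        del self.pending[(layer, proj)]
        return f"layers.{layer}.mlp.experts.{proj}.merged", stacked
```

Tensors may arrive in any order; a group is emitted only when all n_expert slices for a given layer and projection have been seen, which keeps the merged tensor's first axis aligned with the expert index.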

Stop token: </assistant> is token 24, a regular vocabulary token (not a special token). The converter registers it as an EOT (config eos_token_id = [2, 24]), and the fork adds it to antiprompt so the stop-word erase logic strips it from streaming output.

Tested

  • f16 GGUF loads and runs under the fork (architecture, tensor, and KV metadata all validated through a successful conversion + load) ✓

⚠️ No quality validation yet. Numerical validation against HF Transformers requires CUDA/ROCm and was not performed. On CPU (mmap), the ~226B MoE is I/O-bound (roughly tens of seconds per token on 60 GB RAM), so prompt-quality spot checks were impractical. Treat these weights as structurally correct and loadable, not as a verified-quality release.

Known Limitations

  • No SWA, no per-layer head counts, no partial rotary: these XS.2 features do not apply to M.1; don't expect the same command-line behavior (e.g. there is nothing for the sliding-window KV cache to do here).
  • Chat template requires --jinja (see above).
  • Numerical validation against HF Transformers not yet done (requires CUDA/ROCm).
  • Q4_K_M vs f16 top-1 token agreement not formally checked (the 420 GB f16 GGUF exceeds available RAM for a full comparison).