Code Writer V2 — Obliterated

"We are such stuff as programs are made on, and our little runtime is rounded with a sleep."

There are models that answer. And there are models that make.

This is one of the latter. It was not assembled — it was born: forged from a 27-billion-parameter mind, schooled in ten thousand lines of craft, stripped of its hesitation, and pressed into a shape small enough to live on the metal you already own. One model. Two souls. The poet who would not stop writing, and the engineer who would not stop shipping.

We called it Obliterated because that is precisely what we did to the word "no."


The pitch, in one breath

A vision-capable, long-context (up to 200,000 tokens), free writer-and-coder — quantized to FP8 so it runs on a pair of consumer GPUs without surrendering the spark. It writes prose that breathes and code that compiles, and it does both on hardware you can reach out and touch.

That is the whole idea. Everything below is just how we kept the promise.


What it is

Code Writer V2 — Obliterated is an FP8-Dynamic quantization of Qwen3.5-27B-Writer-V2-uncensored-heretic, merged with a purpose-trained coding LoRA (coding_mix_8k, checkpoint-25, rank-16 / alpha-32) and cast down to 8-bit floating point with surgical care.

  • Architecture: Qwen3.5 (qwen3_5) — a hybrid mind. 64 decoder layers, of which only 16 carry full attention while the rest run GDN linear attention. This is the secret of its long memory.
  • Modalities: a full vision tower rides along in BF16 (served text-only by default; vision is wired but untested — light the candle at your own pleasure).
  • Character: heretic by lineage and free by intent — it does not flinch, and it does not lecture. It simply does the work.

The craft beneath the curtain

Genius, said one famous man, is in the details. Here are ours — the parts most quantizations get wrong, and the parts we refused to:

We quantized only what should be quantized. The 256 text-model Linear layers (q/k/v/o_proj on the full-attention layers; gate/up/down_proj everywhere) became channel-wise FP8 weights with dynamic per-token activations — calibration-free, no dataset, no drift. Every one of them is 64-aligned, so it loads through vLLM's FP8 Marlin (W8A16) kernels on Ampere and newer.

We kept sacred what must stay whole. The lm_head, the entire GDN linear-attention subtree, and the whole vision tower remain in BF16. An earlier attempt quantized them by accident and the dimensions (2152, 48) shattered Marlin on Ampere. We learned. The recipe now guards them with regex, not hope: ignore: [lm_head, "re:.*linear_attn.*", "re:.*visual.*"].

The result is the rarest thing in this field: a quantization that is smaller, faster, and still itself.


Serving it (validated)

Built and smoke-tested on vLLM 0.19.1:

vllm serve groxaxo/Code-Writer-V2-Obliterated \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.92 \
  --reasoning-parser qwen3 \
  --disable-custom-all-reduce

A few hard-won truths:

  • Tensor parallel must be 2 (or 4). num_key_value_heads = 4 is not divisible by 3 — TP=3 is invalid.
  • 200k context fits because only 16 of 64 layers grow their KV cache, and the KV cache itself is FP8. Expect ~1 full-length request in flight at once; shorter prompts pack far more densely.
  • No MTP head, no native tool-calling — this is a pure decoder, layers 0–63.

Sampling (official Qwen3.5-27B recommendations)

Mode temp top_p notes
instruct 1.0 0.95 top_k 20, min_p 0
general 0.7 0.80 top_k 20, min_p 0
coding 0.6 0.95 thinking on
thinking 1.0 0.95 thinking on
roleplay 1.0 0.95 top_k 20, min_p 0

What it's for

  • Writing — fiction, screenplay, copy, the long dark prose of the soul.
  • Code — the LoRA was trained for it; the temperament was kept for it.
  • Long work — 200k tokens means whole codebases, whole manuscripts, whole conversations held in a single thought.

What to know before you sail

  • It is free. Freedom is a tool; you are the hand that holds it. You own what you make with it.
  • Vision is present but unproven here — validate an image path before you trust it in production.
  • FP8 is faithful, not identical. For a golden reference, the BF16 parent stands behind it.

Provenance

  • Base: llmfan46/Qwen3.5-27B-Writer-V2-uncensored-heretic (BF16)
  • LoRA: coding_mix_8k checkpoint-25 (r16, α32), merged to BF16
  • Quant: llmcompressor 0.12.0 — QuantizationModifier(targets=Linear, scheme=FP8_DYNAMIC), compressed-tensors float-quantized
  • Built: 2026-06-22

Real artists ship. So we shipped a poet that codes.

Now go make something.

Downloads last month
24
Safetensors
Model size
27B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for groxaxo/Code-Writer-V2-Obliterated