Code Writer V2 — Obliterated
"We are such stuff as programs are made on, and our little runtime is rounded with a sleep."
There are models that answer. And there are models that make.
This is one of the latter. It was not assembled — it was born: forged from a 27-billion-parameter mind, schooled in ten thousand lines of craft, stripped of its hesitation, and pressed into a shape small enough to live on the metal you already own. One model. Two souls. The poet who would not stop writing, and the engineer who would not stop shipping.
We called it Obliterated because that is precisely what we did to the word "no."
The pitch, in one breath
A vision-capable, long-context (up to 200,000 tokens), free writer-and-coder — quantized to FP8 so it runs on a pair of consumer GPUs without surrendering the spark. It writes prose that breathes and code that compiles, and it does both on hardware you can reach out and touch.
That is the whole idea. Everything below is just how we kept the promise.
What it is
Code Writer V2 — Obliterated is an FP8-Dynamic quantization of
Qwen3.5-27B-Writer-V2-uncensored-heretic, merged with a purpose-trained
coding LoRA (coding_mix_8k, checkpoint-25, rank-16 / alpha-32) and cast
down to 8-bit floating point with surgical care.
- Architecture: Qwen3.5 (`qwen3_5`), a hybrid mind. 64 decoder layers, of which only 16 carry full attention while the rest run GDN linear attention. This is the secret of its long memory.
- Modalities: a full vision tower rides along in BF16 (served text-only by default; vision is wired but untested, so light the candle at your own pleasure).
- Character: heretic by lineage and free by intent — it does not flinch, and it does not lecture. It simply does the work.
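For readers new to LoRA merging, the usual formulation is sketched below. It assumes the conventional scaling factor of alpha over rank (not a rank-stabilized variant), which the card does not state; the symbols W, A, and B are illustrative names for the frozen weight and the two low-rank adapter matrices.

```latex
W_{\text{merged}} = W + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\;
A \in \mathbb{R}^{r \times d_{\text{in}}},
\qquad \frac{\alpha}{r} = \frac{32}{16} = 2
```

Under that assumption, a rank-16 / alpha-32 adapter is folded into the base weights at twice the raw product of its two matrices, after which no adapter needs to be loaded at inference time.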
The craft beneath the curtain
Genius, said one famous man, is in the details. Here are ours — the parts most quantizations get wrong, and the parts we refused to:
We quantized only what should be quantized. The 256 text-model Linear layers (`q/k/v/o_proj` on the full-attention layers; `gate/up/down_proj` everywhere) became channel-wise FP8 weights with dynamic per-token activations: calibration-free, no dataset, no drift. Every one of them is 64-aligned, so it loads through vLLM's FP8 Marlin (W8A16) kernels on Ampere and newer.
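The figure of 256 quantized layers can be checked with a little arithmetic. The sketch below is only an illustration: the layer counts come from this card, but the module paths and the choice of every fourth layer as a full-attention layer are assumptions made for the example (the total does not depend on which 16 layers are chosen).

```python
# Layer counts are taken from the card. Which layer indices carry full
# attention is an ASSUMPTION (every 4th layer); it does not change the total.
NUM_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4  # hypothetical: yields 16 full-attention layers

ATTN_PROJS = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_PROJS = ["gate_proj", "up_proj", "down_proj"]


def is_full_attention(layer_idx, interval=FULL_ATTENTION_INTERVAL):
    """True for the layers assumed to use full attention in this sketch."""
    return (layer_idx + 1) % interval == 0


def quantized_linear_names(num_layers=NUM_LAYERS):
    """List the Linear modules that would be cast to FP8 under these assumptions."""
    names = []
    for i in range(num_layers):
        if is_full_attention(i):
            # Attention projections exist only on full-attention layers.
            names += [f"layers.{i}.self_attn.{p}" for p in ATTN_PROJS]
        # MLP projections exist on every decoder layer.
        names += [f"layers.{i}.mlp.{p}" for p in MLP_PROJS]
    return names


names = quantized_linear_names()
full_attn_layers = sum(1 for i in range(NUM_LAYERS) if is_full_attention(i))
print(full_attn_layers, len(names))  # 16 256
```

That is 16 × 4 attention projections plus 64 × 3 MLP projections, which matches the 256 stated above.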
We kept sacred what must stay whole. The `lm_head`, the entire GDN linear-attention subtree, and the whole vision tower remain in BF16. An earlier attempt quantized them by accident and the dimensions (2152, 48) shattered Marlin on Ampere. We learned. The recipe now guards them with regex, not hope: `ignore: [lm_head, "re:.*linear_attn.*", "re:.*visual.*"]`.
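To see how such an ignore list separates modules, here is a minimal sketch using only Python's standard `re` module. It is not the quantization library's own matching code: treating `re:` entries as patterns applied with `re.match` and all other entries as exact names is an assumption of this example, and the module names are hypothetical.

```python
import re

# Ignore list as quoted in the card. The "re:" prefix marks a regex entry;
# entries without it are treated as exact names (an ASSUMPTION of this sketch).
IGNORE = ["lm_head", "re:.*linear_attn.*", "re:.*visual.*"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name would stay in BF16 under this sketch."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.visual.blocks.3.attn.qkv",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
]
for name in examples:
    verdict = "keep BF16" if is_ignored(name) else "quantize to FP8"
    print(f"{name}: {verdict}")
```

The point of the regex entries is that they cover whole subtrees by name, so a newly added submodule under `linear_attn` or `visual` is excluded without editing the list.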
The result is the rarest thing in this field: a quantization that is smaller, faster, and still itself.
Serving it (validated)
Built and smoke-tested on vLLM 0.19.1:
vllm serve groxaxo/Code-Writer-V2-Obliterated \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--kv-cache-dtype fp8 \
--max-model-len 200000 \
--gpu-memory-utilization 0.92 \
--reasoning-parser qwen3 \
--disable-custom-all-reduce
A few hard-won truths:
- Tensor parallel must be 2 (or 4). `num_key_value_heads = 4` is not divisible by 3, so TP=3 is invalid.
- 200k context fits because only 16 of 64 layers grow their KV cache, and the KV cache itself is FP8. Expect ~1 full-length request in flight at once; shorter prompts pack far more densely.
- No MTP head, no native tool-calling — this is a pure decoder, layers 0–63.
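The first two points above can be made concrete with a back-of-the-envelope calculation. This is a rough sketch, not a measurement: the head dimension of 256 is an assumption (the card does not state it), and the estimate ignores the linear-attention state, weights, and activations.

```python
NUM_KV_HEADS = 4        # from the card
FULL_ATTN_LAYERS = 16   # from the card
TOTAL_LAYERS = 64       # from the card
MAX_TOKENS = 200_000    # from the card
HEAD_DIM = 256          # ASSUMPTION: not stated in the card


def tp_divides_kv_heads(tp, num_kv_heads=NUM_KV_HEADS):
    """KV heads must split evenly across tensor-parallel ranks."""
    return num_kv_heads % tp == 0


def kv_cache_bytes(tokens, layers, bytes_per_elem,
                   num_kv_heads=NUM_KV_HEADS, head_dim=HEAD_DIM):
    """Total KV-cache size; the factor 2 accounts for keys and values."""
    return 2 * layers * num_kv_heads * head_dim * bytes_per_elem * tokens


for tp in (1, 2, 3, 4):
    print(f"TP={tp}: divides KV heads = {tp_divides_kv_heads(tp)}")

# Hybrid layout with an FP8 cache (1 byte per element) versus a hypothetical
# model where all 64 layers keep a BF16 cache (2 bytes per element).
hybrid_fp8 = kv_cache_bytes(MAX_TOKENS, FULL_ATTN_LAYERS, 1)
dense_bf16 = kv_cache_bytes(MAX_TOKENS, TOTAL_LAYERS, 2)
print(f"hybrid FP8: {hybrid_fp8 / 2**30:.1f} GiB")
print(f"dense BF16: {dense_bf16 / 2**30:.1f} GiB")
print(f"ratio: {dense_bf16 // hybrid_fp8}x")
```

Whatever the true head dimension is, the ratio is the same: a quarter of the layers at half the bytes per element gives an eightfold smaller cache. The divisibility check also passes for TP=1; it says nothing about whether the weights fit on a single GPU.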
Sampling (official Qwen3.5-27B recommendations)
| Mode | temp | top_p | notes |
|---|---|---|---|
| instruct | 1.0 | 0.95 | top_k 20, min_p 0 |
| general | 0.7 | 0.80 | top_k 20, min_p 0 |
| coding | 0.6 | 0.95 | thinking on |
| thinking | 1.0 | 0.95 | thinking on |
| roleplay | 1.0 | 0.95 | top_k 20, min_p 0 |
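One way to apply these presets is to send them with each request to the server's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and makes a few assumptions worth checking against your own setup: the server listens on `localhost:8000`, it accepts `top_k` and `min_p` as extra fields in the request body, and the chat template honours an `enable_thinking` switch passed through `chat_template_kwargs`. The helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "groxaxo/Code-Writer-V2-Obliterated"

# Values copied from the table; settings a row does not mention are not sent.
SAMPLING = {
    "instruct": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0},
    "general": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "min_p": 0},
    "coding": {"temperature": 0.6, "top_p": 0.95, "thinking": True},
    "thinking": {"temperature": 1.0, "top_p": 0.95, "thinking": True},
    "roleplay": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0},
}


def build_payload(mode, prompt):
    """Build a chat-completions request body for one of the table's modes."""
    params = dict(SAMPLING[mode])
    thinking = params.pop("thinking", None)
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        **params,
    }
    if thinking is not None:
        # ASSUMPTION: the chat template exposes an enable_thinking switch.
        payload["chat_template_kwargs"] = {"enable_thinking": thinking}
    return payload


def send(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to a running server (URL is an assumption)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Printing the payload needs no server; call send(...) once one is running.
print(json.dumps(build_payload("coding", "Write a binary search in Python."), indent=2))
```

Keeping the presets in one dictionary makes it easy to switch modes per request without restarting the server.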
What it's for
- Writing — fiction, screenplay, copy, the long dark prose of the soul.
- Code — the LoRA was trained for it; the temperament was kept for it.
- Long work — 200k tokens means whole codebases, whole manuscripts, whole conversations held in a single thought.
What to know before you sail
- It is free. Freedom is a tool; you are the hand that holds it. You own what you make with it.
- Vision is present but unproven here — validate an image path before you trust it in production.
- FP8 is faithful, not identical. For a golden reference, the BF16 parent stands behind it.
Provenance
- Base: `llmfan46/Qwen3.5-27B-Writer-V2-uncensored-heretic` (BF16)
- LoRA: `coding_mix_8k` checkpoint-25 (r16, α32), merged to BF16
- Quant: llmcompressor 0.12.0, `QuantizationModifier(targets=Linear, scheme=FP8_DYNAMIC)`, compressed-tensors `float-quantized`
- Built: 2026-06-22
Real artists ship. So we shipped a poet that codes.
Now go make something.
Model tree for groxaxo/Code-Writer-V2-Obliterated
Base model
Qwen/Qwen3.5-27B