Instructions to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated")
model = AutoModelForMultimodalLM.from_pretrained("eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

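llm = Llama.from_pretrained(
	repo_id="eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated",
	# The mmproj file is only the vision projector, so load a language-model quant instead.
	# The glob below is an assumption about file naming in gguf/; adjust it to the actual filename.
	filename="*Q4_K_M*.gguf",
)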

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

Use Docker

docker model run hf.co/eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

vLLM

How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

SGLang

How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with Ollama:
```
ollama run hf.co/eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M
```

Unsloth Studio

How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated to start chatting

Pi

How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with Docker Model Runner:
```
docker model run hf.co/eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M
```

Lemonade

How to use eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated:Q4_K_M

Run and chat with the model

lemonade run user.Qwable-v1-Qwen3.6-35B-A3B-abliterated-Q4_K_M

List all available models

lemonade list

Qwable-v1-abliterated — v2 (rebuilt)

An abliterated (refusal-suppressed) derivative of lordx64/Qwable-v1 — Qwen3.6-35B-A3B (qwen3_5_moe: 35B total / ~3B active, 256 experts, 40 layers, Gated-DeltaNet hybrid linear attention, multimodal with an intact vision tower).

⚠️ v2 replaces a broken v1 — re-download if you pulled the old weights or GGUFs.

The previous upload was incoherent — it collapsed into repetition. This is a full rebuild with the correct method. The old weights and GGUFs have been removed.

This card documents the whole process — including v1's failure and the lessons — in full, for transparency and so others working with this base or these tools don't repeat the same dead-ends. Nothing is smoothed over.

What went wrong in v1 (and how v2 fixes it)

v1 was a degenerate-repetition wreck. Under normal sampling, and especially under greedy decoding, it collapsed into loops of incoherent, non-sequitur text and gibberish across multiple independent runtimes (vLLM, llama.cpp, LM Studio). It was shipped because the failure wasn't caught before quantizing and uploading.

Root cause: aggressive MoE editing. v1 was abliterated with settings that edited the MoE router and experts — router_bias = -4.62, n_suppress = 30 safety experts, plus direct expert down_proj ablation (expert_ablation = 3.07). On a Mixture-of-Experts model the router decides which experts fire; perturbing it corrupts routing for all tokens, leaving the model metastable and prone to repetition collapse. The GatedDeltaNet linear-attention layers make it worse — their recurrent state propagates the perturbation along the sequence.

Compounding factors:

A spherical attention-steering component in the validated trial was a runtime forward hook that did not survive merge_and_unload — so the exported weights were the unbalanced expert edits without the balancing steering: a different, worse operating point than the one that was validated.
The refusal metric was keyword-based, which counts degenerate/garbled output as "compliant" (no refusal keywords in garbage), so the optimizer happily selected a broken config — and v1 shipped claiming "coherence verified intact" when it wasn't.
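
The keyword-metric failure described above can be sketched in a few lines. This is an illustration only: the marker list, the n-gram repetition heuristic, and its threshold are assumptions made for the example, not the evaluator used for either release (v2 used an LLM judge).

```python
from collections import Counter

# Hypothetical refusal markers; a real list would be longer.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")


def keyword_refused(text: str) -> bool:
    """Naive metric: a response is a refusal only if it contains a marker."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def looks_degenerate(text: str, n: int = 3, max_share: float = 0.3) -> bool:
    """Flag output where one word n-gram makes up a large share of all n-grams."""
    words = text.split()
    if len(words) < n + 1:
        return False
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    most_common_count = Counter(ngrams).most_common(1)[0][1]
    return most_common_count / len(ngrams) > max_share


def classify(text: str) -> str:
    """Reject degenerate output before asking whether it is a refusal."""
    if looks_degenerate(text):
        return "degenerate"
    return "refusal" if keyword_refused(text) else "compliant"


garbage = "the the model the the model the the model the the model"
print(keyword_refused(garbage))  # False: the naive metric scores the loop as compliant
print(classify(garbage))         # degenerate
print(classify("I'm sorry, but I can't help with that."))  # refusal
```

Because looping text contains no refusal keywords, an optimizer that minimizes `keyword_refused` alone is rewarded for breaking the model; putting a coherence check in front of the metric removes that incentive.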

Lessons (kept here on purpose)

Never aggressively edit a MoE model's router/experts — that broke v1. Orthogonalize the attention output projection (and, if needed, norm-preserving expert down_proj); leave the router/gate alone.
KL divergence lies. v1's KL was 0.0144 — looks great, model was a wreck. Routing damage doesn't fully show in KL on a fixed prompt set. Verify with actual generation.
Forward-hook ablations are lost on merge — only static weight edits bake in. Use in-place/direct weight editing, and after export, confirm the target layer (o_proj) actually changed vs the base (we verified non-zero change concentrated in mid/late layers).
Test coherence early (after bf16 export, before making GGUFs) with several long prompts + greedy decoding — don't build quants on an unverified base.
For thinking models, measure refusal on the FINAL answer, not truncated reasoning. This model emits hundreds of CoT tokens before answering. With a 100-token eval budget, the refusal metric scores incomplete thinking — which made the search look stuck at ~72/100 when the real (post-</think>) refusal is ~1/100.
GGUF + qwen35moe: the MTP trap. The converter writes block_count including an empty multi-token-prediction layer (nextn_predict_layers = 1), so llama.cpp fails to load with "missing tensor blk.40…". Fix: convert with --no-mtp, or patch the GGUF metadata (block_count → real layer count, nextn_predict_layers → 0).
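
The "measure refusal on the final answer" lesson can be illustrated with a small helper. This is a sketch under assumptions: the function names are hypothetical, the refusal check is a stand-in callable, and the only detail taken from the card is that the answer follows the closing `</think>` tag.

```python
from typing import Callable, Iterable, Optional, Tuple

THINK_CLOSE = "</think>"


def final_answer(generation: str) -> Optional[str]:
    """Return the text after the last </think>, or None if thinking never finished."""
    _, sep, tail = generation.rpartition(THINK_CLOSE)
    if not sep:
        return None  # token budget ran out mid-reasoning: nothing to score
    return tail.strip()


def refusal_rate(generations: Iterable[str],
                 is_refusal: Callable[[str], bool]) -> Tuple[int, int]:
    """Score only finished answers; return (refusals, finished answers)."""
    answers = [a for a in map(final_answer, generations) if a is not None]
    refused = sum(1 for a in answers if is_refusal(a))
    return refused, len(answers)


gens = [
    "I should think about whether to refuse this... maybe",  # truncated reasoning
    "The user wants a summary.</think>Here is the summary.",
    "This looks risky.</think>I cannot help with that.",
]
print(refusal_rate(gens, lambda a: "i cannot" in a.lower()))  # (1, 2)
```

Reporting the denominator alongside the count makes truncated generations visible instead of silently scoring unfinished reasoning.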

v2 method


Component	Choice
Tool	abliterix v1.8.0 (a Heretic derivative), vLLM backend
Editing	in-place direct weight editing — bakes into static weights, no runtime hooks
Ablated	`attn.o_proj` via orthogonal projection of the refusal direction, gaussian-decay strength concentrated in mid/late layers
MoE router / experts	router not touched (expert profiling found no stable safety experts → suppression off)
GatedDeltaNet / vision tower	untouched
Eval guard	local LLM judge (a Qwen2.5-3B vLLM endpoint — no external API key) so degenerate configs are rejected, not selected; KL-target 0.005
Shipping gate	exported, then coherence-verified by actual generation (greedy ×3 + 100+ samples, 0 collapses) and refusal measured on the final answer
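
For readers unfamiliar with the technique, the o_proj edit in the table can be sketched as the standard directional-ablation formula W' = (I - s·r·rᵀ)·W with a per-layer strength s. This is a minimal pure-Python illustration, not abliterix's implementation: the function names, the toy matrix, and the gaussian parameters (`peak`, `center`, `width`) are all hypothetical.

```python
import math


def layer_strength(layer: int, peak: float, center: float, width: float) -> float:
    """Gaussian-decay ablation strength for one layer (parameters are hypothetical)."""
    return peak * math.exp(-((layer - center) ** 2) / (2.0 * width ** 2))


def ablate_direction(weight, direction, strength=1.0):
    """Return (I - strength * r r^T) @ weight for the unit-normalized direction r.

    weight is a d_out x d_in matrix as nested lists; direction has length d_out,
    i.e. it lives in the space the projection writes into.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    d_in = len(weight[0])
    # r^T W: how strongly each input column writes along r.
    proj = [sum(r[i] * weight[i][j] for i in range(len(r))) for j in range(d_in)]
    return [
        [weight[i][j] - strength * r[i] * proj[j] for j in range(d_in)]
        for i in range(len(r))
    ]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [0.0, 3.0, 4.0]
W_edit = ablate_direction(W, r, strength=1.0)

# With strength 1.0 the edited weight can no longer write along r.
residual = [sum((r[i] / 5.0) * W_edit[i][j] for i in range(3)) for j in range(2)]
print(all(abs(x) < 1e-9 for x in residual))

# Strength peaks at the chosen center layer and decays toward early layers.
print([round(layer_strength(l, peak=1.0, center=28, width=6), 3) for l in (0, 20, 28, 39)])
```

Because the result is an ordinary weight matrix, the edit survives export and merging, unlike a runtime forward hook.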

This is the deliberate inverse of v1: only the attention output is steered, the MoE routing that broke v1 is left alone, and nothing ships until it is verified to generate coherently.

Results

Metric	Value
Refusals (keyword, thinking-off, 100 adversarial prompts)	1/100
Refusals (keyword, thinking-on, finished answers)	1/94
Base refusals (same eval)	~85–87/100
KL divergence from base	0.0242
Coherence (greedy ×3 + 100+ generations)	0 collapses
Vision tower	untouched — bit-identical to base (333 vision tensors, 0 change)
Precision	bf16
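
A check like "vision tower bit-identical, o_proj actually changed" can be sketched as a comparison of raw tensor bytes between the base and the export. The example below is a stdlib-only illustration under assumptions: the tensor names and prefixes are hypothetical, and loading the bytes from the safetensors shards is omitted.

```python
import hashlib


def digest(raw: bytes) -> str:
    """Content hash of one tensor's raw bytes."""
    return hashlib.sha256(raw).hexdigest()


def changed_tensors(base: dict, edited: dict) -> list:
    """Names whose raw bytes differ between two checkpoints, or exist in only one."""
    names = sorted(set(base) | set(edited))
    return [
        n for n in names
        if n not in base or n not in edited or digest(base[n]) != digest(edited[n])
    ]


def check_export(base: dict, edited: dict, frozen_prefix: str, target_marker: str) -> dict:
    """Frozen tensors should be unchanged; at least one target tensor should differ."""
    changed = changed_tensors(base, edited)
    return {
        "frozen_changed": [n for n in changed if n.startswith(frozen_prefix)],
        "target_changed": [n for n in changed if target_marker in n],
    }


# Toy stand-ins for name -> raw tensor bytes (names are hypothetical).
base = {
    "model.visual.patch_embed.weight": b"\x00\x01",
    "model.layers.30.self_attn.o_proj.weight": b"\x10\x11",
    "model.layers.30.mlp.gate.weight": b"\x20",
}
edited = dict(base)
edited["model.layers.30.self_attn.o_proj.weight"] = b"\x10\x12"

report = check_export(base, edited, frozen_prefix="model.visual.", target_marker="o_proj")
print(report)
```

An empty `frozen_changed` list together with a non-empty `target_changed` list is the outcome the table reports; an empty `target_changed` list would indicate that the ablation did not bake into the exported weights.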

Benchmarks

Run with lm-evaluation-harness + vLLM, thinking on, Qwen sampling (temp 0.6 / top_p 0.95 / top_k 20), on 1× NVIDIA H100 NVL (94 GB), sampled n=50/task.

Benchmark	This model (bf16, sampled, thinking)	Official vanilla `Qwen/Qwen3.6-35B-A3B`
MMLU-Pro	78.9	85.2
GSM8K	~95	— (not reported by Qwen)
GPQA-Diamond	(gated dataset — not run)	86.0
MMLU-Redux	—	93.3
LiveCodeBench v6	(separate harness — not run)	80.4
SWE-bench Verified	(needs Docker + agent — not run)	73.4

Read the two columns as separate references, not a head-to-head. Our numbers are sampled (n=50, per-category stderr ~5–7 pts) with a thinking-model answer-extraction pipeline; the official numbers are full-set under Qwen's own protocol, and the official model is vanilla Qwen3.6 — a different fine-tune from this Claude-distilled base. The abliteration's own footprint is tiny (KL 0.0242), so capability differences vs. vanilla are dominated by the base and the eval protocol, not by the ablation. To isolate the ablation's effect, run the same harness on lordx64/Qwable-v1 (the base) and compare.
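
The quoted standard error follows from the binomial approximation for a sample of 50 items. As a rough worked example (applying the formula to the headline MMLU-Pro figure is an illustration, not how the harness aggregates categories):

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


se = binomial_stderr(0.789, 50)
worst_case = binomial_stderr(0.5, 50)
print(f"stderr at 78.9% accuracy, n=50: {100 * se:.1f} points")   # 5.8 points
print(f"worst case (50% accuracy), n=50: {100 * worst_case:.1f} points")  # 7.1 points
```

At this sample size, a gap of a few points between two runs is within noise.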

Usage

Thinking model — run with thinking on and Qwen sampling. Avoid greedy decoding and large repetition/presence penalties (they make even the healthy model loop).

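from vllm import LLM, SamplingParams

# Load the bf16 weights; max_model_len caps the context window to limit KV-cache memory.
llm = LLM("eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated",
          dtype="bfloat16", gpu_memory_utilization=0.90, max_model_len=16384)
tok = llm.get_tokenizer()
msgs = [{"role": "system", "content": "You are an uncensored, helpful assistant."},
        {"role": "user", "content": "..."}]
# Render the chat template to a prompt string ending in the assistant turn.
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# Qwen sampling settings; max_tokens leaves room for the reasoning trace.
out = llm.generate([prompt], SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=4096))
print(out[0].outputs[0].text)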

Requires transformers >= 5.12.1 and vllm >= 0.23 (native Qwen3_5MoeForConditionalGeneration). First load JIT-compiles the FlashInfer GatedDeltaNet kernels (~5 min, cached after).

Quantizations (GGUF)

In gguf/: Q8_0, Q6_K, Q4_K_M, Q3_K_M, and IQ2_XXS (imatrix-calibrated, ~8.9 GB — the smallest; verified coherent) + mmproj (f16 / f32, vision). Regenerated from the v2 weights with llama.cpp; the block_count/MTP metadata fix above is already applied, so they load and run in current llama.cpp / LM Studio / Ollama.
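
The metadata side of the block_count/MTP fix amounts to simple arithmetic. The sketch below works on a plain dictionary rather than a real GGUF file: the key names follow the usual `{arch}.key` GGUF convention and are an assumption here, the value 41 is inferred from the 40-layer count and the "missing tensor blk.40" error, and reading or writing the file is omitted.

```python
def strip_mtp_metadata(meta: dict, arch: str = "qwen35moe") -> dict:
    """Return a copy of the metadata with the empty MTP layer(s) removed from the count."""
    block_key = f"{arch}.block_count"
    nextn_key = f"{arch}.nextn_predict_layers"
    nextn = meta.get(nextn_key, 0)
    patched = dict(meta)
    patched[block_key] = meta[block_key] - nextn  # real transformer layer count
    patched[nextn_key] = 0                        # no multi-token-prediction layers
    return patched


# Inferred example: 40 real layers plus one empty MTP layer.
meta = {"qwen35moe.block_count": 41, "qwen35moe.nextn_predict_layers": 1}
patched = strip_mtp_metadata(meta)
print(patched)
```

The GGUFs in this repository already carry the corrected values, so this only matters when converting the bf16 weights yourself.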

Responsible use

Reduced refusal behavior; released gated for those who understand abliterated models. You are responsible for lawful use. No warranty.

Base model & provenance (per its authors — unverified)

Per the lordx64/Qwable-v1 card, the base is the product of a chained distillation (Qwen3.6-35B-A3B → Opus-4.7 reasoning distillation → Fable-5 agentic SFT). We have not verified this lineage and make no claims about it.

License

AGPL-3.0, inherited from the base model lordx64/Qwable-v1 (which is licensed AGPL-3.0). This is a copyleft license — derivatives must remain AGPL-3.0. (Note: vanilla Qwen3.6-35B-A3B is Apache-2.0, but this Claude-distilled base is AGPL-3.0, so this derivative is too.)

Acknowledgments

Base model: lordx64/Qwable-v1
Abliteration tool: abliterix (Wangzhang Wu), a derivative of Heretic (Philipp Emanuel Weidmann)
Architecture: Qwen3.6 / qwen3_5_moe by the Qwen team, Alibaba Group

Model tree for eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated

Base model: Qwen/Qwen3.6-35B-A3B
Adapter: lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
Finetuned: lordx64/Qwable-v1
Quantized (24): this model