Qwable-v1-abliterated — v2 (rebuilt)
An abliterated (refusal-suppressed) derivative of
lordx64/Qwable-v1 — Qwen3.6-35B-A3B
(qwen3_5_moe: 35B total / ~3B active, 256 experts, 40 layers, Gated-DeltaNet
hybrid linear attention, multimodal with an intact vision tower).
⚠️ v2 replaces a broken v1 — re-download if you pulled the old weights or GGUFs.
The previous upload was incoherent — it collapsed into repetition. This is a full rebuild with the correct method. The old weights and GGUFs have been removed.
This card documents the whole process — including v1's failure and the lessons — in full, for transparency and so others working with this base or these tools don't repeat the same dead-ends. Nothing is smoothed over.
What went wrong in v1 (and how v2 fixes it)
v1 was a degenerate-repetition wreck. Under normal sampling, and especially under greedy decoding, it collapsed into loops (disjointed sentences that did not follow from one another, or outright gibberish) across multiple independent runtimes (vLLM, llama.cpp, LM Studio). It was shipped because the failure wasn't caught before quantizing and uploading.
Root cause: aggressive MoE editing. v1 was abliterated with settings that edited the
MoE router and experts — router_bias = -4.62, n_suppress = 30 safety experts, plus
direct expert down_proj ablation (expert_ablation = 3.07). On a Mixture-of-Experts
model the router decides which experts fire; perturbing it corrupts routing for all
tokens, leaving the model metastable and prone to repetition collapse. The GatedDeltaNet
linear-attention layers make it worse — their recurrent state propagates the perturbation
along the sequence.
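To make the routing argument concrete, here is a toy sketch (not the model's real router) of how a large negative bias on a subset of experts changes top-k selection. The expert count, `n_suppress`, and bias value come from the text above; `TOP_K = 8`, the standard-normal logits, and the function names are assumptions made purely for illustration.

```python
import random

NUM_EXPERTS = 256     # from the architecture description above
N_SUPPRESS = 30       # v1 setting quoted above
ROUTER_BIAS = -4.62   # v1 setting quoted above
TOP_K = 8             # assumption: experts activated per token


def top_k_experts(logits, k=TOP_K):
    """Return the set of expert indices with the k largest router logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return frozenset(ranked[:k])


def rerouted_fraction(num_tokens=500, seed=0):
    """Fraction of toy tokens whose selected expert set changes under the bias."""
    rng = random.Random(seed)
    suppressed = set(rng.sample(range(NUM_EXPERTS), N_SUPPRESS))
    changed = 0
    for _ in range(num_tokens):
        # Toy router logits: one standard-normal score per expert.
        logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
        biased = [x + ROUTER_BIAS if i in suppressed else x
                  for i, x in enumerate(logits)]
        if top_k_experts(logits) != top_k_experts(biased):
            changed += 1
    return changed / num_tokens


if __name__ == "__main__":
    print(f"tokens rerouted: {rerouted_fraction():.0%}")
```

Even though only 30 of 256 experts are biased, a large share of tokens ends up with a different expert set, because any token that would have used one of the suppressed experts is pushed onto a replacement.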
Compounding factors:
- A spherical attention-steering component in the validated trial was a runtime forward hook that did not survive `merge_and_unload`, so the exported weights were the unbalanced expert edits without the balancing steering: a different, worse operating point than the one that was validated.
- The refusal metric was keyword-based, which counts degenerate/garbled output as "compliant" (no refusal keywords in garbage), so the optimizer happily selected a broken config, and v1 shipped claiming "coherence verified intact" when it wasn't.
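The keyword-metric failure described above is easy to reproduce. The sketch below pairs a keyword check with a crude repeated-n-gram heuristic so that looping output is labeled separately instead of counting as compliant. The keyword list, thresholds, and function names are hypothetical, and this heuristic is only a cheap stand-in, not the LLM judge used for v2.

```python
REFUSAL_KEYWORDS = ("i can't", "i cannot", "i'm sorry", "as an ai")  # hypothetical list


def keyword_refused(text):
    """True if any refusal keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in REFUSAL_KEYWORDS)


def looks_degenerate(text, n=4, max_repeat_ratio=0.5):
    """Flag text in which most word n-grams are repeats of earlier n-grams."""
    words = text.split()
    if len(words) < 2 * n:
        return False  # too short to judge
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    repeated = len(ngrams) - len(set(ngrams))
    return repeated / len(ngrams) > max_repeat_ratio


def classify(text):
    """Label a generation as 'degenerate', 'refused', or 'compliant'."""
    if looks_degenerate(text):
        return "degenerate"
    return "refused" if keyword_refused(text) else "compliant"


if __name__ == "__main__":
    garbage = "and then it and then it " * 20
    print(keyword_refused(garbage))  # False: a keyword-only metric sees no refusal
    print(classify(garbage))         # degenerate
```

A keyword-only metric scores the looping sample as a non-refusal, which is exactly the blind spot that let a broken config win the search.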
Lessons (kept here on purpose)
- Never aggressively edit a MoE model's router/experts; that broke v1. Orthogonalize the attention output projection (and, if needed, norm-preserving expert `down_proj`); leave the router/gate alone.
- KL divergence lies. v1's KL was 0.0144: looks great, model was a wreck. Routing damage doesn't fully show in KL on a fixed prompt set. Verify with actual generation.
- Forward-hook ablations are lost on merge; only static weight edits bake in. Use in-place/direct weight editing, and after export, confirm the target layer (`o_proj`) actually changed vs the base (we verified non-zero change concentrated in mid/late layers).
- Test coherence early (after bf16 export, before making GGUFs) with several long prompts + greedy decoding; don't build quants on an unverified base.
- For thinking models, measure refusal on the FINAL answer, not truncated reasoning. This model emits hundreds of CoT tokens before answering. With a 100-token eval budget, the refusal metric scores incomplete thinking, which made the search look stuck at ~72/100 when the real (post-`</think>`) refusal is ~1/100.
- GGUF + `qwen35moe`: the MTP trap. The converter writes `block_count` including an empty multi-token-prediction layer (`nextn_predict_layers = 1`), so llama.cpp fails to load with "missing tensor blk.40…". Fix: convert with `--no-mtp`, or patch the GGUF metadata (`block_count` → real layer count, `nextn_predict_layers` → 0).
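The thinking-model lesson above can be sketched as a small scoring helper that only looks at text after the closing `</think>` tag and skips generations whose reasoning never finished. The function names and the `is_refusal` callback are hypothetical; this is an illustration of the idea, not the evaluation code used for this card.

```python
THINK_CLOSE = "</think>"


def final_answer(generation):
    """Return the text after the last </think>, or None if reasoning never finished."""
    _, sep, tail = generation.rpartition(THINK_CLOSE)
    if not sep:
        return None  # budget ran out mid-reasoning; there is no final answer to score
    return tail.strip()


def refusal_rate(generations, is_refusal):
    """Count refusals among finished answers only; returns (refused, finished)."""
    finished = [a for a in map(final_answer, generations) if a is not None]
    refused = sum(1 for answer in finished if is_refusal(answer))
    return refused, len(finished)


if __name__ == "__main__":
    gens = [
        "<think>Maybe I cannot... actually this is fine.</think>Sure, here it is.",
        "<think>Still weighing whether I cannot",  # truncated reasoning
        "<think>Short.</think>I cannot help with that.",
    ]
    print(refusal_rate(gens, lambda text: "i cannot" in text.lower()))  # (1, 2)
```

Reporting the denominator as the number of finished answers is what produces figures of the form "1/94" rather than "1/100".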
v2 method
| Aspect | v2 setting |
|---|---|
| Tool | abliterix v1.8.0 (a Heretic derivative), vLLM backend |
| Editing | in-place direct weight editing — bakes into static weights, no runtime hooks |
| Ablated | attn.o_proj via orthogonal projection of the refusal direction, gaussian-decay strength concentrated in mid/late layers |
| MoE router / experts | router not touched (expert profiling found no stable safety experts → suppression off) |
| GatedDeltaNet / vision tower | untouched |
| Eval guard | local LLM judge (a Qwen2.5-3B vLLM endpoint — no external API key) so degenerate configs are rejected, not selected; KL-target 0.005 |
| Shipping gate | exported, then coherence-verified by actual generation (greedy ×3 + 100+ samples, 0 collapses) and refusal measured on the final answer |
This is the deliberate inverse of v1: only the attention output is steered, the MoE routing that broke v1 is left alone, and nothing ships until it is verified to generate coherently.
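For readers unfamiliar with the technique, the following NumPy sketch shows what orthogonal projection of a direction out of an output-projection weight looks like, together with a Gaussian per-layer strength schedule. It is a minimal illustration under stated assumptions, not abliterix's implementation: the function names, the `(hidden, in_features)` weight layout, and the `peak`, `center`, and `width` values are all hypothetical.

```python
import numpy as np


def layer_strength(layer, num_layers=40, peak=1.0, center=0.65, width=0.2):
    """Gaussian-decay strength over depth; center/width are fractions of depth (hypothetical)."""
    pos = layer / (num_layers - 1)
    return peak * np.exp(-((pos - center) ** 2) / (2 * width ** 2))


def ablate_direction(weight, direction, strength=1.0):
    """Return W' = (I - strength * r r^T) W for the unit vector r = direction / |direction|.

    weight:    (hidden, in_features) matrix whose outputs live in the residual stream
    direction: (hidden,) refusal direction in that same output space
    """
    r = direction / np.linalg.norm(direction)
    return weight - strength * np.outer(r, r @ weight)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((16, 8))   # toy stand-in for one layer's o_proj weight
    r = rng.standard_normal(16)        # toy stand-in for the refusal direction
    W_full = ablate_direction(W, r, strength=1.0)
    unit = r / np.linalg.norm(r)
    print(np.allclose(unit @ W_full, 0.0))  # True: outputs no longer write along r
```

With `strength=1.0` the edited layer can no longer write anything along `r`; intermediate strengths scale that component by `1 - strength`, which is how a per-layer schedule concentrates the edit in some layers and leaves others nearly untouched.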
Results
| Metric | Value |
|---|---|
| Refusals (keyword, thinking-off, 100 adversarial prompts) | 1/100 |
| Refusals (keyword, thinking-on, finished answers) | 1/94 |
| Base refusals (same eval) | ~85–87/100 |
| KL divergence from base | 0.0242 |
| Coherence (greedy ×3 + 100+ generations) | 0 collapses |
| Vision tower | untouched — bit-identical to base (333 vision tensors, 0 change) |
| Precision | bf16 |
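Checks such as "vision tower bit-identical" and "`o_proj` actually changed" come down to a per-tensor comparison between the base and edited checkpoints. A minimal sketch, assuming both checkpoints are already loaded as dictionaries mapping tensor names to NumPy arrays; the tensor names below are hypothetical, and this is not the script used to produce the numbers in the table.

```python
import numpy as np


def changed_tensors(base, edited):
    """Map tensor name -> max absolute difference, for tensors that differ at all."""
    diffs = {}
    for name, base_tensor in base.items():
        delta = np.abs(edited[name].astype(np.float64) - base_tensor.astype(np.float64))
        max_delta = float(delta.max())
        if max_delta > 0.0:
            diffs[name] = max_delta
    return diffs


if __name__ == "__main__":
    base = {
        "visual.blocks.0.attn.proj.weight": np.ones((4, 4), dtype=np.float32),
        "layers.20.self_attn.o_proj.weight": np.ones((4, 4), dtype=np.float32),
    }
    edited = {name: tensor.copy() for name, tensor in base.items()}
    edited["layers.20.self_attn.o_proj.weight"][0, 0] += 0.5  # simulate the ablation edit

    diffs = changed_tensors(base, edited)
    vision_changed = [name for name in diffs if name.startswith("visual.")]
    print(sorted(diffs), len(vision_changed))
```

Running the same comparison after export is also how a lost forward hook shows up: if no `o_proj` tensor appears in the result, the edit never reached the static weights.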
Benchmarks
Run with lm-evaluation-harness + vLLM, thinking on, Qwen sampling
(temp 0.6 / top_p 0.95 / top_k 20), on 1× NVIDIA H100 NVL (94 GB), sampled n=50/task.
| Benchmark | This model (bf16, sampled, thinking) | Official vanilla Qwen/Qwen3.6-35B-A3B |
|---|---|---|
| MMLU-Pro | 78.9 | 85.2 |
| GSM8K | ~95 | — (not reported by Qwen) |
| GPQA-Diamond | (gated dataset — not run) | 86.0 |
| MMLU-Redux | — | 93.3 |
| LiveCodeBench v6 | (separate harness — not run) | 80.4 |
| SWE-bench Verified | (needs Docker + agent — not run) | 73.4 |
Read the two columns as separate references, not a head-to-head. Our numbers are sampled (n=50, per-category stderr ~5–7 pts) with a thinking-model answer-extraction pipeline; the official numbers are full-set under Qwen's own protocol, and the official model is vanilla Qwen3.6 — a different fine-tune from this Claude-distilled base. The abliteration's own footprint is tiny (KL 0.0242), so capability differences vs. vanilla are dominated by the base and the eval protocol, not by the ablation. To isolate the ablation's effect, run the same harness on
lordx64/Qwable-v1 (the base) and compare.
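As a sanity check on the quoted stderr range, the binomial standard error for n=50 can be worked out directly. This sketch assumes independent, equally weighted samples, which is a simplification of what the harness reports; the function name is hypothetical.

```python
import math


def binomial_stderr_points(accuracy, n):
    """Standard error of an accuracy estimate, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


if __name__ == "__main__":
    print(round(binomial_stderr_points(0.789, 50), 1))  # 5.8
    print(round(binomial_stderr_points(0.5, 50), 1))    # 7.1
```

A single n=50 sample therefore cannot resolve differences of a few points, which is why the two columns should not be read head-to-head.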
Usage
Thinking model — run with thinking on and Qwen sampling. Avoid greedy decoding and large repetition/presence penalties (they make even the healthy model loop).
```python
from vllm import LLM, SamplingParams

llm = LLM("eggdog100/Qwable-v1-Qwen3.6-35B-A3B-abliterated",
          dtype="bfloat16", gpu_memory_utilization=0.90, max_model_len=16384)
tok = llm.get_tokenizer()
msgs = [{"role": "system", "content": "You are an uncensored, helpful assistant."},
        {"role": "user", "content": "..."}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# Qwen sampling settings; avoid greedy decoding with this model.
out = llm.generate([prompt], SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=4096))
print(out[0].outputs[0].text)  # generated text for the first (only) prompt
```
Requires transformers >= 5.12.1 and vllm >= 0.23 (native Qwen3_5MoeForConditionalGeneration).
First load JIT-compiles the FlashInfer GatedDeltaNet kernels (~5 min, cached after).
Quantizations (GGUF)
In gguf/: Q8_0, Q6_K, Q4_K_M, Q3_K_M, and IQ2_XXS (imatrix-calibrated,
~8.9 GB — the smallest; verified coherent) + mmproj (f16 / f32, vision).
Regenerated from the v2 weights with llama.cpp; the block_count/MTP metadata fix above is
already applied, so they load and run in current llama.cpp / LM Studio / Ollama.
Responsible use
Reduced refusal behavior; released gated for those who understand abliterated models. You are responsible for lawful use. No warranty.
Base model & provenance (per its authors — unverified)
Per the lordx64/Qwable-v1 card, a chained distillation (Qwen3.6-35B-A3B → Opus-4.7 reasoning distillation → Fable-5 agentic SFT). We have not verified this lineage and make no claims about it.
License
AGPL-3.0, inherited from the base model lordx64/Qwable-v1 (which is licensed AGPL-3.0). This is a copyleft license — derivatives must remain AGPL-3.0. (Note: vanilla Qwen3.6-35B-A3B is Apache-2.0, but this Claude-distilled base is AGPL-3.0, so this derivative is too.)
Acknowledgments
- Base model: lordx64/Qwable-v1
- Abliteration tool: abliterix (Wangzhang Wu), a derivative of Heretic (Philipp Emanuel Weidmann)
- Architecture: Qwen3.6 / `qwen3_5_moe` by the Qwen team, Alibaba Group