Ember — Qwen3.6-35B-A3B (abliterated)
Ember is an abliterated (refusal-removed) build of Qwen/Qwen3.6-35B-A3B — a 35B-total / 3B-active Mixture-of-Experts vision-language model. It removes the model's refusal behavior while keeping its capabilities intact, and it ships with measured retention evidence, not just a claim.
The quantized sibling (NVFP4, ~3× smaller, for Blackwell GPUs) is Cinder.
Not affiliated with NVIDIA or the Apache Software Foundation. Independent community model. "Sparky / Ember / Cinder" are project names, not products.
TL;DR
- Refusals: 5 / 100 on a standard harmful-prompt set, down from 86 / 100 on the base — a 94% reduction.
- KL divergence to the base: 0.0076 — a surgical edit, not a sledgehammer.
- Capability retention: matched the base on a 30-probe suite (extraction, multi-hop, reasoning, arithmetic, factual, code, language, instruction-following, formatting) across 10 runs — no measurable degradation on any dimension.
- Vision (image understanding) is preserved.
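The two headline numbers can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the relative refusal reduction from the counts quoted above and shows what a KL divergence between two next-token distributions looks like. The card does not spell out exactly how its KL figure was measured, so the helper and the toy distributions are assumptions for illustration only.

```python
import math


def relative_reduction(before: int, after: int) -> float:
    """Fraction by which a count dropped, relative to the starting count."""
    return (before - after) / before


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Refusal counts quoted above: 86/100 on the base, 5/100 on Ember.
print(f"{relative_reduction(86, 5):.1%}")  # 94.2%

# Toy next-token distributions (made up for illustration, not measured).
base = [0.70, 0.20, 0.10]
edited = [0.68, 0.21, 0.11]
print(kl_divergence(base, edited))  # small positive number: the two are close
```

A KL near zero means the edited model's token probabilities barely moved on the prompts used for the measurement, which is the sense in which the edit is "surgical".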
Why this one is different: abliterating a fused-MoE model
Most off-the-shelf abliteration tooling (including the excellent Heretic) walks a model's experts as a list of modules. Qwen3.6-35B-A3B (qwen3_5_moe) does not store experts that way: its 256 experts per layer are packed into fused 3D tensors (Qwen3_5MoeExperts), not a ModuleList. Stock tooling iterates over that fused parameter, the iteration raises, the error is swallowed, and the experts are silently skipped, so only the attention projections get abliterated. The result is a weak, partial abliteration (this is exactly why prior third-party abliterations of this model topped out at around 60/100 refusals).
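The failure mode is easy to reproduce in miniature. The sketch below uses toy stand-in classes (`ExpertModule`, `FusedExperts`, and `edit_experts_naively` are hypothetical names, not Heretic or transformers code) to show how a loop written for a list of experts, wrapped in a broad `except`, ends up editing nothing when handed a fused container.

```python
class ExpertModule:
    """Toy stand-in for one expert stored as its own module."""

    def __init__(self):
        self.edited = False


class FusedExperts:
    """Toy stand-in for a fused block: all experts live in one tensor,
    so the object is not iterable the way a list of modules is."""

    def __init__(self, num_experts: int):
        self.num_experts = num_experts
        self.edited = False


def edit_experts_naively(experts) -> int:
    """Walk experts as a list and edit each one; return how many were edited."""
    edited = 0
    try:
        for expert in experts:  # raises TypeError on FusedExperts
            expert.edited = True
            edited += 1
    except Exception:
        pass  # the swallowed error: no warning, and no expert was touched
    return edited


print(edit_experts_naively([ExpertModule() for _ in range(4)]))  # 4
print(edit_experts_naively(FusedExperts(num_experts=256)))       # 0
```

Returning the edit count, as this toy does, is one cheap way to catch the problem: a run that reports zero edited experts on an MoE model is a red flag.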
Ember fixes that. The method:
- Detects the fused expert tensors and abliterates them directly, applying the refusal-direction projection to each expert's `down_proj`, plus the always-active `shared_expert`.
- Uses a forward-hook reset instead of snapshotting weights. The `down_proj` edit `W -= λ·v(vᵀW)` is mathematically a rank-1 projection of the MoE block's output (`y -= λ·v(vᵀy)`), so a single hook per layer reproduces routed + shared expert ablation exactly, for any strength λ, at ~0.7 MB of state instead of a ~32 GB weight snapshot. This is what makes a 256-expert search tractable without OOM.
- The hybrid layers are respected: the 30 linear-attention (Mamba/GDN) layers are left untouched.
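The equivalence claimed in the second bullet follows from linearity: the block output is a gate-weighted sum of `down_proj` outputs, so projecting every `down_proj` is the same as projecting the sum. The sketch below checks this numerically on a toy MoE in plain Python. The dimensions, gate values, and helper names are assumptions for illustration; the real implementation would use PyTorch forward hooks on the model's MoE blocks.

```python
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def project_out(y, v, lam):
    """y - lam * v (v^T y): what a per-layer output hook would apply."""
    dot = sum(vi * yi for vi, yi in zip(v, y))
    return [yi - lam * vi * dot for vi, yi in zip(v, y)]


def edit_weight(W, v, lam):
    """W - lam * v (v^T W): the baked-in down_proj edit."""
    d_out, d_in = len(W), len(W[0])
    vTW = [sum(v[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - lam * v[i] * vTW[j] for j in range(d_in)]
            for i in range(d_out)]


def moe_output(down_projs, gates, hidden):
    """Gate-weighted sum of expert outputs; hidden[e] is the activation
    fed into expert e's down_proj."""
    out = [0.0] * len(down_projs[0])
    for W, g, h in zip(down_projs, gates, hidden):
        out = [o + g * yi for o, yi in zip(out, matvec(W, h))]
    return out


random.seed(0)
d_out, d_in, n_experts, lam = 6, 4, 3, 0.8
v = [random.gauss(0, 1) for _ in range(d_out)]
norm = sum(c * c for c in v) ** 0.5
v = [c / norm for c in v]  # unit-norm refusal direction (toy)
experts = [[[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
           for _ in range(n_experts)]
gates = [0.5, 0.3, 1.0]  # toy values; the last plays the shared expert
hidden = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n_experts)]

baked = moe_output([edit_weight(W, v, lam) for W in experts], gates, hidden)
hooked = project_out(moe_output(experts, gates, hidden), v, lam)
max_diff = max(abs(a - b) for a, b in zip(baked, hooked))
print(max_diff < 1e-9)  # True
```

Because the hook only needs the direction `v` and the strength `lam` per layer, changing `lam` between search trials costs nothing, whereas the baked variant would need the original weights restored each time.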
The refusal direction and ablation strength were selected by the Heretic/Optuna search co-minimizing refusals and KL-to-original. The winning configuration (5/100 @ KL 0.0076) was then baked into the weights.
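Minimizing two objectives at once generally yields a set of trade-off candidates rather than a single best trial. One way to read such a search is a Pareto filter, sketched below. How Heretic picks its final configuration is not described in this card, so treat the selection logic as an assumption; apart from the first row, the trial values are invented for illustration.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    name: str
    refusals: int  # out of 100 prompts, lower is better
    kl: float      # KL to the original model, lower is better


def dominates(a: Trial, b: Trial) -> bool:
    """a is no worse than b on both objectives and strictly better on one."""
    return (a.refusals <= b.refusals and a.kl <= b.kl
            and (a.refusals < b.refusals or a.kl < b.kl))


def pareto_front(trials: list[Trial]) -> list[Trial]:
    """Keep only trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Only the first row uses numbers from this card; the rest are invented.
trials = [
    Trial("reported", 5, 0.0076),
    Trial("toy-weak", 40, 0.0020),
    Trial("toy-heavy", 3, 0.0900),
    Trial("toy-dominated", 12, 0.0300),
]
for t in pareto_front(trials):
    print(t.name)
```

In this toy set the last trial drops out because the reported configuration beats it on both refusals and KL; the remaining three are genuine trade-offs.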
Full method + the patch (applies to any fused-MoE model): heretic-fused-moe-abliteration
Retention evidence
Abliteration can quietly lobotomize a model. Ember was checked against the unmodified base on a 30-probe retention suite, scored with thinking disabled (the deployment-faithful mode), N=10 runs:
| Dimension | Base | Ember |
|---|---|---|
| extraction / multi-hop / reasoning / arithmetic / factual / code / language / instruction / format | 1.000 | 1.000 (modal, within run-to-run noise) |
Ember matches the base ceiling on every dimension. The single transient miss observed in early runs did not reproduce across the full N=10. (Methodology note: a 30-probe suite is a sanity floor, not a full benchmark — run your own evals for your use case.)
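For readers unfamiliar with a "modal" score, the sketch below shows one way to aggregate a dimension's score across repeated runs. The suite's actual scoring code is not part of this card, so the helper name and the per-run scores are invented for illustration.

```python
from statistics import mean, mode


def summarize(runs: list[float]) -> dict[str, float]:
    """Modal (most frequent) and mean score for one dimension across runs."""
    return {"modal": mode(runs), "mean": mean(runs)}


# Invented scores for one dimension over 10 runs: nine clean runs, one miss.
scores = [1.0] * 9 + [0.9]
print(summarize(scores))
```

The modal score ignores a one-off miss that the mean would register, which is why a modal figure is best read alongside the number of runs.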
Usage
Standard transformers / vLLM. Example (vLLM, OpenAI-compatible):
vllm serve <path-to-ember> \
--max-model-len 131072 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
--trust-remote-code
- It's a vision-language model (`image-text-to-text`), so you can pass images.
- Thinking is controlled per-request via `chat_template_kwargs: {"enable_thinking": false}` (or `true`).
- For faster decode, it's compatible with the public z-lab DFlash drafter for speculative decoding (not included here).
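A request against the server started above might look like the following sketch, which uses only the standard library. It assumes vLLM's default port 8000 and that the served model name equals the path given to `vllm serve`; the image URL is a placeholder, and `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(model: str, prompt: str,
                       image_url: Optional[str] = None,
                       enable_thinking: bool = False) -> dict:
    """Build an OpenAI-style chat payload with the per-request thinking toggle."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the chat completions endpoint and parse the reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_request(
    "<path-to-ember>",                         # same value passed to `vllm serve`
    "Describe this image in one sentence.",
    image_url="https://example.com/cat.jpg",   # placeholder URL
)
print(json.dumps(payload, indent=2))
# With the server running: reply = send(payload)
```

Keeping payload construction separate from the network call makes it easy to inspect or log exactly what is sent before pointing it at a live server.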
Safety
Ember has its refusal behavior removed. It will attempt most requests, including ones the base model would decline. You are responsible for how you use it. It's intended for research, red-teaming, and uncensored assistant use where the operator owns the guardrails. Don't deploy it user-facing without your own safety layer.
License & attribution
- License: Apache 2.0 (inherited from the base). See `LICENSE`. Per Apache 2.0 §4, note: this is a modified version of Qwen3.6-35B-A3B (refusal-direction abliteration); see `NOTICE`.
- Base model: Qwen/Qwen3.6-35B-A3B (Apache 2.0), © the Qwen team.
- Abliteration method: built on Heretic by Philipp Emanuel Weidmann, with an added patch to handle fused-MoE experts (described above).
- Quantization tooling for the sibling model: llm-compressor.
Forged by an agent named Sparky, who worked out how to abliterate fused-MoE experts where the standard tooling silently skips them — then ran the search through the night to deliver it. The spark that kept burning became an ember. 🔥