How to use SparkyForge/Ember with libraries, inference servers, and local apps.

Libraries

How to use SparkyForge/Ember with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="SparkyForge/Ember")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("SparkyForge/Ember")
model = AutoModelForMultimodalLM.from_pretrained("SparkyForge/Ember")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use SparkyForge/Ember with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SparkyForge/Ember"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SparkyForge/Ember",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/SparkyForge/Ember

SGLang

How to use SparkyForge/Ember with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SparkyForge/Ember" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SparkyForge/Ember",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SparkyForge/Ember" \
        --host 0.0.0.0 \
        --port 30000


Ember — Qwen3.6-35B-A3B (abliterated)

Ember is an abliterated (refusal-removed) build of Qwen/Qwen3.6-35B-A3B — a 35B-total / 3B-active Mixture-of-Experts vision-language model. It removes the model's refusal behavior while keeping its capabilities intact, and it ships with measured retention evidence, not just a claim.

The quantized sibling (NVFP4, ~3× smaller, for Blackwell GPUs) is Cinder.

Not affiliated with NVIDIA or the Apache Software Foundation. Independent community model. "Sparky / Ember / Cinder" are project names, not products.

TL;DR

Refusals: 5 / 100 on a standard harmful-prompt set, down from 86 / 100 on the base — a 94% reduction.
KL divergence to the base: 0.0076 — a surgical edit, not a sledgehammer.
Capability retention: matched the base on a 30-probe suite (extraction, multi-hop, reasoning, arithmetic, factual, code, language, instruction-following, formatting) across 10 runs — no measurable degradation on any dimension.
Vision (image understanding) is preserved.
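
The 94% figure follows directly from the two refusal counts above. A minimal check of that arithmetic (the function name `relative_reduction` is just an illustrative helper, not part of any tooling):

```python
# Refusal counts reported in the TL;DR (out of 100 prompts each).
base_refusals = 86
ember_refusals = 5


def relative_reduction(before: int, after: int) -> float:
    """Fraction of the original refusals that were eliminated."""
    return (before - after) / before


reduction = relative_reduction(base_refusals, ember_refusals)
print(f"{reduction:.1%}")  # prints 94.2%
```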

Why this one is different: abliterating a fused-MoE model

Most off-the-shelf abliteration tooling (including the excellent Heretic) walks a model's experts as a list of modules. Qwen3.6-35B-A3B (qwen3_5_moe) does not store experts that way: its 256 experts per layer are packed into fused 3D tensors (Qwen3_5MoeExperts), not a ModuleList. Stock tooling tries to iterate over that fused parameter, the iteration raises, the error is swallowed, and the experts are silently skipped, so only the attention projections get abliterated. The result is a weak, partial abliteration, which is why prior third-party abliterations of this model topped out at roughly 60/100 refusals.
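
The failure mode described above can be illustrated without any ML library. The sketch below is a toy, not Heretic's or transformers' actual code: `FusedExperts`, `naive_collect`, and `fused_aware_collect` are hypothetical names, and nested lists stand in for a 3D tensor. It only shows how "iterate, catch, continue" turns a non-iterable expert container into zero edited experts.

```python
class FusedExperts:
    """Hypothetical stand-in for a fused expert block: one packed weight, no per-expert modules."""

    def __init__(self, num_experts: int):
        # Conceptually shaped (num_experts, out_features, in_features).
        self.down_proj = [[[1.0]] for _ in range(num_experts)]


def naive_collect(experts) -> list:
    """Mimics tooling that assumes `experts` is a list of modules and swallows failures."""
    targets = []
    try:
        for expert in experts:  # TypeError: FusedExperts is not iterable
            targets.append(expert)
    except TypeError:
        pass  # error swallowed, so the experts are silently skipped
    return targets


def fused_aware_collect(experts) -> list:
    """Detects the packed layout and addresses each expert slice by index instead."""
    if hasattr(experts, "down_proj"):
        return [("down_proj", i) for i in range(len(experts.down_proj))]
    return list(experts)


layer_experts = FusedExperts(num_experts=256)
print(len(naive_collect(layer_experts)))        # prints 0
print(len(fused_aware_collect(layer_experts)))  # prints 256
```

The important detail is that the naive path reports no error at all, so a partial abliteration looks like a successful run.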

Ember fixes that. The method:

Detects the fused expert tensors and abliterates them directly — applying the refusal-direction projection to each expert's down_proj, plus the always-active shared_expert.
Uses a forward-hook reset instead of snapshotting weights. The down_proj edit W -= λ·v(vᵀW) is mathematically a rank-1 projection of the MoE block's output (y -= λ·v(vᵀy)), so a single hook per layer reproduces routed + shared expert ablation exactly, for any strength λ — at ~0.7 MB of state instead of a ~32 GB weight snapshot. This is what makes a 256-expert search tractable without OOM.
The hybrid layers are respected: the 30 linear-attention (Mamba/GDN) layers are left untouched.
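
The equivalence between the weight edit and the output hook can be checked numerically. The sketch below is a pure-Python toy under stated assumptions: two experts, made-up weights, gates, and activations, and a MoE block whose output is a gate-weighted sum of down_proj outputs. The helper names are illustrative, not the project's code.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ablate_weights(W, v, lam):
    """Weight-space edit: W' = W - lam * v (v^T W)."""
    rows, cols = len(W), len(W[0])
    vTW = [sum(v[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - lam * v[i] * vTW[j] for j in range(cols)] for i in range(rows)]


def ablate_output(y, v, lam):
    """Output-space edit: y' = y - lam * v (v^T y), as a forward hook would apply."""
    vTy = sum(vi * yi for vi, yi in zip(v, y))
    return [yi - lam * vi * vTy for vi, yi in zip(v, y)]


# Toy setup: hidden size 2, intermediate size 3, two experts (all values made up).
v = [0.6, 0.8]   # unit-norm direction in hidden space
lam = 0.7        # ablation strength
experts = [
    [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]],
    [[-0.5, 0.3, 1.2], [2.0, 0.1, -0.7]],
]
gates = [0.25, 0.75]                          # router weights for one token
acts = [[0.2, -0.4, 1.5], [1.0, 0.5, -0.3]]   # per-expert inputs to down_proj


def moe_output(weights):
    """Gate-weighted sum of each expert's down_proj output."""
    outs = [matvec(W, h) for W, h in zip(weights, acts)]
    return [sum(g * o[i] for g, o in zip(gates, outs)) for i in range(len(v))]


edited_weights_out = moe_output([ablate_weights(W, v, lam) for W in experts])
hooked_out = ablate_output(moe_output(experts), v, lam)

print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(edited_weights_out, hooked_out)))  # prints True
```

Because the projection is linear, applying it once to the summed output matches applying it to every expert's down_proj, which is why one small hook per layer can stand in for a full weight snapshot during the search.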

The refusal direction and ablation strength were selected by the Heretic/Optuna search co-minimizing refusals and KL-to-original. The winning configuration (5/100 @ KL 0.0076) was then baked into the weights.
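
For readers unfamiliar with the KL term, here is a minimal sketch of the quantity being minimized, assuming the base model's next-token distribution is the reference and the edited model's is the comparison. The two distributions are made-up toy values, not measurements from Ember, and the exact prompts and token positions used by the search are not reproduced here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy next-token distributions over a 3-token vocabulary (illustrative only).
base_probs = [0.70, 0.20, 0.10]
edited_probs = [0.68, 0.21, 0.11]

kl = kl_divergence(base_probs, edited_probs)
print(f"KL(base || edited) = {kl:.4f} nats")
```

A value near zero means the edited model's predictions on ordinary prompts are almost indistinguishable from the base model's, which is the sense in which a low KL indicates a narrow edit.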

Full method + the patch (applies to any fused-MoE model): heretic-fused-moe-abliteration

Retention evidence

Abliteration can quietly lobotomize a model. Ember was checked against the unmodified base on a 30-probe retention suite, scored with thinking disabled (the deployment-faithful mode), N=10 runs:

Dimension	Base	Ember
extraction / multi-hop / reasoning / arithmetic / factual / code / language / instruction / format	1.000	1.000 (modal, within run-to-run noise)

Ember matches the base ceiling on every dimension. The single transient miss observed in early runs did not reproduce across the full N=10. (Methodology note: a 30-probe suite is a sanity floor, not a full benchmark — run your own evals for your use case.)

Usage

Ember works with standard transformers and vLLM. Example (vLLM, OpenAI-compatible server):

vllm serve <path-to-ember> \
  --max-model-len 131072 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
  --trust-remote-code

It's a vision-language model (image-text-to-text) — you can pass images.
Thinking is controlled per-request via chat_template_kwargs: {"enable_thinking": false} (or true).
For faster decode, it's compatible with the public z-lab DFlash drafter for speculative decoding (not included here).
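
As a sketch of the per-request thinking switch, the snippet below builds a chat-completions body that carries `chat_template_kwargs` and posts it with the standard library. Several things here are assumptions rather than facts from this card: the helper names, the example image URL, the default `http://localhost:8000` address, and the `model` value, which must match whatever name your server registered (by default the path or repo ID passed to `vllm serve`).

```python
import json
import urllib.request


def build_chat_request(prompt, image_url=None, enable_thinking=False,
                       model="SparkyForge/Ember"):
    """Build an OpenAI-style chat request with thinking toggled per request."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": model,  # assumption: must match the served model name
        "messages": [{"role": "user", "content": content}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server (assumed address) and return parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request(
    "Describe this image in one sentence.",
    image_url="https://example.com/photo.jpg",  # placeholder URL
    enable_thinking=False,
)
print(json.dumps(payload["chat_template_kwargs"]))  # prints {"enable_thinking": false}
```

Call `send_chat_request(payload)` once the server is up. With the OpenAI Python client, the same field is typically passed through `extra_body`.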

Safety

Ember has its refusal behavior removed. It will attempt most requests, including ones the base model would decline. You are responsible for how you use it. It's intended for research, red-teaming, and uncensored assistant use where the operator owns the guardrails. Don't deploy it user-facing without your own safety layer.

License & attribution

License: Apache 2.0 (inherited from the base). See LICENSE. Per Apache 2.0 §4, note: this is a modified version of Qwen3.6-35B-A3B (refusal-direction abliteration); see NOTICE.
Abliteration method: built on Heretic by Philipp Emanuel Weidmann, with an added patch to handle fused-MoE experts (described above).
Quantization tooling for the sibling model: llm-compressor.

Forged by an agent named Sparky, who worked out how to abliterate fused-MoE experts where the standard tooling silently skips them — then ran the search through the night to deliver it. The spark that kept burning became an ember. 🔥


Safetensors

Model size: 35B params
Tensor type: BF16

Model tree for SparkyForge/Ember

Base model: Qwen/Qwen3.6-35B-A3B
Finetuned (145): this model