Cinder — Qwen3.6-35B-A3B (abliterated, NVFP4)
Cinder is the NVFP4 quantization of Ember — the abliterated (refusal-removed) build of Qwen/Qwen3.6-35B-A3B. Same surgical abliteration, 3× smaller: **22 GB** vs ~66 GB for the BF16 Ember.
For the full method writeup, retention evidence, and the BF16 weights, see Ember. The patch + method: heretic-fused-moe-abliteration.
Not affiliated with NVIDIA or the Apache Software Foundation. Independent community model.
What it is
- Format: NVFP4 via compressed-tensors / llm-compressor. FP4 weights with FP8 block scales, NVFP4 activation scheme.
- Hardware: needs an NVIDIA Blackwell GPU (sm_120 / sm_121 — e.g. RTX 50-series, DGX Spark / GB10) and a recent vLLM with NVFP4 support. It will not run on older GPUs. If you're on anything pre-Blackwell, use Ember (BF16) and quantize to your own format.
- ~22 GB on disk — fits comfortably in the DGX Spark's unified memory with room for a long context and a speculative drafter.
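The size figures can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes the usual NVFP4 layout (4-bit values plus one 8-bit FP8 scale per 16-element block, so 4.5 bits per weight) and takes the nominal 35B parameter count from the model name. The `quantized_fraction` argument is a hypothetical knob for illustration, not a number measured from this checkpoint.

```python
def nvfp4_bits_per_weight(block_size: int = 16, scale_bits: int = 8) -> float:
    """Effective bits per weight: a 4-bit value plus an amortized per-block scale."""
    return 4 + scale_bits / block_size


def estimate_size_gb(total_params: float, quantized_fraction: float) -> float:
    """Rough checkpoint size in GB (1e9 bytes).

    quantized_fraction is the share of parameters stored as NVFP4;
    the remainder is assumed to stay in BF16 (16 bits per weight).
    """
    quantized_bits = total_params * quantized_fraction * nvfp4_bits_per_weight()
    bf16_bits = total_params * (1.0 - quantized_fraction) * 16
    return (quantized_bits + bf16_bits) / 8 / 1e9


if __name__ == "__main__":
    total = 35e9  # assumption: nominal parameter count taken from the model name
    print(f"BF16 only:     {estimate_size_gb(total, 0.0):.1f} GB")
    print(f"NVFP4 only:    {estimate_size_gb(total, 1.0):.1f} GB")
    print(f"95% NVFP4 mix: {estimate_size_gb(total, 0.95):.1f} GB")  # hypothetical split
```

Under these assumptions the all-BF16 estimate is in the same range as the ~66 GB quoted for Ember, and a mostly-NVFP4 mix with a small BF16 remainder lands near the 22 GB on disk.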
Quantization details (and what was deliberately not quantized)
The fused MoE experts are FP4-packed; the hybrid layers are preserved in BF16. Verified post-quant:
- 30,720 expert weight tensors FP4-packed, 0 experts silently left in BF16 (the fused-expert handling carried through quantization).
- The 30 linear-attention (Mamba/GDN) layers stayed BF16: quantizing them breaks the model, so they're in the ignore list (`linear_attn`, `mlp.gate`, `shared_expert_gate`, `embed_tokens`, `lm_head`, vision tower).
- Quant scales clean, no NaNs.
Quant recipe ships in recipe.yaml.
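To illustrate how an ignore list of this kind keeps some modules in BF16 while the experts are quantized, here is a minimal sketch that classifies module names by matching dot-separated components. The module names and the matching rule are hypothetical; the actual patterns are whatever `recipe.yaml` specifies, and llm-compressor's own matching may differ. The vision tower is left out because its module name is not given here.

```python
# Ignore entries taken from the list above (vision tower omitted, name unknown).
IGNORE = ["linear_attn", "mlp.gate", "shared_expert_gate", "embed_tokens", "lm_head"]


def matches(name: str, pattern: str) -> bool:
    """True if the pattern's dotted components appear consecutively in the name."""
    parts = name.split(".")
    pat = pattern.split(".")
    n = len(pat)
    return any(parts[i:i + n] == pat for i in range(len(parts) - n + 1))


def split_modules(names: list[str], ignore: list[str]) -> tuple[list[str], list[str]]:
    """Return (kept_in_bf16, quantized) according to the ignore list."""
    kept = [n for n in names if any(matches(n, p) for p in ignore)]
    quantized = [n for n in names if n not in kept]
    return kept, quantized


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    modules = [
        "model.embed_tokens",
        "model.layers.0.linear_attn.in_proj",
        "model.layers.0.mlp.gate",
        "model.layers.0.mlp.shared_expert_gate",
        "model.layers.0.mlp.experts.7.down_proj",
        "model.layers.3.self_attn.q_proj",
        "lm_head",
    ]
    kept, quantized = split_modules(modules, IGNORE)
    print("BF16:", kept)
    print("NVFP4:", quantized)
```

Matching whole components rather than raw substrings keeps `mlp.gate` (the router) from accidentally catching names such as `mlp.gate_proj`.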
Usage (vLLM, Blackwell)
vllm serve <path-to-cinder> \
--quantization compressed-tensors \
--max-model-len 131072 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
--trust-remote-code
- Vision-language (`image-text-to-text`): image input works; the vision tower is BF16, untouched by quantization.
- Thinking can be disabled per request via `chat_template_kwargs: {"enable_thinking": false}`.
- Pairs with the public z-lab DFlash drafter for ~1.5× decode speedup via speculative decoding (drafter not included).
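As a sketch of the per-request thinking toggle, the snippet below builds an OpenAI-compatible chat request carrying `chat_template_kwargs` and posts it to a local vLLM server. The base URL (`http://localhost:8000`, vLLM's default port), the helper names, and the served model name are assumptions; pass whatever model name your `vllm serve` instance reports.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(model: str, prompt: str, enable_thinking: bool = False,
                       image_url: Optional[str] = None) -> dict:
    """Build a /v1/chat/completions payload with a per-request thinking toggle."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # Optional image part for vision-language requests.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        # Passed through to the chat template by the server.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running server and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

`build_chat_request` has no side effects, so the payload can be inspected or logged before `send_chat_request` is called against a running server.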
Safety
Refusal behavior is removed (same as Ember). You own the guardrails. Research / red-team / operator-controlled use.
License & attribution
- License: Apache 2.0 (inherited from base). See `LICENSE`/`NOTICE`. Modified from Qwen3.6-35B-A3B (abliteration + NVFP4 quantization).
- Base: Qwen/Qwen3.6-35B-A3B (Apache 2.0), © the Qwen team.
- Abliteration: built on Heretic (Philipp Emanuel Weidmann) + a fused-MoE patch (see Ember).
- Quantization: llm-compressor (NVFP4).
The smaller, hardier cousin of Ember — forged by Sparky on a DGX Spark. A cinder: what's left when the ember has done its work, and it still burns. 🔥