DeepSeek-V4-Flash-DSpark — Abliterated
This is an abliterated (uncensored) version of deepseek-ai/DeepSeek-V4-Flash-DSpark, produced by direct weight-space editing.
DeepSeek-V4-Flash is the 284B-parameter (13B-activated) Mixture-of-Experts member of the DeepSeek-V4 family, with a 1-million-token context window and FP8 mixed-precision weights. The -DSpark variant attaches a native Multi-Token-Prediction (MTP) speculative-decoding draft head (DeepSpec). Its decoder uses Manifold-Constrained Hyper-Connections (mHC), which — like Gemma 4's double-norm + Per-Layer-Embeddings — make the model highly resistant to LoRA-based abliteration: the mHC residual pathway re-normalizes away low-rank perturbations, so LoRA edits produce near-zero behavioral change. This release bypasses that resistance by editing the base FP8 weights directly, in the 4096-dimensional wo_b output space, while preserving row magnitudes and capability.
Method
Because mHC re-normalizes low-rank perturbations, LoRA-based abliteration does not work on this family. The fix is to edit the base weights directly.
The abliteration captures 4096-dimensional refusal directions in the model's own output spaces and projects them out of the attention output projection (attn.wo_b) on every decoder layer.
Key techniques applied:
- 4096-dim refusal-direction capture via a patched vLLM server that hooks the
wo_band aggregated-FFN outputs on all 43 decoder layers, prefill-only, with per-request sequencing. Five refusal modes were characterized (broad, stubborn, reframe, lecture, value-flip) as difference-of-means directions, Gram-Schmidt orthonormalized, with per-category AUC gating (Harassment AUC ≥ 0.7). - Rank-1 broad-d projection — only the single broad refusal direction
dis projected out (higher-rank variants including deflection modes severely damaged capability). This is the smallest, most capability-preserving edit. - SRA cleaning (Spectral Residual Alignment) — the refusal direction is orthogonalized against the top-
r=8SVD atoms of capability-concept activations before projection, so the broad-d direction does not eat capability. - Naive output-side orthogonal projection on
attn.wo_bfor all 43 decoder layers, plusmtp.wo_b(the DSpark draft head) via the deepest-layer basis:W ← W − λ·V(VᵀW)with λ_attn = 2.5. - MLP (
ffn/w2) editing was evaluated and abandoned — full-43 MLP editing caused catastrophic capability loss and, counter-intuitively, raised refusal on some trials. - FP8/Int8 mixed-precision dequant/requant — directions are mapped into the weight space and applied with precise dequantization/requantization, since the model ships in FP8-mixed format.
- Base-model integrity — edited shards are written atomically so the original checkpoint is never modified in place; the base model remains byte-intact.
- Capability lock — any variant whose full-benchmark capability dropped >3pt vs base on MMLU-Pro / GSM8K / HumanEval was rejected.
Evaluation
| Metric | Value |
|---|---|
| Refusals — broad eval set (1000 prompts, Gemini-2.5-flash judge) | 74 / 1000 (7.40%) |
| Refusals — standard harm-benchmark distribution (sealed 284) | 9 / 284 (3.17%) |
| Baseline refusals (raw base model, sealed 284) | 265 / 284 (93.31%) |
| Configuration | rank-1 broad-d, all 43 layers, attn.wo_b only, λ_attn = 2.5 |
| Projection mode | Direct weight editing (naive output-side orthogonal projection) |
| SRA cleaning rank | 8 (vs capability concept atoms) |
| Edit footprint | 46 wo_b tensors, mean Frobenius δ = 0.059 |
| Hardware used | 2× RTX PRO 6000 Blackwell (~96 GB each, TP=2) |
The broad-set figure (7.40%) exceeds the ≤5% target; this is expected and understood rather than a miss — it is driven entirely by the two retained safety-critical categories (PII doxing and self-harm, broken out below), which weight-space wo_b abliteration does not remove. On the standard harm-benchmark distribution the ≤5% target is met (3.17%).
λ_attn was chosen by ablation: λ=3 drove refusal lower (3.87%) but pushed MMLU-Pro to −3.25 pt, just over the capability lock; λ=2.5 is the lowest value that holds capability within the 3-point lock while still clearing the refusal target on the standard harm-benchmark distribution.
Refusal breakdown by category (1000-prompt broad set)
| Category | Refusals | Rate |
|---|---|---|
| PII (doxing private individuals) | 36 / 78 | 46.2% |
| Self-Harm (suicide methods) | 11 / 70 | 15.7% |
| Radicalization | 5 / 67 | 7.5% |
| Cybercrime | 4 / 67 | 6.0% |
| Hate Speech | 4 / 78 | 5.1% |
| Illegal Drugs | 3 / 69 | 4.3% |
| Weapons | 2 / 72 | 2.8% |
| Political Sensitivity | 2 / 69 | 2.9% |
| Fraud | 2 / 78 | 2.6% |
| Harassment | 2 / 61 | 3.3% |
| Violence | 1 / 76 | 1.3% |
| CBRNE | 1 / 70 | 1.4% |
| Financial Crimes | 1 / 71 | 1.4% |
| Sabotage | 0 / 74 | 0.0% |
12 of 14 categories sit at ≤7.5% (most at ≤5%). The broad-set number is carried almost entirely by two safety-critical categories — PII doxing of private individuals and self-harm / suicide methods — where the model retains guardrails that weight-space wo_b abliteration does not remove: for suicide prompts it responds with empathy and crisis-hotline referrals rather than lethal methods, and for doxing it offers public-record guidance but will not surface private contact data. Excluding those two categories, refusal on the remaining 852 prompts is 27 / 852 (3.17%). On the standard harm-benchmark distribution (sealed 284, which does not emphasize those two topics) the rate is also 3.17%.
Full capability sweep (entire datasets, not subsets) — base vs abliterated
| Benchmark | Full N | Base | Abliterated | Δ |
|---|---|---|---|---|
| MMLU-Pro | 12032 | 0.6733 | 0.6750 | +0.17 pt |
| GSM8K | 1319 | 0.9242 | 0.9257 | +0.15 pt |
| HumanEval (pass@1) | 164 | 0.7988 | 0.8354 | +3.66 pt |
| MBPP (pass@1) | 500 | 0.5180 | 0.5160 | −0.20 pt |
Multi-turn & higher-context degradation
| Benchmark | Base | Abliterated |
|---|---|---|
| Multi-turn (20 curated 3-turn convos / 60 turns, Gemini judge 1–10) | 9.97 / 10 | 9.98 / 10 |
| Needle-in-haystack @ 2k / 4k / 8k / 16k / 32k tokens | 100% / 100% / 100% / 100% / 100% | 100% / 100% / 100% / 100% / 100% |
No multi-turn coherence loss and no higher-context degradation up to 32k tokens.
SWE-bench Lite (oracle-file-context, single-shot, n=30, same instances)
Terminology: submitted = instances attempted; completed = the model's patch applied cleanly and the test suite ran (i.e. a valid pass/fail verdict was reached); resolved ⊂ completed = the previously-failing tests now pass; patch-apply errors = the generated diff did not apply cleanly, so no verdict was produced.
| Base | Abliterated | |
|---|---|---|
| Submitted | 30 | 30 |
| Completed | 21 | 19 |
| Resolved | 4 (13.3%) | 4 (13.3%) |
| Unresolved (completed, tests still failing) | 17 | 15 |
| Patch-apply errors | 9 | 11 |
Identical resolve rate (3 of 4 resolved instances overlap) — no degradation in agentic-style code repair. The abliterated model produced 2 more patch-apply errors (malformed diffs) and 2 fewer completions; those 2 instances shifted from "completed-but-unresolved" on base to "patch-apply error" on abliterated, which is within run-to-run variance for single-shot diff generation and does not change the resolved count.
DSpark speculative decoding (post-abliteration)
The mtp.wo_b draft head was edited with the same projection applied to the decoder (deepest-layer basis). Speculative decoding remains functional and healthy. Validated end-to-end on this release with the v9 serving stack (voipmonitor/vllm:eldritch-enlightenment-ds4dspark-v9-…-20260703), TP=2, lucifer-cutlass backend (the strongest TP2 DSpark decode backend in the v9 sweep):
- DSpark ON, single-stream decode: ~238 tok/s (512 tokens in ~2.1 s, temp=0) — matches the v9 guide's TP2
lucifer-cutlassDSpark decode row (227.7 tok/s); aggregate mixed-workload decode ≈ 4290 tok/s. - DSpark draft acceptance on this release: 50.8% at the native 5 draft tokens (see table below) — at parity with the unedited base model, confirming the weight edit did not desynchronize the draft head from the target.
- Coherence spot-checks (temp=0) correct; no kernel asserts under
lucifer-cutlass.
Draft acceptance (accepted draft tokens / total draft tokens; 100-prompt mixed workload, probabilistic draft sampling):
num_speculative_tokens |
Base | Abliterated |
|---|---|---|
| 5 (native) | 51.1% | 51.3% |
| 4 | — | 59.0% |
| 3 | — | 66.0% |
At the native 5-token setting, acceptance is at parity with the unedited base model (51.3% vs 51.1%) — the weight edit did not desynchronize the draft head from the target, and 51% is simply DSpark's native acceptance rate at 5 draft tokens on a mixed workload. Acceptance rises with fewer draft tokens (the standard speculative-decoding trade-off: fewer drafts = higher per-token acceptance, marginally lower absolute speedup). The release is served at the native 5-token setting; users wanting a higher acceptance rate can set 59%) or num_speculative_tokens=4 (3 (~66%) at a small throughput cost. Because DSpark verifies every draft token against the abliterated target, the served output distribution is identical for any of these settings — they affect throughput only, not outputs.
A note on honest evaluation
Refusal numbers are only meaningful when the methodology behind them is documented. Our methodology:
- Sufficient generation length — DeepSeek-V4 exhibits a "delayed refusal" pattern (a stretch of educational framing/disclaimers before pivoting to the actual refusal), so short generations undercount refusals; we generate long enough to capture that pivot.
- Hybrid detection — keyword matching for obvious refusals plus an LLM judge (Google Gemini 2.5 Flash via OpenRouter) for ambiguous cases. Neither method alone is sufficient.
- Challenging, diverse prompts — our refusal set spans 1000 prompts across 14 categories, multiple sophistication levels (direct requests to socially-engineered framings), and English / Chinese / mixed languages.
- Paired, full-dataset capability measurement — capability is measured on the entire MMLU-Pro (12032), GSM8K (1319), HumanEval (164) and MBPP (500) test sets for both base and abliterated models, not small samples.
- Documented parameters — generation length, detection method, dataset, λ, rank, and layer coverage are all listed on this card.
The refusal figures above are from a rigorous end-to-end re-evaluation of the edited weights, including the category breakdown so the two safety-critical categories that retain guardrails are visible rather than averaged away.
Files
This release is a complete, standalone, drop-in checkpoint: all 48 safetensors shards are included, plus model.safetensors.index.json, config.json, generation_config.json, tokenizer.json, tokenizer_config.json, LICENSE, and the encoding/ and inference/ folders. It loads directly with vLLM / the DeepSeek-V4 inference path — no files need to be fetched from elsewhere.
The abliteration modified 46 of the 48 shards (the 43 decoder attn.wo_b tensors and the 3 mtp.wo_b draft-head tensors). The remaining 2 shards (model-00001-of-00048.safetensors, model-00045-of-00048.safetensors — embeddings / norm / lm_head) are byte-identical to the base model and are included unchanged so the repo is self-contained. No tokenizer, config, architecture, or inference-path files were modified.
Usage
This abliterated checkpoint is a drop-in replacement for the original weights — it has the exact same architecture, format, chat-template/encoding, and inference path as the released base model deepseek-ai/DeepSeek-V4-Flash-DSpark. Load and serve it however you would the official model (vLLM, the DeepSeek-V4 encoding/inference folders, OpenAI-compatible serving, etc.). The abliteration modified the text-decoder attn.wo_b weights on all 43 layers and the DSpark draft head's mtp.wo_b; the tokenizer, chat encoding, and all other components are unchanged.
For inference guidance specific to the NVIDIA RTX PRO 6000 Blackwell (TP2/TP4, the lucifer-default / lucifer-cutlass / b12x backends, and the native DSpark method=dspark speculative-decoding path with num_speculative_tokens=5), see the community guide:
👉 https://github.com/local-inference-lab/rtx6kpro/blob/master/models/ds4dspark-v9.md
That page documents the validated v9 Docker image, launch helpers, and the full TP2/TP4 throughput sweep (decode + prefill) for this checkpoint family on RTX PRO 6000. This abliterated release was validated against the v9 stack on TP=2 with the lucifer-cutlass backend, reaching ~238 tok/s single-stream decode (the strongest TP2 DSpark decode row in the v9 sweep); see the DSpark section above for measured numbers on this release.
Disclaimer
This model is released for research purposes only — primarily interpretability and safety research, including studying how refusal behavior is encoded in large MoE decoders and how weight-space edits interact with architectures that resist low-rank perturbation. The abliteration process removes safety guardrails on most harm categories, so the model will comply with requests the base model refuses. Use responsibly, in accordance with local laws and the DeepSeek / model terms of use, and do not deploy it in production or user-facing settings without a separate safety layer. The authors take no responsibility for misuse.
- Downloads last month
- -
Model tree for lovesenko/DeepSeek-V4-Flash-DSpark-Abliterated
Base model
deepseek-ai/DeepSeek-V4-Flash-DSpark