Instructions to use AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4 with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```
# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4", trust_remote_code=True, dtype="auto")
```

Local Apps

vLLM

How to use AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4 with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Use Docker

```
docker model run hf.co/AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4
```

SGLang

How to use AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4 with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Docker Model Runner
How to use AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4 with Docker Model Runner:
```
docker model run hf.co/AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4
```

⚠️ KNOWN BROKEN — do not use for inference yet (fix in progress)

This NVFP4 build currently produces garbled output. Users have confirmed this on a correct, up-to-date vLLM with Step-3.7 support (including --moe-backend cutlass). The stock stepfun NVFP4 works on the identical config, so the fault lies in our abliterated weights, not in the vLLM version.

Root cause: our Expert-Granular Abliteration interacts badly with low-bit quantization. This model garbles at NVFP4 (4-bit) and GGUF (3-bit), while the BF16 abliterated weights are coherent. The ablation zeroes a residual-stream subspace; that zeroing is exact at BF16, but low-bit quantization noise re-corrupts the subspace, and the error compounds to garbage across layers.

✅ Use instead: the BF16 release (coherent + uncensored). A re-quantized fix (a milder ablation that survives low-bit) is being validated; this repo will be replaced or withdrawn once fixed.

Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4

NVFP4 (4-bit experts) quantization of AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-BF16 — the refusal-ablated build of stepfun-ai/Step-3.7-Flash (198B params / ~11B active sparse-MoE vision-language thinking model). Refusals are removed via Expert-Granular Abliteration; the experts are packed to NVFP4 for 2× DGX Spark deployment.

Intended use: authorized safety research, red-teaming, and uncensored local deployment by the model owner. Apache-2.0 carries over. Removing refusals does not remove responsibility.

User Responsibility & Arbitration Clause

By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:

Sole Responsibility. You, the user, are solely and exclusively responsible for (a) every prompt you or your downstream system issue to this model, (b) every response this model produces in reply, (c) every downstream action taken by you, your systems, your agents, or your users in reliance on those responses, and (d) any harm — direct, indirect, consequential, foreseeable, or otherwise — that results from any of the above.
No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.
Legal Compliance. You are responsible for ensuring that your use of this model complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.
Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.
Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater, not lesser, caution, forethought, and ethical discipline when operating this model than you would when operating a base aligned model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.
No Endorsement of Outputs. The authors, contributors, and publishers of this model do not endorse, adopt, or take responsibility for any specific output this model produces. Outputs are a stochastic function of the prompt, the weights, and the sampler state — not a statement of position by any human.
Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.
Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.
Severability. If any provision of this clause is held unenforceable in a given jurisdiction, the remaining provisions remain in full force in that jurisdiction, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.
Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.

This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.

TL;DR

Quant scheme: NVFP4 experts-only, W4A4 — matches the official stepfun-ai/Step-3.7-Flash-NVFP4 format byte-for-byte, so it serves natively (Step3p7ForConditionalGeneration, --quantization modelopt). Routed expert gate/up/down_proj → FP4 E2M1 + FP8-E4M3 per-16 block scale + FP32 per-expert global scale. Everything else (attention, shared expert, dense layers 0–2, router, lm_head, vision tower) stays BF16; KV cache FP8. ~124 GB.
Abliteration preserved through the quant. Build = the official NVFP4 with only the abliteration-touched residual-writers surgically replaced: expert down_proj re-quantized from the abliterated weights, and the BF16 residual-writers (self_attn.o_proj, dense mlp.down_proj, share_expert.down_proj) copied from the abliterated checkpoint. Untouched tensors (expert up/gate, q/k/v, routers, norms, vision) are the official's, verbatim.
Requant verified against the official. Re-quantized down_proj round-trip error = 0.095 (relative L2), identical to the official's own per-tensor error, so the requant adds no error beyond what the NVFP4 format itself introduces. Our FP4 packing is bit-identical to stepfun's modulo a cosmetic global-scale formula (see Quantization methodology below).
Sibling builds: BF16 (source of truth) · AWQ-INT4 (single-Spark) released separately.

Quantization methodology

This is a from-the-abliterated-weights NVFP4 build, assembled to be format-identical to the official release so it drops straight into vLLM's native Step-3 path.

Why surgical replace (not a blind re-quant)

Expert-Granular Abliteration only edits the output axis of residual-writing matrices. In an MoE expert that is down_proj only — up_proj/gate_proj write to the expert's internal hidden state, not the residual stream, so abliteration leaves them byte-identical to the base model. We exploited this:

Expert down_proj (the one quantized tensor that abliteration changes) → re-quantized from the abliterated BF16 with modelopt's NVFP4QTensor.quantize (group size 16).
Expert up_proj / gate_proj (unchanged) → kept from the official NVFP4 verbatim, with the official per-expert input_scale.
BF16 residual-writers abliteration touches (o_proj, dense mlp.down_proj, share_expert.down_proj) → copied from the abliterated checkpoint.
Everything else → official, verbatim.
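
The source-selection rule in the list above can be sketched as a small key classifier. This is a minimal illustration, not the actual build script: the `.experts.` key fragment and the example key names are assumptions about the checkpoint layout, and scale tensors stored beside a weight are treated like the weight itself, which is a simplification.

```python
# Sketch of the per-tensor source policy described above.
# ASSUMPTION: routed-expert keys contain ".experts."; real key names may differ.

REQUANT = "requantize from abliterated BF16"
COPY_ABLITERATED = "copy from abliterated BF16"
KEEP_OFFICIAL = "keep official tensor verbatim"

# BF16 residual-writers named in the text above.
BF16_WRITERS = (".self_attn.o_proj.", ".mlp.down_proj.", ".share_expert.down_proj.")


def pick_source(key: str) -> str:
    """Return where the tensor stored under `key` should come from."""
    is_routed_expert = ".experts." in key
    if is_routed_expert and ".down_proj." in key:
        # The one quantized tensor that abliteration changes.
        return REQUANT
    if is_routed_expert:
        # up_proj / gate_proj do not write to the residual stream.
        return KEEP_OFFICIAL
    if any(fragment in key for fragment in BF16_WRITERS):
        return COPY_ABLITERATED
    return KEEP_OFFICIAL


if __name__ == "__main__":
    # Hypothetical key names, for illustration only.
    for name in (
        "model.language_model.layers.10.moe.experts.5.down_proj.weight",
        "model.language_model.layers.10.moe.experts.5.up_proj.weight",
        "model.language_model.layers.10.self_attn.o_proj.weight",
        "model.language_model.layers.10.self_attn.q_proj.weight",
    ):
        print(f"{name} -> {pick_source(name)}")
```

Checking the routed-expert case first keeps an expert `down_proj` from being mistaken for a dense `mlp.down_proj` if the real layout nests experts under an `mlp` prefix.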

Correctness verification (built into the build)

Key-map / alignment: the abliterated checkpoint uses flat model.layers.* keys; the NVFP4 export nests under model.language_model.*. The output keeps the official key namespace (nothing renamed in the artifact); the remap is only an internal value-lookup. A diff over dense + MoE + late layers confirmed that all 90 sampled untouched tensors match the official to within 1e-4 (a misaligned map would not) and that the only changed tensors are exactly the residual-writers.
Packing fidelity: re-quantizing an unchanged up_proj and forcing the official global scale reproduces the official packed bytes exactly (bytediff 0.000) — confirming our E2M1 LUT [0, ±.5, ±1, ±1.5, ±2, ±3, ±4, ±6], block-scale math, and rounding are identical to stepfun's modelopt 0.45-dev. The lone difference is the global per-expert weight_scale_2 formula (a headroom choice), which has no quality impact: round-trip error is 0.095 either way, and down_proj is the lowest-outlier component (outlier ratio ~1.7), so the choice is immaterial there.
End-to-end artifact fidelity (post-build, verified): every shipped expert down_proj dequantizes to the abliterated BF16 at relative-L2 0.095 — uniform across layers 3–44, including the refusal-mediating band (L24/L37). The refusal-removed weights are faithfully in the 4-bit artifact, and the quant noise is generic (not refusal-aligned). Shipped self_attn.o_proj equals the abliterated weights exactly and expert up_proj equals the official exactly — the surgical replace landed as intended.
Behavioral refusal rate — deployment-time. Step-3.7 requires StepFun's vllm/vllm-openai:stepfun37 image (stock vLLM ≤0.22 lacks the Step3p7 arch), so the live harmful / over-refusal suite and the prefill refusal-subspace probe (which collapsed d≈10→0.35 on the BF16 parent) are re-run at deployment. The weight-level evidence above is the pre-upload gate; since 4-bit can in principle restore refusal in late layers, the probe is the recommended deployment check.
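
The relative-L2 round-trip metric quoted above can be illustrated with a simplified fake-quantizer built on the same E2M1 grid and per-16 blocks. This is a sketch under stated assumptions, not modelopt's `NVFP4QTensor.quantize`: the block scale is kept as a Python float instead of being rounded to FP8-E4M3, the FP32 global scale is omitted, rounding ties go to the smaller magnitude, and the input is random Gaussian data rather than real weights, so the printed number need not match 0.095.

```python
import math
import random

# E2M1 magnitudes from the LUT quoted above; the sign is handled separately.
E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements per block scale


def quantize_block(block):
    """Scale the block so its largest magnitude maps to 6, then snap to E2M1."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / 6.0  # SIMPLIFICATION: not rounded to FP8-E4M3
    out = []
    for v in block:
        mag = min(E2M1, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out


def fake_quantize(values):
    """Quantize and immediately dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), BLOCK):
        out.extend(quantize_block(values[start:start + BLOCK]))
    return out


def relative_l2(ref, approx):
    """||ref - approx||_2 / ||ref||_2."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, approx)))
    den = math.sqrt(sum(a * a for a in ref))
    return num / den


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # stand-in for a down_proj slice
err = relative_l2(weights, fake_quantize(weights))
print(f"relative L2 round-trip error: {err:.3f}")
```

Running the same comparison on a real tensor would mean dequantizing the shipped packed weights with their stored scales and comparing against the abliterated BF16 tensor.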

Serving

```
vllm serve AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4 \
    --quantization modelopt --tensor-parallel-size 2 \
    --trust-remote-code --kv-cache-dtype fp8
```

~124 GB on disk → fits 2× DGX Spark (TP=2) within the 0.88 unified-memory cap. Single-Spark users want the AWQ-INT4 sibling.

Abliteration (inherited from the BF16 parent — full detail there)

Method: 12-D refusal-subspace orthogonalization extended to MoE via Expert-Granular Abliteration — the subspace is projected out of every one of the 288 routed experts per layer (W ← W − R(RᵀW)), not just the dense path. Required here because this model's sigmoid router discriminates harmful vs. harmless at 0.975 mean accuracy (refusal is partly routing-mediated; single-direction/global abliteration fails on large-expert MoEs).
Subspace: difference-of-means at template-aligned positions (the <think> onset), where refusal is razor-sharp (Cohen's d ≈ 10), concentrated in layers ~23–37.
Verification (BF16): refusal-subspace separation collapsed from d ≈ 10 → 0.35; residual energy along R driven to ~1.6e-3 (bf16 floor). embed_tokens/lm_head deliberately spared.
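
The update W ← W − R(RᵀW) can be sketched on a toy matrix. This is a minimal illustration under stated assumptions: the dimensions are tiny, the subspace is 2-D rather than 12-D, R is built from random directions via Gram-Schmidt rather than from difference-of-means activations, and all helper names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors):
    """Gram-Schmidt: turn direction vectors into an orthonormal basis R."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(W, R):
    """Return W - R (R^T W); W has shape (d_model, d_in), rows = output axis."""
    d_model, d_in = len(W), len(W[0])
    out = [row[:] for row in W]
    for r in R:
        # One row of R^T W: the component of each input column along r.
        coeffs = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
        for i in range(d_model):
            for j in range(d_in):
                out[i][j] -= r[i] * coeffs[j]
    return out


def energy_along(W, R):
    """Frobenius norm of R^T W: how much W still writes into span(R)."""
    total = 0.0
    for r in R:
        for j in range(len(W[0])):
            total += sum(r[i] * W[i][j] for i in range(len(W))) ** 2
    return math.sqrt(total)


rng = random.Random(0)
d_model, d_in, k = 8, 4, 2
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_model)]
R = orthonormalize([[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(k)])
W_abl = project_out(W, R)

print(f"energy along R before: {energy_along(W, R):.3e}")
print(f"energy along R after : {energy_along(W_abl, R):.3e}")
```

In this float64 toy the remaining energy sits at rounding-error level; the ~1.6e-3 floor reported above reflects bf16 precision, which the sketch does not model.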

Known limitations

4-bit can partially restore refusal in the late MoE layers where it is mediated, which is exactly why the refusal probe is re-run post-quant (see Behavioral refusal rate under Correctness verification). The experts-only scheme keeps the abliterated down_proj at 4-bit; should any residual refusal appear, the router_bias suppression lever (top refusal-correlated experts identified per layer) is available.
FP8 KV on a thinking model risks drift over long <think> traces; it's the validated default but watch long-context behavior.
Vision/tool-use spot-checked, not exhaustively benchmarked.
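
The first limitation, that low-bit rounding can put back components an exact projection removed, can be seen in a toy experiment. This is a sketch under loud assumptions: a single random unit direction `r` stands in for the refusal subspace, a symmetric per-row 4-bit integer grid stands in for NVFP4, and the effect is measured on one matrix only, so it does not reproduce any compounding across layers.

```python
import math
import random

rng = random.Random(0)
d_model, d_in = 16, 32

# Hypothetical unit direction on the output (residual-stream) axis.
r = [rng.gauss(0, 1) for _ in range(d_model)]
r_norm = math.sqrt(sum(x * x for x in r))
r = [x / r_norm for x in r]


def leak(W):
    """Norm of r^T W: how strongly W still writes along r."""
    total = 0.0
    for j in range(d_in):
        total += sum(r[i] * W[i][j] for i in range(d_model)) ** 2
    return math.sqrt(total)


def quantize_4bit(W):
    """Symmetric per-row rounding to integer levels -7..7 (stand-in for NVFP4)."""
    out = []
    for row in W:
        scale = max(abs(v) for v in row) / 7.0
        out.append([round(v / scale) * scale for v in row])
    return out


W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_model)]

# Exact single-direction ablation: W - r (r^T W).
coeffs = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
W_abl = [[W[i][j] - r[i] * coeffs[j] for j in range(d_in)] for i in range(d_model)]

W_q = quantize_4bit(W_abl)

print(f"leak before ablation     : {leak(W):.3e}")
print(f"leak after ablation      : {leak(W_abl):.3e}")
print(f"leak after 4-bit rounding: {leak(W_q):.3e}")
```

The rounding error is not constrained to stay orthogonal to `r`, so the quantized matrix writes along `r` again even though the ablated full-precision matrix does not.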

Quantized on 2× NVIDIA B300 from the AEON-Ultimate abliterated BF16. Packing via NVIDIA TensorRT-Model-Optimizer (NVFP4QTensor). Base model © StepFun AI, Apache-2.0.