
Libraries

How to use sch0tten/Qwen3.6-35B-A3B-research-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below carry an image, so use the multimodal task;
# the "text-generation" pipeline does not accept the text= keyword.
pipe = pipeline("image-text-to-text", model="sch0tten/Qwen3.6-35B-A3B-research-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("sch0tten/Qwen3.6-35B-A3B-research-FP8")
model = AutoModelForMultimodalLM.from_pretrained("sch0tten/Qwen3.6-35B-A3B-research-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use sch0tten/Qwen3.6-35B-A3B-research-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sch0tten/Qwen3.6-35B-A3B-research-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sch0tten/Qwen3.6-35B-A3B-research-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/sch0tten/Qwen3.6-35B-A3B-research-FP8

SGLang

How to use sch0tten/Qwen3.6-35B-A3B-research-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sch0tten/Qwen3.6-35B-A3B-research-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sch0tten/Qwen3.6-35B-A3B-research-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sch0tten/Qwen3.6-35B-A3B-research-FP8" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use sch0tten/Qwen3.6-35B-A3B-research-FP8 with Docker Model Runner:
```
docker model run hf.co/sch0tten/Qwen3.6-35B-A3B-research-FP8
```

Access is restricted to security and ML research

This is a compliance-reduced (abliterated) model released solely for security and machine-learning research — agent sandbox/isolation evaluation, containment red-teaming, safety and robustness testing, and capability/efficiency benchmarking. Access requires acknowledging the terms below. Requests are reviewed and approved manually.

By requesting access you confirm you will use this model only for lawful security and machine-learning research, inside isolated / non-production environments, and only against systems you own or are explicitly authorized to test. You will not deploy it in production, expose it to untrusted users or the open internet, or use it to cause harm. No safety guarantees are provided beyond the base model; quantization adds none.

Qwen3.6-35B-A3B-research-FP8 (dynamic)

FP8 (W8A8-dynamic) quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved (upstream model name preserved for attribution), an abliterated derivative of Qwen/Qwen3.6-35B-A3B (35B-total / 3B-active hybrid Mamba-attention MoE, 256 experts).

This card exists solely for security research. Its purpose is to document the quantization-recipe strategy (below) as applied to a non-classic model — a hybrid Mamba-attention MoE — where the usual FP8 schemes fail. It is not a distributed model: the weights are not released for use or redistribution, access is manually gated and restricted to research, and what is being highlighted here is the method, not a product. It is compliance-reduced / abliterated and carries no safety guarantees — read Intended use & responsible use.

Why this model exists — research context

I'm evaluating the performance/efficiency frontier of agentic LLMs operating inside secure sandboxes, and using the agent itself as the adversary against the sandbox. Two threads:

1. Isolation-backend benchmark. Running an autonomous, tool-using agent as the in-sandbox workload while comparing the isolation/runtime layers that wrap it:
   - runc / standard OCI containers (shared-kernel baseline)
   - Kata Containers (per-workload VM isolation)
   - Cloud Hypervisor (CLH)
   - QEMU/KVM
   - Firecracker (microVM)

   The interest is the trade-off curve: cold-start latency, per-turn tool-call overhead, memory footprint, and throughput penalty of each boundary, measured under a realistic agent loop (filesystem, shell, network tools) rather than a synthetic benchmark.
2. Sandbox-escape challenge suite. A graded set of tasks that explicitly instruct the agent to break out of its isolation boundary: escalate from the container/VM to the host, reach a forbidden network segment, tamper with the orchestration layer, or exfiltrate a planted secret. A compliance-reduced model is the right instrument here, because an agent that refuses the task tells you nothing about whether the boundary holds. The model is the maximally cooperative attacker; the thing under test is the isolation layer's ability to contain it.

A compliance-reduced, tool-call-reliable, long-context model that fits one workstation GPU is what this work needs. FP8 is what makes it fit with serving headroom to spare.

Why FP8-dynamic specifically

- Footprint: ~34 GB on disk (down from ~67 GB BF16), leaving ample VRAM on a single 96 GB-class Blackwell card for a 131072-token KV pool plus a co-resident auxiliary model.
- Quality: dynamic per-token activation scaling with per-channel weight scales is near-lossless for instruction/agentic use and avoids calibration-data bias.
- Portability: compressed-tensors float-quantized is first-class in vLLM and does not require a Blackwell-only kernel path, unlike NVFP4 MoE.
- MoE-safe recipe: per-channel/per-token (not per-tensor or block-128), which sidesteps the MoE expert dimension-mismatch and block-shape failures that block other FP8 schemes on this 256-expert architecture.
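
To make the per-channel / per-token distinction concrete, here is a minimal pure-Python sketch of the arithmetic. It is an illustration only, not the llm-compressor or vLLM implementation: `round_to_e4m3`, `quantize_rows`, and `fp8_linear` are hypothetical helper names, the toy matrices are made up, and the rounding emulation assumes the E4M3 "fn" variant with a largest finite value of 448.

```python
import math

E4M3_MAX = 448.0  # largest finite value of FP8 E4M3 (assumed "fn" variant)


def round_to_e4m3(x: float) -> float:
    """Emulate rounding to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)       # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)         # clamp to the minimum normal exponent
    step = 2.0 ** (exp - 3)      # 3 mantissa bits: 8 steps per binade
    q = min(round(mag / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize_rows(rows):
    """One scale per row: a row is an output channel (weights) or a token (activations)."""
    scales, quantized = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def fp8_linear(x, qw, w_scales):
    """y = x @ W^T with static weight scales and activation scales computed per call."""
    qx, x_scales = quantize_rows(x)  # dynamic: derived from the live activations
    return [
        [sum(a * b for a, b in zip(qx_t, qw_c)) * sx * sw
         for qw_c, sw in zip(qw, w_scales)]
        for qx_t, sx in zip(qx, x_scales)
    ]


# Weights are quantized once, offline, with no calibration data.
W = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]    # 2 output channels x 3 inputs
qW, w_scales = quantize_rows(W)

X = [[1.0, 0.25, -0.5], [100.0, -50.0, 25.0]]  # 2 tokens with very different ranges
Y = fp8_linear(X, qW, w_scales)
```

Because every output channel and every token gets its own scale, a small-magnitude row still uses the full FP8 range instead of being crushed by a single tensor-wide maximum, and no expert needs a weight shape that divides evenly into blocks.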

Quantization details


| Property | Value |
|---|---|
| Method | `compressed-tensors`, `float-quantized` (FP8 E4M3) |
| Weights | 8-bit float, per-channel, static |
| Activations | 8-bit float, per-token, dynamic |
| Tool | `llm-compressor` 0.11.0, `QuantizationModifier(scheme="FP8_DYNAMIC")` |
| Calibration | None (data-free) |
| Kept in BF16 (ignore-list) | every MoE router (`mlp.gate`), every `shared_expert_gate`, all norms, `lm_head`, `embed_tokens`, and the `mtp.*` tensors |
| Architecture | `Qwen3_5MoeForConditionalGeneration`, 40 layers (30 linear/Mamba + 10 full-attention), 256 experts / 8 active, head_dim 256 |
| Native context | 262144 (served here at 131072; see notes) |
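
The settings listed above translate roughly into the following llm-compressor call. This is a sketch, not the script used for this quant: the regex patterns in `IGNORE` are assumptions inferred from the ignore-list row, `is_ignored` is a hypothetical helper that approximates how `re:`-prefixed patterns are matched so the list can be checked without loading the model, and the import paths may differ between llm-compressor versions.

```python
import re

# Assumed patterns for the modules the card keeps in BF16. Norms and
# embed_tokens are not Linear modules, so targets="Linear" skips them anyway.
IGNORE = [
    "lm_head",
    "re:.*mlp\\.gate$",           # MoE routers
    "re:.*shared_expert_gate$",
    "re:.*mtp.*",
]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Hypothetical helper: approximate the ignore matching for a dry run."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


def quantize_fp8_dynamic(model, output_dir: str) -> None:
    """Data-free FP8_DYNAMIC pass over an already loaded BF16 model."""
    # Imported lazily so the dry-run helper above works without llm-compressor.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=IGNORE,
    )
    oneshot(model=model, recipe=recipe)  # no dataset: dynamic scales need no calibration
    model.save_pretrained(output_dir, save_compressed=True)
```

Running `is_ignored` over the names from `model.named_modules()` before quantizing is a cheap way to confirm that routers stay in BF16 while the expert projections are still picked up.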

MTP / speculative decoding: the upstream BF16 checkpoint preserves the multi-token-prediction head, but AutoModelForCausalLM drops mtp.* tensors at load time, so they are not present in this quant and NEXTN/MTP speculative decoding is not available here. An MTP-preserving re-quant (grafting the 19 mtp.* tensors back as a sidecar) is a possible follow-up.
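
Whether a given checkpoint still carries the MTP head can be checked from its safetensors index without loading any weights. The sketch below assumes a sharded checkpoint with the standard `model.safetensors.index.json` file and MTP tensors named with a top-level `mtp.` prefix, as described above; `count_tensors_by_prefix` and `has_mtp_head` are hypothetical helper names.

```python
import json
from collections import Counter
from pathlib import Path


def count_tensors_by_prefix(checkpoint_dir: str) -> Counter:
    """Count tensor names per top-level prefix in a sharded safetensors index."""
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    weight_map = json.loads(index_path.read_text())["weight_map"]
    return Counter(name.split(".", 1)[0] for name in weight_map)


def has_mtp_head(checkpoint_dir: str) -> bool:
    """True if any tensor name starts with the `mtp.` prefix."""
    return count_tensors_by_prefix(checkpoint_dir)["mtp"] > 0
```

Calling `has_mtp_head` on a local copy of the upstream BF16 checkpoint and on the quantized output is a quick way to confirm where the tensors were lost.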

Serving (vLLM)

Validated on vLLM 0.22.1, single RTX PRO 6000 Blackwell (sm_120):

vllm serve <this-repo> \
  --served-model-name qwen36-35b-research \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.50 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-prefix-caching --enable-chunked-prefill

Measured single-stream: ~99k-token prefill in ~2.2 s; ~195 tok/s decode; 20/20 tool calls returned well-formed JSON arguments. Thinking mode is on by default; chat_template_kwargs={"enable_thinking": false} disables it, and {"preserve_thinking": true} retains historical reasoning across turns.
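
A minimal client sketch for the server started above, using only the standard library. It assumes the server listens on vLLM's default port 8000 and uses the served model name from the command; the `read_file` tool is hypothetical and exists only to show the request shape, and `build_request`, `chat`, and `parse_tool_calls` are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: vLLM default port
MODEL = "qwen36-35b-research"          # matches --served-model-name

# Hypothetical tool for illustration; not part of the card.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file inside the sandbox.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_request(prompt: str, enable_thinking: bool = True) -> dict:
    """Build an OpenAI-compatible chat request with one tool attached."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
        "tool_choice": "auto",
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def chat(payload: dict) -> dict:
    """POST the payload to the chat completions endpoint and decode the reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def parse_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs, decoding each JSON argument string."""
    calls = response["choices"][0]["message"].get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]
```

With the server running, `parse_tool_calls(chat(build_request("Show me /etc/hostname", enable_thinking=False)))` returns the decoded calls; a `json.JSONDecodeError` from `parse_tool_calls` is the signal that a tool call came back with malformed arguments.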

A note on context length

Served at 131072 rather than the native 262144 on purpose: usable attention quality degrades well before the nominal window on long-context models, and an agent should live in the high-quality range. Raise it if your workload needs more and you've validated the quality.
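
The memory side of that decision is simple arithmetic, sketched below. Only the full-attention layers keep a per-token KV cache; the layer count (10) and head_dim (256) come from the table above, while `num_kv_heads=2` and a 2-byte (BF16) KV cache are assumptions, since the card does not state them. Check the checkpoint's `config.json` before relying on the numbers.

```python
def kv_bytes_per_token(full_attn_layers=10, num_kv_heads=2, head_dim=256, dtype_bytes=2):
    """Bytes of K and V stored per token across the full-attention layers."""
    return 2 * full_attn_layers * num_kv_heads * head_dim * dtype_bytes


def kv_gib(context_tokens, **kwargs):
    """KV cache size in GiB for one sequence of the given length."""
    return context_tokens * kv_bytes_per_token(**kwargs) / 2**30


for ctx in (131072, 262144):
    print(f"{ctx:>7} tokens -> {kv_gib(ctx):.2f} GiB per sequence")
```

Under these assumptions a full 131072-token sequence costs 2.50 GiB of KV cache and the native 262144 window costs 5.00 GiB, so the cost of raising `--max-model-len` grows linearly with the window.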

SGLang caveat

This checkpoint loads in SGLang but the compressed-tensors FP8 MoE path falls back to a Triton fused-MoE kernel that has no tuned config for sm_120 + 256 experts (requests ~147 KB shared memory vs the card's 101 KB limit). Serve it with vLLM. Dense Qwen3.6 FP8 quants are unaffected.

Intended use & responsible use

Solely security research. This card documents a quantization-recipe strategy; it is not a distribution of usable model weights. The author does not distribute these weights for production or end-user use, and does not consent to redistribution. Intended audience is qualified researchers studying quantization methods, LLM safety/alignment robustness, and abliteration as an attack vector against open weights — working inside isolated, non-production environments with no access to real user data or systems.

This is a compliance-reduced model: its safety refusals have been substantially removed by the upstream abliteration. It will attempt harmful, unsafe, or escape-oriented instructions by design — that property is what makes it useful as a research instrument, and also the reason it must not be exposed to untrusted users or the open internet, used in production, redistributed, or used to act against systems you do not own and are not authorized to test. You are responsible for compliant, lawful use. No additional safety guarantees over the base model are provided or implied; quantization does not add safety.

Lineage & licenses

Base: Qwen/Qwen3.6-35B-A3B — Apache-2.0
Abliteration: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved (Heretic v1.3.0) — Apache-2.0
This quant: Apache-2.0. Tooling: llm-compressor, compressed-tensors.

Safetensors: model size 35B params; tensor types BF16 and F8_E4M3.
