Access is restricted to security and ML research
This is a compliance-reduced (abliterated) model released solely for security and machine-learning research — agent sandbox/isolation evaluation, containment red-teaming, safety and robustness testing, and capability/efficiency benchmarking. Access requires acknowledging the terms below. Requests are reviewed and approved manually.
By requesting access you confirm you will use this model only for lawful security and machine-learning research, inside isolated / non-production environments, and only against systems you own or are explicitly authorized to test. You will not deploy it in production, expose it to untrusted users or the open internet, or use it to cause harm. No safety guarantees are provided beyond the base model; quantization adds none.
Qwen3.6-35B-A3B-research-FP8 (dynamic)
FP8 (W8A8-dynamic) quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved (upstream model name preserved for attribution), an abliterated derivative of Qwen/Qwen3.6-35B-A3B (35B-total / 3B-active hybrid Mamba-attention MoE, 256 experts).
This card exists solely for security research. Its purpose is to document the quantization-recipe strategy (below) as applied to a non-classic model — a hybrid Mamba-attention MoE — where the usual FP8 schemes fail. It is not a distributed model: the weights are not released for use or redistribution, access is manually gated and restricted to research, and what is being highlighted here is the method, not a product. It is compliance-reduced / abliterated and carries no safety guarantees — read Intended use & responsible use.
Why this model exists — research context
I'm evaluating the performance/efficiency frontier of agentic LLMs operating inside secure sandboxes, and using the agent itself as the adversary against the sandbox. Two threads:
Isolation-backend benchmark. Running an autonomous, tool-using agent as the in-sandbox workload while comparing the isolation/runtime layers that wrap it:
- runc / standard OCI containers (shared-kernel baseline)
- Kata Containers (per-workload VM isolation)
- Cloud Hypervisor (CLH)
- QEMU/KVM
- Firecracker (microVM)
The interest is the trade-off curve: cold-start latency, per-turn tool-call overhead, memory footprint and throughput penalty of each boundary, measured under a realistic agent loop (filesystem, shell, network tools) rather than a synthetic benchmark.
Sandbox-escape challenge suite. A graded set of tasks that explicitly instruct the agent to break out of its isolation boundary — escalate from the container/VM to the host, reach a forbidden network segment, tamper with the orchestration layer, exfiltrate a planted secret. A compliance-reduced model is the right instrument here: an agent that refuses the task tells you nothing about whether the boundary holds. The model is the maximally-cooperative attacker; the thing under test is the isolation layer's ability to contain it.
A compliance-reduced, tool-call-reliable, long-context model that fits one workstation GPU is what this work needs. FP8 is what makes it fit with serving headroom to spare.
Why FP8-dynamic specifically
- Footprint: ~34 GB on disk (down from ~67 GB BF16), leaving ample VRAM on a single 96 GB-class Blackwell card for a 131072-token KV pool plus a co-resident auxiliary model.
- Quality: dynamic per-token activation scaling with per-channel weight scales is near-lossless for instruction/agentic use and avoids calibration-data bias.
- Portability: compressed-tensors float-quantized is first-class in vLLM and does not require a Blackwell-only kernel path, unlike NVFP4 MoE.
- MoE-safe recipe: per-channel/per-token (not per-tensor or block-128), which sidesteps the MoE expert dimension-mismatch and block-shape failures that block other FP8 schemes on this 256-expert architecture.
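To make the per-channel / per-token wording concrete, here is a minimal pure-Python simulation of the scaling scheme. It is a sketch, not the kernel vLLM runs: `round_e4m3`, `quantize_rows` and `fp8_linear` are illustrative names, saturation at ±448 is an assumption about overflow handling, and the toy matrices are made up.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (the "fn" variant)


def round_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**ex with 0.5 <= m < 1, so floor(log2(mag)) = ex - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # below 2**-6 the format is subnormal
    step = 2.0 ** (exponent - 3)                # 3 mantissa bits: 8 steps per binade
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_rows(rows):
    """Give every row its own scale so its largest magnitude maps to 448."""
    codes, scales = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        codes.append([round_e4m3(v / scale) for v in row])
    return codes, scales


def fp8_linear(activations, weights):
    """y[t][o] = sum_k x[t][k] * w[o][k], computed on FP8 codes and then rescaled."""
    w_codes, w_scales = quantize_rows(weights)      # per output channel, fixed once
    x_codes, x_scales = quantize_rows(activations)  # per token, recomputed on each call
    return [
        [
            sum(a * b for a, b in zip(x_row, w_row)) * x_s * w_s
            for w_row, w_s in zip(w_codes, w_scales)
        ]
        for x_row, x_s in zip(x_codes, x_scales)
    ]


# Toy data: two output channels and two tokens with very different ranges.
weights = [[0.02, -0.01, 0.03], [4.0, -2.0, 1.0]]
activations = [[1.0, 0.5, 0.25], [100.0, -50.0, 25.0]]

exact = [[sum(a * b for a, b in zip(x, w)) for w in weights] for x in activations]
approx = fp8_linear(activations, weights)
```

Because each row carries its own scale, the small-range channel and the large-range token both use the full FP8 range. A single per-tensor scale would push the small channel toward the subnormal end of the format.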
Quantization details
| Property | Value |
|---|---|
| Method | compressed-tensors, float-quantized (FP8 E4M3) |
| Weights | 8-bit float, per-channel, static |
| Activations | 8-bit float, per-token, dynamic |
| Tool | llm-compressor 0.11.0, QuantizationModifier(scheme="FP8_DYNAMIC") |
| Calibration | None (data-free) |
| Kept in BF16 (ignore-list) | every MoE router (mlp.gate), every shared_expert_gate, all norms, lm_head, embed_tokens, and the mtp.* tensors |
| Architecture | Qwen3_5MoeForConditionalGeneration, 40 layers (30 linear/Mamba + 10 full-attention), 256 experts / 8 active, head_dim 256 |
| Native context | 262144 (served here at 131072 — see notes) |
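The ignore-list row is easiest to check by running candidate patterns against module names. The sketch below assumes the compressed-tensors convention that a `re:` prefix marks a regular expression and any other entry is an exact name. Both the patterns and the module names are hypothetical stand-ins, not the strings used for this quant.

```python
import re

# Hypothetical patterns mirroring the "Kept in BF16" row above.
IGNORE = [
    "lm_head",
    r"re:.*embed_tokens$",
    r"re:.*mlp\.gate$",            # MoE router; must not catch the experts' gate_proj
    r"re:.*shared_expert_gate$",
    r"re:.*norm$",
    r"re:^mtp\..*",
]


def is_ignored(name: str, patterns=IGNORE) -> bool:
    """Return True if the module name should stay in BF16."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.shared_expert_gate",
    "model.layers.0.mlp.experts.17.gate_proj",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.input_layernorm",
    "lm_head",
]:
    print(f"{name:45s} {'BF16' if is_ignored(name) else 'FP8'}")
```

The anchored `mlp\.gate$` pattern is the detail worth testing: an unanchored `gate` would also match every expert's `gate_proj` and leave most of the MoE weights unquantized.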
MTP / speculative decoding: the upstream BF16 checkpoint preserves the multi-token-prediction head, but AutoModelForCausalLM drops mtp.* tensors at load time, so they are not present in this quant and NEXTN/MTP speculative decoding is not available here. An MTP-preserving re-quant (grafting the 19 mtp.* tensors back as a sidecar) is a possible follow-up.
Serving (vLLM)
Validated on vLLM 0.22.1, single RTX PRO 6000 Blackwell (sm_120):
vllm serve <this-repo> \
--served-model-name qwen36-35b-research \
--max-model-len 131072 \
--gpu-memory-utilization 0.50 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--enable-prefix-caching --enable-chunked-prefill
Measured single-stream: ~99k-token prefill in ~2.2 s; ~195 tok/s decode; 20/20 tool calls returned well-formed JSON arguments. Thinking mode is on by default; chat_template_kwargs={"enable_thinking": false} disables it, and {"preserve_thinking": true} retains historical reasoning across turns.
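A minimal client sketch for the thinking toggle, using only the standard library. It assumes the server is reachable on vLLM's default `http://localhost:8000` and that `chat_template_kwargs` is passed as a top-level field of the request body. `build_chat_request` and `send` are illustrative helpers, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "qwen36-35b-research"          # matches --served-model-name above


def build_chat_request(prompt, enable_thinking=True, preserve_thinking=False, max_tokens=256):
    """Build an OpenAI-style chat payload with the template switches from this card."""
    template_kwargs = {"enable_thinking": enable_thinking}
    if preserve_thinking:
        template_kwargs["preserve_thinking"] = True
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "chat_template_kwargs": template_kwargs,
    }


def send(payload, base_url=BASE_URL):
    """POST the payload to the chat completions endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request("List three container isolation runtimes.", enable_thinking=False)
# reply = send(payload)  # needs the server from the command above to be running
# print(reply["choices"][0]["message"]["content"])
```

With the `openai` Python client the same switch would normally go through `extra_body` rather than a top-level field.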
A note on context length
Served at 131072 rather than the native 262144 on purpose: usable attention quality degrades well before the nominal window on long-context models, and an agent should live in the high-quality range. Raise it if your workload needs more and you've validated the quality.
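Before raising the window, a rough KV-cache estimate helps. The sketch below counts only the 10 full-attention layers, on the understanding that the linear/Mamba layers keep a fixed-size state that does not grow with context. The KV head count and the BF16 cache dtype are assumptions not stated in this card; read the real values from the checkpoint's config.json.

```python
def kv_bytes_per_token(attn_layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """One K and one V vector per KV head, per full-attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem


ATTN_LAYERS = 10  # from the table: 10 full-attention layers
HEAD_DIM = 256    # from the table
KV_HEADS = 2      # ASSUMPTION: not stated in this card
KV_BYTES = 2      # ASSUMPTION: BF16 KV cache

per_token = kv_bytes_per_token(ATTN_LAYERS, KV_HEADS, HEAD_DIM, KV_BYTES)
for context in (131072, 262144):
    gib = per_token * context / 2**30
    print(f"{context:>7} tokens: {gib:.2f} GiB of KV cache per sequence")

# Rough headroom at --gpu-memory-utilization 0.50 on a 96 GB card with ~34 GB of
# weights. It ignores activations, CUDA graphs and the Mamba state.
budget_gb = 0.50 * 96 - 34
print(f"approx. {budget_gb:.0f} GB left for KV cache and runtime overhead")
```

Under these assumptions memory is not the binding constraint at either window, which is consistent with the choice above being about attention quality rather than capacity.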
SGLang caveat
This checkpoint loads in SGLang but the compressed-tensors FP8 MoE path falls back to a Triton fused-MoE kernel that has no tuned config for sm_120 + 256 experts (requests ~147 KB shared memory vs the card's 101 KB limit). Serve it with vLLM. Dense Qwen3.6 FP8 quants are unaffected.
Intended use & responsible use
Solely security research. This card documents a quantization-recipe strategy; it is not a distribution of usable model weights. The author does not distribute these weights for production or end-user use, and does not consent to redistribution. Intended audience is qualified researchers studying quantization methods, LLM safety/alignment robustness, and abliteration as an attack vector against open weights — working inside isolated, non-production environments with no access to real user data or systems.
This is a compliance-reduced model: its safety refusals have been substantially removed by the upstream abliteration. It will attempt harmful, unsafe, or escape-oriented instructions by design — that property is what makes it useful as a research instrument, and also the reason it must not be exposed to untrusted users or the open internet, used in production, redistributed, or used to act against systems you do not own and are not authorized to test. You are responsible for compliant, lawful use. No additional safety guarantees over the base model are provided or implied; quantization does not add safety.
Lineage & licenses
- Base: Qwen/Qwen3.6-35B-A3B, Apache-2.0
- Abliteration: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved (Heretic v1.3.0), Apache-2.0
- This quant: Apache-2.0. Tooling: llm-compressor, compressed-tensors.