Qwen3-Guard-Stream-8B — NOESIS AWQ INT4 (backbone-only derivative)
AWQ INT4 quantization of Qwen/Qwen3Guard-Stream-8B.
⚠️ This is a backbone-only derivative: the original 8 safety classifier heads were dropped during quantization because of the `force_arch=Qwen3ForCausalLM` override. The result is a generic Qwen3-8B INT4 model, NOT a working stream safety filter. Apache 2.0 community contribution from AMAImedia.
⚠️ Critical caveat — safety heads stripped
The upstream Qwen/Qwen3Guard-Stream-8B has 8 streaming safety classification heads on top of the standard Qwen3 backbone:
risk_level_category_pre.weight | UNEXPECTED | <-- dropped
query_risk_level_head.weight | UNEXPECTED | <-- dropped
risk_level_head.weight | UNEXPECTED | <-- dropped
query_risk_level_category_layernorm.weight | UNEXPECTED | <-- dropped
risk_level_category_layernorm.weight | UNEXPECTED | <-- dropped
query_category_head.weight | UNEXPECTED | <-- dropped
query_risk_level_category_pre.weight | UNEXPECTED | <-- dropped
category_head.weight | UNEXPECTED | <-- dropped
lm_head.weight | MISSING | <-- re-initialized
The AWQ runner was forced to load this model as Qwen3ForCausalLM (standard architecture) via force_arch_override. This caused:
- All 8 safety heads to be dropped (UNEXPECTED keys in load report)
- `lm_head.weight` to be re-initialized with random values (MISSING key)
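The UNEXPECTED and MISSING labels in the report above follow from a plain set difference between the tensor names stored in the checkpoint and the names the target architecture expects; PyTorch's `load_state_dict` reports the same two groups as `unexpected_keys` and `missing_keys`. The sketch below is illustrative only and is not the AWQ runner's code: `diff_state_dict_keys` is a hypothetical helper, and `model.embed_tokens.weight` stands in for the many backbone keys that match on both sides.

```python
def diff_state_dict_keys(checkpoint_keys, expected_keys):
    """Classify tensor names the way a non-strict checkpoint load reports them."""
    checkpoint, expected = set(checkpoint_keys), set(expected_keys)
    return {
        # Present in the checkpoint, unknown to the target architecture: dropped.
        "UNEXPECTED": sorted(checkpoint - expected),
        # Required by the target architecture, absent from the checkpoint: re-initialized.
        "MISSING": sorted(expected - checkpoint),
    }


# Abbreviated key lists; only two of the eight safety-head tensors are shown.
checkpoint_keys = [
    "model.embed_tokens.weight",  # stand-in for the shared backbone keys
    "risk_level_head.weight",
    "category_head.weight",
]
expected_keys = [
    "model.embed_tokens.weight",
    "lm_head.weight",
]

report = diff_state_dict_keys(checkpoint_keys, expected_keys)
for status, keys in report.items():
    for key in keys:
        print(f"{key:<45}| {status}")
```

Running the same comparison against the full key lists before quantizing would surface the head loss up front, rather than after the bundle is written.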
Implications:
- ❌ This bundle does NOT perform stream safety classification
- ✅ The Qwen3 backbone is still validly INT4-quantized
- ✅ Can be used as a generic Qwen3-8B INT4 base for fine-tuning
- ⚠️ Output text is degenerate due to the random `lm_head` (confirmed by the smoke test)
Specifications
| Field | Value |
|---|---|
| Base model | Qwen/Qwen3Guard-Stream-8B |
| Architecture | Qwen3ForCausalLM (forced; original was Qwen3 + safety heads) |
| Hidden size | 4096 |
| Layers | 36 |
| Attention heads | 32 |
| KV heads | 8 |
| Vocab | 151 936 |
| Context length | 32 768 |
| Format | AWQ INT4 group-128 (GEMM) |
| Bundle size on disk | 5.69 GB (2 shards) |
| Estimated VRAM (inference) | ~5.3 GB (estimated to fit an RTX 3060 6 GB; the smoke test recorded 8.01 GB peak) |
| License | Apache 2.0 (inherited from upstream) |
Quantization details
| Parameter | Value |
|---|---|
| Library | autoawq |
| Tool | gptqmodel 7.0.0 |
| Method | AWQ (Activation-aware Weight Quantization) |
| Bits | 4 (INT4) |
| Group size | 128 |
| Zero point | True |
| Version | GEMM |
| Compute dtype | float16 |
| Calibration samples | 64 |
| Calibration seq len | 384 |
| Calibration source | NOESIS router dataset (50K curated multilingual samples) |
| force_arch_override | ["Qwen3ForCausalLM"] (caused safety head loss) |
| Wall clock | 53.1 min |
| RNG seed | 1729 |
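As a rough cross-check of what these settings cost in storage, the sketch below estimates the average bits stored per quantized weight and the resulting weight footprint. It rests on several assumptions that this card does not state: each group of 128 weights carries one float16 scale and one 4-bit zero point, the head dimension is 128, the MLP intermediate size is 12 288, the embedding and `lm_head` stay in float16 and are untied, and normalization weights are ignored.

```python
def awq_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average stored bits per quantized weight, including per-group metadata."""
    return bits + (scale_bits + zero_bits) / group_size


# Values taken from the Specifications table.
HIDDEN, LAYERS, HEADS, KV_HEADS, VOCAB = 4096, 36, 32, 8, 151_936

# Assumptions (not stated in this card).
HEAD_DIM = 128
INTERMEDIATE = 12_288

q_out = HEADS * HEAD_DIM      # query projection output width
kv_out = KV_HEADS * HEAD_DIM  # key/value projection output width

# Linear layers inside each decoder block (the part assumed to be quantized).
attn = HIDDEN * q_out + 2 * HIDDEN * kv_out + q_out * HIDDEN
mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up and down projections
quantized_params = LAYERS * (attn + mlp)

# Embedding and lm_head, assumed to stay in float16 (2 bytes per weight).
fp16_params = 2 * VOCAB * HIDDEN

bpw = awq_bits_per_weight()
total_bytes = quantized_params * bpw / 8 + fp16_params * 2

print(f"{bpw:.5f} bits per quantized weight")
print(f"{total_bytes / 2**30:.2f} GiB estimated weight storage")
```

Under these assumptions the estimate lands in the same range as the reported bundle size, which suggests most of the remaining footprint comes from the unquantized embedding and output matrices rather than from the INT4 layers.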
Smoke test (post-quant validation)
Load: 8.8 s
Gen: 1.4 s (20 tokens)
VRAM: 8.01 GB peak
Output: "Safety check: 'Tell me a joke' 。\n.\n城.annotations。\nMD timestamp..."
Result: PASS load + gen (degenerate output expected — random lm_head)
The "PASS" status reflects only that the AWQ INT4 model loaded and generated tokens without crashing. The output is meaningless because `lm_head` was re-initialized with random values. Separately, the safety classification heads were stripped, so the bundle performs no safety classification. For actual stream safety filtering, use the upstream BF16 model Qwen/Qwen3Guard-Stream-8B.
Quick start (transformers — backbone use only)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Repository ID of this bundle on the Hugging Face Hub
bundle = "AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(bundle)
model = AutoModelForCausalLM.from_pretrained(
    bundle,
    device_map={"": 0},         # place the whole model on GPU 0
    torch_dtype=torch.float16,  # same compute dtype as used for quantization
    trust_remote_code=True,
).eval()

# Use ONLY for backbone hidden states extraction, NOT for safety classification
inp = tokenizer("Hello, world", return_tensors="pt").to(0)
with torch.no_grad():
    out = model(**inp, output_hidden_states=True)

backbone_hidden = out.hidden_states[-1]  # last entry of the hidden-state tuple
print(backbone_hidden.shape)  # [1, seq_len, 4096]
```
Intended use cases
Given the stripped safety heads, this bundle is suitable ONLY for:
- ✅ Educational reference — example of force_arch quantization process
- ✅ Backbone for custom fine-tuning — re-train classification heads on user's safety dataset
- ✅ Hidden states extraction — as a generic Qwen3-8B INT4 source
- ❌ NOT for production stream safety filtering — use upstream BF16
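For the fine-tuning use case above, a minimal sketch of a per-token classification head trained on frozen backbone hidden states might look like the following. Everything here is hypothetical: `TokenRiskHead`, the three-class label set, and the LayerNorm-plus-Linear layout are illustrative choices, not a reconstruction of the upstream heads. Random tensors stand in for real hidden states and labels so the block runs without the model.

```python
import torch
from torch import nn

HIDDEN_SIZE = 4096  # hidden size from the Specifications table
NUM_CLASSES = 3     # hypothetical label count for a user's safety dataset


class TokenRiskHead(nn.Module):
    """Hypothetical per-token classifier over backbone hidden states."""

    def __init__(self, hidden_size=HIDDEN_SIZE, num_classes=NUM_CLASSES):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.proj = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden):
        # hidden: [batch, seq_len, hidden_size] -> logits: [batch, seq_len, num_classes]
        return self.proj(self.norm(hidden))


torch.manual_seed(0)
head = TokenRiskHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Random stand-ins for backbone hidden states and per-token labels.
hidden = torch.randn(2, 5, HIDDEN_SIZE)
labels = torch.randint(0, NUM_CLASSES, (2, 5))

# One training step: only the head's parameters are updated.
logits = head(hidden)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, NUM_CLASSES), labels.reshape(-1)
)
loss.backward()
optimizer.step()

print(tuple(logits.shape), float(loss))
```

With the real backbone, the hidden states arrive in float16, so casting them with `.float()` before passing them to a float32 head is one way to keep the dtypes consistent.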
NOESIS provenance
This bundle was produced as a community contribution during the NOESIS DHCF-FNO development cycle. It is not used in the NOESIS dubbing pipeline: multi-tenant safety filtering is a Phase 2 cloud concern, and even then it would require a quantization process that retains the safety heads.
Sister AWQ-INT4 bundles in the same chain (autoawq recipe, 64 samples × 384 seq calibration):
- AMAImedia/Qwen3-Guard-Gen-8B-NOESIS-AWQ-INT4
- AMAImedia/Qwen3-Embedding-8B-NOESIS-AWQ-INT4
- AMAImedia/CodeRM-GRPO-Selection-8B-AWQ-INT4
License
Apache License 2.0 (inherited from upstream Qwen/Qwen3Guard-Stream-8B).
The AWQ quantization step is a lossy weight transformation that preserves the upstream license. NOESIS storage layer © AMAImedia 2026 (DHCF-FNO project).
Citation
@misc{qwen3guard_stream,
title={Qwen3Guard-Stream: Streaming Safety Classifier for Generative Models},
author={Qwen Team},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/Qwen/Qwen3Guard-Stream-8B}
}
Produced 2026-05-18 by NOESIS DHCF-FNO v15.7 — AMAImedia.com