Qwen3-Guard-Stream-8B — NOESIS AWQ INT4 (backbone-only derivative)

AWQ INT4 quantization of Qwen/Qwen3Guard-Stream-8B. ⚠️ This is a backbone-only derivative: the original 8 safety classifier heads were dropped during quantization because a force_arch_override setting forced the model to load as Qwen3ForCausalLM. The result is a generic Qwen3-8B INT4 backbone with a randomly re-initialized lm_head, NOT a working stream safety filter. Apache 2.0 community contribution from AMAImedia.

⚠️ Critical caveat — safety heads stripped

The upstream Qwen/Qwen3Guard-Stream-8B has 8 streaming safety classification heads on top of the standard Qwen3 backbone. The load report for the forced architecture lists their tensors as UNEXPECTED:

risk_level_category_pre.weight             | UNEXPECTED | <-- dropped
query_risk_level_head.weight               | UNEXPECTED | <-- dropped
risk_level_head.weight                     | UNEXPECTED | <-- dropped
query_risk_level_category_layernorm.weight | UNEXPECTED | <-- dropped
risk_level_category_layernorm.weight       | UNEXPECTED | <-- dropped
query_category_head.weight                 | UNEXPECTED | <-- dropped
query_risk_level_category_pre.weight       | UNEXPECTED | <-- dropped
category_head.weight                       | UNEXPECTED | <-- dropped
lm_head.weight                             | MISSING    | <-- re-initialized

The AWQ runner was forced to load this model as Qwen3ForCausalLM (standard architecture) via force_arch_override. This caused:

  • All 8 safety heads to be dropped (UNEXPECTED keys in load report)
  • The lm_head.weight to be re-initialized with random values (MISSING key)
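
The UNEXPECTED and MISSING labels in the report above follow from a plain set comparison between the tensor names stored in the checkpoint and the names the forced architecture expects. The sketch below reproduces that comparison in pure Python. `classify_keys` is a hypothetical helper, and the key lists are abbreviated to the names quoted in this card plus one stand-in backbone tensor, so it illustrates the mechanism rather than the loader's actual implementation.

```python
def classify_keys(checkpoint_keys, model_keys):
    """Split tensor names into those the model ignores and those it lacks."""
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    # Present in the checkpoint but with no slot in the model: dropped on load.
    unexpected = sorted(checkpoint_keys - model_keys)
    # Expected by the model but absent from the checkpoint: re-initialized.
    missing = sorted(model_keys - checkpoint_keys)
    return unexpected, missing


# Head tensor names copied from the load report in this card.
SAFETY_HEAD_KEYS = [
    "risk_level_category_pre.weight",
    "query_risk_level_head.weight",
    "risk_level_head.weight",
    "query_risk_level_category_layernorm.weight",
    "risk_level_category_layernorm.weight",
    "query_category_head.weight",
    "query_risk_level_category_pre.weight",
    "category_head.weight",
]

# "model.norm.weight" is a stand-in for the shared backbone tensors (assumption).
checkpoint = SAFETY_HEAD_KEYS + ["model.norm.weight"]
causal_lm = ["model.norm.weight", "lm_head.weight"]

unexpected, missing = classify_keys(checkpoint, causal_lm)
print(len(unexpected), "unexpected keys; missing:", missing)
```

Any tensor that lands in `unexpected` is silently discarded, which is why a load that "succeeds" can still lose every safety head.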

Implications:

  • ❌ This bundle does NOT perform stream safety classification
  • ✅ The Qwen3 backbone is still validly INT4-quantized
  • ✅ Can be used as a generic Qwen3-8B INT4 base for fine-tuning (the lm_head is random and needs training before any text generation)
  • ⚠️ Output text is degenerate due to random lm_head (smoke test confirmed)

Specifications

Field                      | Value
Base model                 | Qwen/Qwen3Guard-Stream-8B
Architecture               | Qwen3ForCausalLM (forced; original was Qwen3 + safety heads)
Hidden size                | 4096
Layers                     | 36
Attention heads            | 32
KV heads                   | 8
Vocab                      | 151 936
Context length             | 32 768
Format                     | AWQ INT4 group-128 (GEMM)
Bundle size on disk        | 5.69 GB (2 shards)
Estimated VRAM (inference) | ~5.3 GB (estimate, quoted as fitting an RTX 3060 6 GB; the smoke test below recorded an 8.01 GB peak)
License                    | Apache 2.0 (inherited from upstream)
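
The quantized weights are only part of the inference footprint, since the KV cache grows linearly with context length. The following back-of-the-envelope sketch derives the cache size from the values in the table above. It assumes an FP16 cache and a head dimension of hidden size / attention heads = 128; neither is stated in this card, so treat the result as an estimate.

```python
LAYERS = 36
KV_HEADS = 8
HIDDEN_SIZE = 4096
ATTENTION_HEADS = 32
HEAD_DIM = HIDDEN_SIZE // ATTENTION_HEADS  # assumption: 128, not stated in the card
BYTES_PER_VALUE = 2                        # assumption: FP16 cache entries


def kv_cache_bytes(tokens, batch=1):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens * batch


for tokens in (1, 2048, 32768):
    print(f"{tokens:>6} tokens: {kv_cache_bytes(tokens) / 2**20:.2f} MiB")
```

Under these assumptions each token costs 144 KiB, so a full 32 768-token context needs about 4.5 GiB of cache on top of the weights.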

Quantization details

Parameter            | Value
Library              | autoawq
Tool                 | gptqmodel 7.0.0
Method               | AWQ (Activation-aware Weight Quantization)
Bits                 | 4 (INT4)
Group size           | 128
Zero point           | True
Version              | GEMM
Compute dtype        | float16
Calibration samples  | 64
Calibration seq len  | 384
Calibration source   | NOESIS router dataset (50K curated multilingual samples)
force_arch_override  | ["Qwen3ForCausalLM"] (caused safety head loss)
Wall clock           | 53.1 min
RNG seed             | 1729
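
To make the "Bits 4", "Group size 128", and "Zero point True" settings concrete, here is a minimal pure-Python sketch of asymmetric 4-bit round-to-nearest quantization applied to one 128-weight group. It deliberately omits AWQ's activation-aware scaling step and the packed GEMM storage layout, so it illustrates the numeric format only, not the autoawq or gptqmodel implementation. The function names are hypothetical.

```python
import math

BITS = 4
GROUP_SIZE = 128
QMAX = (1 << BITS) - 1  # largest 4-bit code: 15


def quantize_group(weights):
    """Map one group of floats to integer codes in [0, QMAX]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / QMAX
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for a constant group
    zero_point = round(-lo / scale)  # integer code that represents 0.0
    codes = [min(QMAX, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Recover approximate floats from the codes and per-group metadata."""
    return [(c - zero_point) * scale for c in codes]


# Synthetic weights stand in for one group of a real weight row.
group = [math.sin(i) for i in range(GROUP_SIZE)]
codes, scale, zero_point = quantize_group(group)
restored = dequantize_group(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.4f} zero_point={zero_point} max_error={max_error:.4f}")
```

Each group stores its own scale and zero point, which is why a smaller group size improves accuracy at the cost of more metadata.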

Smoke test (post-quant validation)

Load:    8.8 s
Gen:     1.4 s (20 tokens)
VRAM:    8.01 GB peak
Output:  "Safety check: 'Tell me a joke'  。\n.\n城.annotations。\nMD timestamp..."
Result:  PASS load + gen (degenerate output expected — random lm_head)

The "PASS" status reflects only that the AWQ INT4 model loaded and generated tokens without crashing. The generated text is meaningless because lm_head was re-initialized with random values, and no safety classification is possible because the heads were stripped. For actual stream safety filtering, use the upstream BF16 model Qwen/Qwen3Guard-Stream-8B.

Quick start (transformers — backbone use only)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

bundle = "AMAImedia/Qwen3-Guard-Stream-8B-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(bundle)
model = AutoModelForCausalLM.from_pretrained(
    bundle,
    device_map={"": 0},
    torch_dtype=torch.float16,
    trust_remote_code=True,
).eval()

# Use ONLY for backbone hidden states extraction, NOT for safety classification
inp = tokenizer("Hello, world", return_tensors="pt").to(0)
with torch.no_grad():
    out = model(**inp, output_hidden_states=True)
backbone_hidden = out.hidden_states[-1]
print(backbone_hidden.shape)  # [1, seq_len, 4096]

Intended use cases

Given the stripped safety heads, this bundle is suitable ONLY for:

  • Educational reference: an example of what a force_arch quantization run does to a model with extra heads
  • Backbone for custom fine-tuning: re-train classification heads on your own safety dataset
  • Hidden states extraction: as a generic Qwen3-8B INT4 source

❌ It is NOT suitable for production stream safety filtering; use the upstream BF16 model for that.
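
For the fine-tuning use case, a replacement head can be trained on top of the frozen backbone's final hidden states (shape [1, seq_len, 4096] in the quick start). The sketch below is a hypothetical per-token classifier in PyTorch: the class name `TokenRiskHead`, the LayerNorm-plus-Linear layout, and the three-class output are all assumptions for illustration, not the upstream Qwen3Guard head design. Random tensors stand in for backbone outputs so the snippet runs without downloading the bundle.

```python
import torch
from torch import nn

HIDDEN_SIZE = 4096  # from the specifications table
NUM_CLASSES = 3     # assumption: illustrative label count


class TokenRiskHead(nn.Module):
    """Hypothetical per-token classifier over backbone hidden states."""

    def __init__(self, hidden_size=HIDDEN_SIZE, num_classes=NUM_CLASSES):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states):
        # [batch, seq_len, hidden] -> [batch, seq_len, num_classes]
        # Cast to float32 so FP16 backbone outputs match the head's parameters.
        return self.classifier(self.norm(hidden_states.float()))


head = TokenRiskHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

hidden = torch.randn(1, 6, HIDDEN_SIZE)         # placeholder for backbone_hidden
labels = torch.randint(0, NUM_CLASSES, (1, 6))  # placeholder per-token labels

# One training step: only the head's parameters receive gradients.
logits = head(hidden)
loss = nn.functional.cross_entropy(logits.view(-1, NUM_CLASSES), labels.view(-1))
loss.backward()
optimizer.step()
print(logits.shape, float(loss))
```

In real use, `hidden` would come from `out.hidden_states[-1]` computed under `torch.no_grad()`, which keeps the INT4 backbone frozen.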

NOESIS provenance

This bundle was produced as a community contribution during the NOESIS DHCF-FNO development cycle. It is not used in the NOESIS dubbing pipeline: multi-tenant safety filtering is a Phase 2 cloud concern, and even then it would require a quantization process that retains the safety heads.

Sister AWQ-INT4 bundles in the same chain (autoawq recipe, 64 samples × 384 seq calibration):

License

Apache License 2.0 (inherited from upstream Qwen/Qwen3Guard-Stream-8B).

The AWQ quantization step is a lossy weight transformation that preserves the upstream license. NOESIS storage layer © AMAImedia 2026 (DHCF-FNO project).

Citation

@misc{qwen3guard_stream,
  title={Qwen3Guard-Stream: Streaming Safety Classifier for Generative Models},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Qwen/Qwen3Guard-Stream-8B}
}

Produced 2026-05-18 by NOESIS DHCF-FNO v15.7 — AMAImedia.com
