Instructions to use AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4")
model = AutoModelForCausalLM.from_pretrained("AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4

SGLang

How to use AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4 with Docker Model Runner:
```
docker model run hf.co/AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4
```

Qwen3-Guard-Stream-8B — NOESIS AWQ INT4 (backbone-only derivative)

AWQ INT4 quantization of Qwen/Qwen3Guard-Stream-8B. ⚠️ This is a backbone-only derivative: the original 8 safety classifier heads were dropped during quantization because of the force_arch=Qwen3ForCausalLM override, and lm_head.weight was re-initialized with random values. The result is a Qwen3-8B INT4 backbone, NOT a working stream safety filter and not a usable text generator as shipped. Apache 2.0 community contribution from AMAImedia.

⚠️ Critical caveat — safety heads stripped

The upstream Qwen/Qwen3Guard-Stream-8B has 8 streaming safety classification heads on top of the standard Qwen3 backbone. The load report lists what happened to each tensor:

risk_level_category_pre.weight             | UNEXPECTED | <-- dropped
query_risk_level_head.weight               | UNEXPECTED | <-- dropped
risk_level_head.weight                     | UNEXPECTED | <-- dropped
query_risk_level_category_layernorm.weight | UNEXPECTED | <-- dropped
risk_level_category_layernorm.weight       | UNEXPECTED | <-- dropped
query_category_head.weight                 | UNEXPECTED | <-- dropped
query_risk_level_category_pre.weight       | UNEXPECTED | <-- dropped
category_head.weight                       | UNEXPECTED | <-- dropped
lm_head.weight                             | MISSING    | <-- re-initialized

The AWQ runner was forced to load this model as Qwen3ForCausalLM (standard architecture) via force_arch_override. This caused:

All 8 safety heads to be dropped (UNEXPECTED keys in load report)
The lm_head.weight to be re-initialized with random values (MISSING key)
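
The load report above comes down to set arithmetic between the tensor names stored in the checkpoint and the names the forced architecture expects. A minimal sketch in plain Python follows. The head names are copied from the report, while the two backbone names are illustrative placeholders (an assumption) standing in for the full backbone parameter list, and `classify_keys` is a hypothetical helper, not a Transformers API:

```python
# Head tensor names copied from the load report above.
SAFETY_HEAD_KEYS = {
    "risk_level_category_pre.weight",
    "query_risk_level_head.weight",
    "risk_level_head.weight",
    "query_risk_level_category_layernorm.weight",
    "risk_level_category_layernorm.weight",
    "query_category_head.weight",
    "query_risk_level_category_pre.weight",
    "category_head.weight",
}
# Assumption: placeholder names standing in for the full backbone.
BACKBONE_KEYS = {"model.embed_tokens.weight", "model.norm.weight"}


def classify_keys(checkpoint_keys, expected_keys):
    """Split tensor names the way a non-strict loader reports them."""
    checkpoint_keys = set(checkpoint_keys)
    expected_keys = set(expected_keys)
    return {
        # Present on both sides: weights are loaded normally.
        "loaded": sorted(checkpoint_keys & expected_keys),
        # In the checkpoint only: UNEXPECTED, silently dropped.
        "unexpected": sorted(checkpoint_keys - expected_keys),
        # Expected by the model only: MISSING, freshly initialized.
        "missing": sorted(expected_keys - checkpoint_keys),
    }


checkpoint = BACKBONE_KEYS | SAFETY_HEAD_KEYS      # what the upstream file holds
expected = BACKBONE_KEYS | {"lm_head.weight"}      # what Qwen3ForCausalLM wants
report = classify_keys(checkpoint, expected)
print(len(report["unexpected"]), report["missing"])  # 8 ['lm_head.weight']
```

Any tensor that appears only on the checkpoint side is discarded, and any tensor that appears only on the model side starts from random values, which matches the two consequences listed above.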

Implications:

❌ This bundle does NOT perform stream safety classification
✅ The Qwen3 backbone is still validly INT4-quantized
✅ Can be used as a generic Qwen3-8B INT4 base for fine-tuning
⚠️ Output text is degenerate due to random lm_head (smoke test confirmed)

Specifications

| Field | Value |
|---|---|
| Base model | `Qwen/Qwen3Guard-Stream-8B` |
| Architecture | `Qwen3ForCausalLM` (forced; original was Qwen3 + safety heads) |
| Hidden size | 4096 |
| Layers | 36 |
| Attention heads | 32 |
| KV heads | 8 |
| Vocab | 151 936 |
| Context length | 32 768 |
| Format | AWQ INT4 group-128 (GEMM) |
| Bundle size on disk | 5.69 GB (2 shards) |
| Estimated VRAM (inference) | ~5.3 GB, targeting RTX 3060 6 GB (note: the smoke test recorded an 8.01 GB peak) |
| License | Apache 2.0 (inherited from upstream) |
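
The figures in the table are enough to estimate the key/value cache, which sits on top of the weight footprint at inference time. This is a rough sketch, assuming a head dimension of hidden size / attention heads = 128 and an FP16 cache (2 bytes per value). Neither figure is stated in the table, and real runtimes add activation and allocator overhead on top:

```python
# Values taken from the specifications table.
LAYERS = 36
KV_HEADS = 8
HIDDEN_SIZE = 4096
ATTENTION_HEADS = 32
CONTEXT_LENGTH = 32768

# Assumptions, not stated in the table.
HEAD_DIM = HIDDEN_SIZE // ATTENTION_HEADS  # 128
BYTES_PER_VALUE = 2                        # FP16 cache entries


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes of KV cache for one sequence of num_tokens tokens."""
    # Leading factor 2: one key tensor and one value tensor per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * num_tokens


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(CONTEXT_LENGTH)
print(per_token)             # 147456 bytes per token
print(full_context / 2**30)  # 4.5 GiB at the full context length
```

Under these assumptions a full 32 768-token context needs about 4.5 GiB of cache in addition to the weights, so peak memory depends strongly on sequence length.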

Quantization details

| Parameter | Value |
|---|---|
| Library | `autoawq` |
| Tool | `gptqmodel 7.0.0` |
| Method | AWQ (Activation-aware Weight Quantization) |
| Bits | 4 (INT4) |
| Group size | 128 |
| Zero point | True |
| Version | GEMM |
| Compute dtype | float16 |
| Calibration samples | 64 |
| Calibration seq len | 384 |
| Calibration source | NOESIS router dataset (50K curated multilingual samples) |
| force_arch_override | `["Qwen3ForCausalLM"]` (caused safety head loss) |
| Wall clock | 53.1 min |
| RNG seed | 1729 |
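
To make the `Bits`, `Group size`, and `Zero point` rows concrete, here is a minimal sketch of asymmetric 4-bit round-to-nearest quantization over one group of 128 weights. It deliberately leaves out AWQ's activation-aware scaling and the packed GEMM storage layout, and the function names are hypothetical rather than part of `autoawq` or `gptqmodel`:

```python
import random

BITS = 4
GROUP_SIZE = 128
MAX_INT = 2**BITS - 1  # largest 4-bit code: 15


def quantize_group(weights):
    """Quantize one group to 4-bit codes; returns (codes, scale, zero_point)."""
    w_min, w_max = min(weights), max(weights)
    # One scale per group; the floor avoids division by zero on flat groups.
    scale = max((w_max - w_min) / MAX_INT, 1e-8)
    # Integer zero point shifts the range so that w_min maps near code 0.
    zero_point = min(max(round(-w_min / scale), 0), MAX_INT)
    codes = [
        min(max(round(w / scale) + zero_point, 0), MAX_INT) for w in weights
    ]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


# Synthetic weights for illustration only (not real model tensors).
rng = random.Random(1729)
group = [rng.uniform(-0.05, 0.05) for _ in range(GROUP_SIZE)]

codes, scale, zero_point = quantize_group(group)
restored = dequantize_group(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(scale, zero_point, max_error)
```

Each group stores 128 four-bit codes plus one scale and one zero point, so the quantized layers take slightly more than 4 bits per weight, and the reconstruction error within a group stays on the order of half the scale.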

Smoke test (post-quant validation)

Load:    8.8 s
Gen:     1.4 s (20 tokens)
VRAM:    8.01 GB peak
Output:  "Safety check: 'Tell me a joke'  。\n.\n城.annotations。\nMD timestamp..."
Result:  PASS load + gen (degenerate output expected — random lm_head)

The "PASS" status reflects only that the AWQ INT4 model loaded and generated tokens without crashing. The generated text is meaningless because lm_head.weight was re-initialized with random values, and no safety classification is possible because the classification heads were stripped. For actual stream safety filtering, use the upstream BF16 model Qwen/Qwen3Guard-Stream-8B.

Quick start (transformers — backbone use only)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

bundle = "AMAImedia/Qwen3-8B-Guard-Stream-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(bundle)
model = AutoModelForCausalLM.from_pretrained(
    bundle,
    device_map={"": 0},
    torch_dtype=torch.float16,
    trust_remote_code=True,
).eval()

# Use ONLY for backbone hidden states extraction, NOT for safety classification
inp = tokenizer("Hello, world", return_tensors="pt").to(0)
with torch.no_grad():
    out = model(**inp, output_hidden_states=True)
backbone_hidden = out.hidden_states[-1]
print(backbone_hidden.shape)  # [1, seq_len, 4096]

Intended use cases

Given the stripped safety heads, this bundle is suitable ONLY for:

✅ Educational reference — example of force_arch quantization process
✅ Backbone for custom fine-tuning — re-train classification heads on user's safety dataset
✅ Hidden states extraction — as a generic Qwen3-8B INT4 source
❌ NOT for production stream safety filtering — use upstream BF16

NOESIS provenance

This bundle was produced as a community contribution during the NOESIS DHCF-FNO development cycle. It is not used in the NOESIS dubbing pipeline: multi-tenant safety filtering is a Phase 2 cloud concern, and even then it would require a quantization process that retains the safety heads.

Sister AWQ-INT4 bundles in the same chain (autoawq recipe, 64 samples × 384 seq calibration):

License

Apache License 2.0 (inherited from upstream Qwen/Qwen3Guard-Stream-8B).

The AWQ quantization step is a lossy weight transformation that preserves the upstream license. NOESIS storage layer © AMAImedia 2026 (DHCF-FNO project).

Citation

@misc{qwen3guard_stream,
  title={Qwen3Guard-Stream: Streaming Safety Classifier for Generative Models},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Qwen/Qwen3Guard-Stream-8B}
}

Produced 2026-05-18 by NOESIS DHCF-FNO v15.7 — AMAImedia.com
