Qwen3-Embedding-8B — NOESIS AWQ INT4 (backbone-only derivative)
AWQ INT4 quantization of Qwen/Qwen3-Embedding-8B. ⚠️ This is a backbone-only derivative: lm_head.weight was MISSING from the source (embedding models pool hidden states rather than predicting tokens), so the AWQ tooling re-initialized it with random weights. For embedding inference you still need a separate pooling step. Apache 2.0 community contribution from AMAImedia.
⚠️ Critical caveat — embedding head required
The upstream Qwen/Qwen3-Embedding-8B produces text embeddings by pooling the last hidden state, NOT via lm_head (which does not exist in the source; see the load report below).
lm_head.weight | MISSING | <-- re-initialized by AWQ tooling
The AWQ runner expected a Qwen3ForCausalLM-style architecture which requires lm_head, found nothing, and initialized it with random weights. Generation through this lm_head produces gibberish (confirmed by smoke test).
For embedding inference, ignore lm_head entirely and pool last_hidden_state instead. The example below uses mean pooling over non-padded tokens:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AMAImedia/Qwen3-Embedding-8B-NOESIS-AWQ-INT4",
    device_map={"": 0},
    torch_dtype=torch.float16,
).eval()
tokenizer = AutoTokenizer.from_pretrained("AMAImedia/Qwen3-Embedding-8B-NOESIS-AWQ-INT4")

inp = tokenizer(["Hello world"], return_tensors="pt", padding=True).to(0)
with torch.no_grad():
    out = model(**inp, output_hidden_states=True)

# Mean pooling over non-padded tokens
mask = inp.attention_mask.unsqueeze(-1).float()
hidden = out.hidden_states[-1]  # [B, T, 4096]
embedding = (hidden * mask).sum(1) / mask.sum(1)
print(embedding.shape)  # [1, 4096]

# Normalize for cosine similarity
embedding = F.normalize(embedding, p=2, dim=-1)
```
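As far as the upstream Qwen/Qwen3-Embedding-8B model card documents it, upstream pools the hidden state of the final non-padding token rather than averaging over tokens. If embeddings from this bundle need to be comparable with upstream, last-token pooling may therefore be the closer match. The sketch below illustrates the idea on dummy tensors only. `last_token_pool` is a helper name chosen here, not part of this bundle, and it assumes `attention_mask` uses 1 for real tokens and 0 for padding.

```python
import torch
import torch.nn.functional as F


def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Return the hidden state of the last non-padding token of each sequence.

    hidden:          [B, T, H] last-layer hidden states
    attention_mask:  [B, T], 1 for real tokens and 0 for padding
    """
    seq_len = attention_mask.shape[1]
    positions = torch.arange(seq_len, device=attention_mask.device)
    # Largest position whose mask is 1; works for left and right padding.
    last_idx = (attention_mask.long() * positions).argmax(dim=1)
    batch_idx = torch.arange(hidden.shape[0], device=hidden.device)
    return hidden[batch_idx, last_idx]


# Dummy batch: 2 sequences, 4 positions, hidden size 3 (the real model uses 4096).
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([
    [1, 1, 1, 0],  # right-padded, last real token at index 2
    [0, 1, 1, 1],  # left-padded, last real token at index 3
])

pooled = last_token_pool(hidden, mask)
embedding = F.normalize(pooled, p=2, dim=-1)  # unit-length rows for cosine similarity
print(pooled)
```

With the real model, `hidden` would be `out.hidden_states[-1]` and `mask` would be `inp.attention_mask` from the example above. Which pooling works better for this quantized bundle is something to measure rather than assume.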
Implications:
- ❌ Do NOT call .generate(): output is gibberish (random lm_head)
- ✅ The Qwen3 backbone is validly INT4-quantized
- ✅ Pooled embeddings should work at near-upstream quality
- ⚠️ For production retrieval, validate the AWQ embeddings against a BF16 baseline
Specifications
| Field | Value |
|---|---|
| Base model | Qwen/Qwen3-Embedding-8B |
| Architecture | Qwen3ForCausalLM (forced; original was a Qwen3 backbone with pooling and no lm_head) |
| Hidden size | 4096 |
| Layers | 36 |
| Attention heads | 32 |
| KV heads | 8 |
| Vocab | 151 936 |
| Embedding dim | 4096 (matches hidden_size; mean-pooled) |
| Context length | 32 768 |
| Format | AWQ INT4 group-128 (GEMM) |
| Bundle size on disk | 5.69 GB (2 shards) |
| Estimated VRAM (inference) | ~5.3 GB (estimate, targeting RTX 3060 6 GB; the smoke test below recorded an 8.00 GB peak) |
| License | Apache 2.0 (inherited from upstream) |
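The bundle size in the table can be roughly cross-checked with back-of-the-envelope arithmetic. The sketch below takes the layer counts and sizes from the table. It assumes `head_dim = 128` and `intermediate_size = 12288` (the Qwen3-8B values, not stated in this card), float16 scales with 4-bit zero points per group of 128, and float16 storage for `embed_tokens` and `lm_head`. Norm weights are ignored as negligible.

```python
# Values from the Specifications table.
HIDDEN = 4096
LAYERS = 36
Q_HEADS = 32
KV_HEADS = 8
VOCAB = 151_936
GROUP_SIZE = 128

# ASSUMPTIONS (not stated in this card): Qwen3-8B style head_dim and MLP width.
HEAD_DIM = 128
INTERMEDIATE = 12_288

# Parameters in the seven quantized projections of one transformer layer.
per_layer = (
    HIDDEN * Q_HEADS * HEAD_DIM           # q_proj
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM    # k_proj, v_proj
    + Q_HEADS * HEAD_DIM * HIDDEN         # o_proj
    + 3 * HIDDEN * INTERMEDIATE           # gate_proj, up_proj, down_proj
)
quantized_params = per_layer * LAYERS

# 4-bit weight plus an amortized float16 scale and 4-bit zero point per group (assumed).
bits_per_weight = 4 + (16 + 4) / GROUP_SIZE
quantized_bytes = quantized_params * bits_per_weight / 8

# embed_tokens and lm_head assumed to be stored in float16 (2 bytes per parameter).
fp16_bytes = 2 * VOCAB * HIDDEN * 2

total_bytes = quantized_bytes + fp16_bytes
print(f"quantized linear params: {quantized_params / 1e9:.2f} B")
print(f"estimated weights: {total_bytes / 1e9:.2f} GB ({total_bytes / 2**30:.2f} GiB)")
```

Under these assumptions the estimate lands close to the reported 5.69 GB, with roughly 2.5 GB of it coming from the unquantized `embed_tokens` and the randomly initialized `lm_head`.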
Quantization details
| Parameter | Value |
|---|---|
| Library | autoawq |
| Tool | gptqmodel 7.0.0 |
| Method | AWQ (Activation-aware Weight Quantization) |
| Bits | 4 (INT4) |
| Group size | 128 |
| Zero point | True |
| Version | GEMM |
| Compute dtype | float16 |
| Calibration samples | 64 |
| Calibration seq len | 384 |
| Calibration source | NOESIS router dataset (50K curated multilingual samples) |
| Wall clock | 62.6 min |
| RNG seed | 1729 |
Quantized layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
NOT quantized: lm_head (random — see caveat), embed_tokens, all *norm layers.
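To make "4 bits, group size 128, zero point True" concrete, here is a minimal sketch of asymmetric round-to-nearest group quantization in plain Python. It is not the AWQ algorithm itself: AWQ additionally rescales weights using activation statistics before a step like this, and real kernels pack the integer codes. The function names are illustrative and do not come from autoawq or gptqmodel.

```python
import math


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = 2**bits - 1  # 15 for INT4
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for a constant group
    zero_point = max(0, min(qmax, round(-lo / scale)))
    codes = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def fake_quantize_row(row, group_size=128, bits=4):
    """Quantize then dequantize a row, one group of `group_size` weights at a time."""
    restored = []
    for start in range(0, len(row), group_size):
        codes, scale, zp = quantize_group(row[start:start + group_size], bits)
        restored.extend(dequantize_group(codes, scale, zp))
    return restored


# Toy weight row: 256 values, i.e. two groups of 128, each with its own scale and zero point.
row = [0.05 * math.sin(0.1 * i) for i in range(256)]
restored = fake_quantize_row(row)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"max abs error: {max_error:.5f}")
```

Each group stores one scale and one zero point, so smaller groups track the local weight range more closely at the cost of more metadata per weight.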
Smoke test (post-quant validation)
Load: 11.0 s
Gen: 2.1 s (20 tokens, via random lm_head)
VRAM: 8.00 GB peak
Output: 'Encode this for retrieval: NOESIS multilingual dubbing (" ("(" ...'
Result: PASS load (degenerate gen expected — lm_head random)
The "PASS" status reflects only that the AWQ INT4 model loaded successfully. The generated text is gibberish because lm_head is randomly initialized. This model is meant for embedding inference from pooled hidden states; see the example under the critical caveat above.
Use cases
- ✅ Semantic search / retrieval — text → 4096-dim embedding via mean pooling
- ✅ Sentence similarity — cosine distance between normalized embeddings
- ✅ Clustering — group documents by embedding similarity
- ❌ Text generation — DO NOT use, lm_head is random
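The retrieval and similarity use cases reduce to ranking candidates by cosine similarity between embeddings. The sketch below shows that ranking step in plain Python. The 3-dimensional vectors and document names are made-up stand-ins for real 4096-dimensional embeddings, and the helper names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(query, docs):
    """Cosine similarity of the query against each document vector."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(d))) for d in docs]


# Toy vectors standing in for model embeddings (hypothetical values).
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.2],
    "doc_c": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

scores = cosine_scores(query, list(docs.values()))
ranking = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For larger collections the same computation is usually done as a single matrix product over pre-normalized embeddings.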
Quality validation (recommended before production)
Compare embedding quality against upstream BF16 on your dataset:
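```python
# Pseudo-eval: encode_with_bf16 and encode_with_awq are placeholders for your own
# encoding functions (upstream BF16 model and this bundle); each returns a 1-D vector.
from sklearn.metrics.pairwise import cosine_similarity

upstream_emb = encode_with_bf16("query")
awq_emb = encode_with_awq("query")  # this bundle
sim = cosine_similarity([upstream_emb], [awq_emb])[0, 0]
# AWQ INT4 typically retains 95-98% embedding fidelity (not measured for this bundle in this card)
```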
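A single query says little about retrieval quality, so it may help to compare both models over a small evaluation set. The sketch below reports two numbers: the mean cosine similarity between paired baseline and quantized document embeddings, and how often the two models agree on the top-k documents per query. All vectors are made-up toy data, and `fidelity_report` is a hypothetical helper, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query, docs, k):
    """Indices of the k documents most similar to the query."""
    order = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return order[:k]


def fidelity_report(base_queries, base_docs, quant_queries, quant_docs, k=2):
    """Compare baseline and quantized embeddings of the same texts."""
    paired = [cosine(b, q) for b, q in zip(base_docs, quant_docs)]
    overlaps = []
    for bq, qq in zip(base_queries, quant_queries):
        base_top = set(top_k(bq, base_docs, k))
        quant_top = set(top_k(qq, quant_docs, k))
        overlaps.append(len(base_top & quant_top) / k)
    return {
        "mean_paired_cosine": sum(paired) / len(paired),
        "mean_top_k_overlap": sum(overlaps) / len(overlaps),
    }


# Toy data: the "quantized" vectors are the baseline vectors with small perturbations.
base_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.7, 0.7, 0.0]]
quant_docs = [[0.98, 0.03, -0.01], [0.02, 1.01, 0.0], [-0.01, 0.02, 0.97], [0.72, 0.69, 0.01]]
base_queries = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
quant_queries = [[0.88, 0.12, 0.01], [0.01, 0.19, 0.91]]

report = fidelity_report(base_queries, base_docs, quant_queries, quant_docs, k=2)
print(report)
```

Top-k overlap is often the more telling number for retrieval, since small shifts in cosine similarity only matter when they change the ranking.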
NOESIS provenance
This bundle was produced as a community contribution during the NOESIS DHCF-FNO development cycle. It is not used directly in the NOESIS dubbing pipeline; NOESIS instead uses BidirLM-Omni-2.5B-NF4 (2.62 GB, cross-modal text+image+audio), which is more compact and supports multi-modal queries.
Sister AWQ-INT4 bundles in the same chain (autoawq recipe, 64 samples × 384 seq calibration):
- AMAImedia/Qwen3Guard-Gen-8B-NOESIS-AWQ-INT4
- AMAImedia/Qwen3Guard-Stream-8B-NOESIS-AWQ-INT4
- AMAImedia/CodeRM-GRPO-Selection-8B-AWQ-INT4
License
Apache License 2.0 (inherited from upstream Qwen/Qwen3-Embedding-8B).
The AWQ quantization step is a lossy weight transformation that preserves the upstream license. NOESIS storage layer © AMAImedia 2026 (DHCF-FNO project).
Citation
@article{qwen3embedding,
title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
journal={arXiv preprint arXiv:2506.05176},
year={2025}
}
Produced 2026-05-18 by NOESIS DHCF-FNO v15.7 — AMAImedia.com