Qwen3-Embedding-8B — NOESIS AWQ INT4 (backbone-only derivative)

AWQ INT4 quantization of Qwen/Qwen3-Embedding-8B. ⚠️ This is a backbone-only derivative: lm_head.weight was MISSING from the source checkpoint (embedding models pool hidden states instead of doing language modeling), so the AWQ tooling re-initialized it with random weights. For embedding inference you must pool the hidden states yourself; the bundle does not include a pooling step. Apache 2.0 community contribution from AMAImedia.

⚠️ Critical caveat — embedding head required

The upstream Qwen/Qwen3-Embedding-8B produces text embeddings by pooling the last hidden state, NOT via the lm_head (which doesn't exist in the source; see load report below).

lm_head.weight | MISSING | <-- re-initialized by AWQ tooling

The AWQ runner expected a Qwen3ForCausalLM-style architecture which requires lm_head, found nothing, and initialized it with random weights. Generation through this lm_head produces gibberish (confirmed by smoke test).
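One way to confirm a missing tensor before loading a model is to inspect the checkpoint's weight index. The sketch below assumes the standard Hugging Face sharded layout, where `model.safetensors.index.json` contains a `weight_map` from tensor names to shard files. `missing_tensors` is a hypothetical helper name, and the toy index (including its tensor names) stands in for a real file.

```python
import json

def missing_tensors(weight_map, required=("lm_head.weight",)):
    """Return the required tensor names that are absent from a checkpoint index."""
    return [name for name in required if name not in weight_map]

# Toy stand-in for model.safetensors.index.json (tensor names are illustrative).
# With a real checkpoint you would read the file instead:
#   with open("model.safetensors.index.json") as f:
#       weight_map = json.load(f)["weight_map"]
index_json = json.dumps({
    "weight_map": {
        "embed_tokens.weight": "model-00001-of-00002.safetensors",
        "layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "norm.weight": "model-00002-of-00002.safetensors",
    }
})
weight_map = json.loads(index_json)["weight_map"]

absent = missing_tensors(weight_map)
print(absent)  # ['lm_head.weight']
```

An empty result would mean every required tensor is present in the index.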

For embedding inference, ignore lm_head entirely — use mean pooling over last_hidden_state:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

# Loading an AWQ checkpoint through transformers needs an AWQ backend
# (e.g. autoawq) installed alongside it.
model_id = "AMAImedia/Qwen3-Embedding-8B-NOESIS-AWQ-INT4"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map={"": 0}, torch_dtype=torch.float16,
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

inp = tokenizer(["Hello world"], return_tensors="pt", padding=True).to(0)
with torch.no_grad():
    out = model(**inp, output_hidden_states=True)

# Mean pooling over non-padded tokens (the float32 mask also promotes the sum to float32)
mask = inp.attention_mask.unsqueeze(-1).float()
hidden = out.hidden_states[-1]  # [B, T, 4096]
embedding = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)  # guard against empty rows
print(embedding.shape)  # [1, 4096]

# Normalize for cosine similarity
embedding = F.normalize(embedding, p=2, dim=-1)
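The upstream Qwen/Qwen3-Embedding-8B model card, to the best of our knowledge, pools the hidden state of the last non-padded token rather than averaging over tokens, so mean-pooled vectors may not be directly comparable with upstream ones. The following is a minimal sketch of last-token pooling on toy tensors. The function name and the toy data are illustrative, and it is worth comparing both pooling variants during validation.

```python
import torch
import torch.nn.functional as F

def last_token_pool(hidden, attention_mask):
    """Pick the hidden state of the last non-padded token in each sequence."""
    # If every sequence ends with a real token, the batch is left-padded
    # (or unpadded) and the final position is the last token for all rows.
    if bool(attention_mask[:, -1].all()):
        return hidden[:, -1]
    # Right padding: index each row at (number of real tokens - 1).
    last_idx = attention_mask.sum(dim=1) - 1
    rows = torch.arange(hidden.shape[0], device=hidden.device)
    return hidden[rows, last_idx]

# Toy batch: 2 sequences, 3 positions, hidden size 2; row 1 ends with one pad token.
hidden = torch.tensor([[[1.0, 0.0], [2.0, 0.0], [3.0, 4.0]],
                       [[0.0, 1.0], [0.0, 5.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 1],
                     [1, 1, 0]])

pooled = F.normalize(last_token_pool(hidden, mask), p=2, dim=-1)
print(pooled)
```

With the model above, `hidden` would be `out.hidden_states[-1]` and `mask` would be `inp.attention_mask`.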

Implications:

❌ Do NOT call .generate() — output is gibberish (random lm_head)
✅ The Qwen3 backbone is validly INT4-quantized
✅ Mean-pooled embeddings should be close to upstream quality
⚠️ For production retrieval, validate AWQ embeddings vs BF16 baseline

Specifications

Field	Value
Base model	`Qwen/Qwen3-Embedding-8B`
Architecture	`Qwen3ForCausalLM` (forced; original was Qwen3 backbone + pooling, no lm_head)
Hidden size	4096
Layers	36
Attention heads	32
KV heads	8
Vocab	151 936
Embedding dim	4096 (matches hidden_size; mean-pooled)
Context length	32 768
Format	AWQ INT4 group-128 (GEMM)
Bundle size on disk	5.69 GB (2 shards)
Estimated VRAM (inference)	~5.3 GB (estimate, targeting RTX 3060 6 GB; note the smoke test below recorded 8.00 GB peak)
License	Apache 2.0 (inherited from upstream)
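The bundle size can be roughly cross-checked from the figures in this table. The sketch below is a back-of-the-envelope estimate, not a measurement. `head_dim = 128` and `intermediate_size = 12288` are assumptions taken from the upstream Qwen3-8B configuration, not values stated in this card, and the small norm weights are ignored.

```python
# Values from the Specifications table.
hidden_size = 4096
num_layers = 36
num_heads = 32
num_kv_heads = 8
vocab_size = 151_936

# Assumptions not stated in this card (upstream Qwen3-8B config values).
head_dim = 128
intermediate_size = 12_288

# Quantized linear weights per transformer layer.
attn = hidden_size * head_dim * (2 * num_heads + 2 * num_kv_heads)  # q, o, k, v
mlp = 3 * hidden_size * intermediate_size                           # gate, up, down
quantized_params = num_layers * (attn + mlp)

# INT4 with group size 128: 4 bits per weight plus, per group of 128 weights,
# one FP16 scale and one 4-bit zero point.
bits_per_weight = 4 + (16 + 4) / 128
quantized_bytes = quantized_params * bits_per_weight / 8

# embed_tokens and lm_head are not quantized: 2 bytes per weight each.
unquantized_bytes = 2 * vocab_size * hidden_size * 2

total_gib = (quantized_bytes + unquantized_bytes) / 2**30
print(f"{quantized_params / 1e9:.2f}B quantized weights, ~{total_gib:.2f} GiB total")
```

Under these assumptions the estimate comes out near 5.7 GiB, which is in the same range as the listed bundle size if that figure is in binary gigabytes. Activations and the KV cache add to this at inference time.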

Quantization details

Parameter	Value
Library	`autoawq`
Tool	`gptqmodel 7.0.0`
Method	AWQ (Activation-aware Weight Quantization)
Bits	4 (INT4)
Group size	128
Zero point	True
Version	GEMM
Compute dtype	float16
Calibration samples	64
Calibration seq len	384
Calibration source	NOESIS router dataset (50K curated multilingual samples)
Wall clock	62.6 min
RNG seed	1729

Quantized layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. NOT quantized: lm_head (random — see caveat), embed_tokens, all *norm layers.
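To make the "4 bits, group size 128, zero point" settings concrete, here is a simplified sketch of asymmetric per-group quantization in plain Python. It shows only the round-to-nearest step with a scale and zero point per group. AWQ additionally rescales weight channels using activation statistics from the calibration data, which is not shown, and the weights here are synthetic.

```python
import random

def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = 2**bits - 1                      # 15 for INT4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0   # avoid a zero scale for constant groups
    zero = max(0, min(qmax, round(-w_min / scale)))  # integer zero point
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Map 4-bit codes back to approximate float weights."""
    return [(v - zero) * scale for v in q]

random.seed(1729)  # synthetic data; only the seed value is borrowed from the table
row = [random.gauss(0.0, 0.02) for _ in range(256)]
group_size = 128

max_err_steps = 0.0
all_codes = []
for start in range(0, len(row), group_size):
    group = row[start:start + group_size]
    q, scale, zero = quantize_group(group)
    restored = dequantize_group(q, scale, zero)
    err = max(abs(a - b) for a, b in zip(group, restored))
    max_err_steps = max(max_err_steps, err / scale)
    all_codes.extend(q)

print(f"worst-case error: {max_err_steps:.3f} quantization steps")
```

Each group of 128 weights gets its own scale and zero point, so one outlier only coarsens the grid for its own group, not for the whole row.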

Smoke test (post-quant validation)

Load:    11.0 s
Gen:     2.1 s (20 tokens, via random lm_head)
VRAM:    8.00 GB peak
Output:  'Encode this for retrieval: NOESIS multilingual dubbing (" ("(" ...'
Result:  PASS load (degenerate gen expected — lm_head random)

The "PASS" status reflects only that the AWQ INT4 model loaded successfully. The generated text is gibberish because lm_head is randomly initialized. Embedding inference (mean-pooled hidden states) is what this model is for; see the mean-pooling example in the critical caveat section above.
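Because the load-only check says nothing about the embeddings themselves, an embedding-level sanity check may be a useful complement. The sketch below is one possible version. `check_embeddings` is a hypothetical helper, and the random tensor stands in for real model output.

```python
import torch
import torch.nn.functional as F

def check_embeddings(emb, expected_dim=4096, normalized=True):
    """Raise ValueError if a batch of embeddings looks degenerate."""
    if emb.ndim != 2 or emb.shape[1] != expected_dim:
        raise ValueError(f"unexpected shape {tuple(emb.shape)}")
    if not bool(torch.isfinite(emb).all()):
        raise ValueError("NaN or Inf in embeddings")
    norms = emb.float().norm(dim=-1)
    if bool((norms == 0).any()):
        raise ValueError("zero-vector embedding")
    if normalized and not torch.allclose(norms, torch.ones_like(norms), atol=1e-3):
        raise ValueError("embeddings are not unit-norm")
    return True

# Synthetic stand-in for a batch of three normalized 4096-dim embeddings.
torch.manual_seed(0)
fake = F.normalize(torch.randn(3, 4096), p=2, dim=-1)
print(check_embeddings(fake))  # True
```

This catches shape mismatches, NaN/Inf values and missing normalization. It does not measure retrieval quality.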

Use cases

✅ Semantic search / retrieval — text → 4096-dim embedding via mean pooling
✅ Sentence similarity — cosine distance between normalized embeddings
✅ Clustering — group documents by embedding similarity
❌ Text generation — DO NOT use, lm_head is random
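For the retrieval and similarity use cases, ranking reduces to a dot product between normalized embeddings. The sketch below uses tiny hand-written vectors in place of real 4096-dim embeddings, and `rank_documents` is an illustrative helper name.

```python
import torch
import torch.nn.functional as F

def rank_documents(query_emb, doc_embs, top_k=2):
    """Return (indices, scores) of the top_k documents by cosine similarity."""
    q = F.normalize(query_emb, p=2, dim=-1)
    d = F.normalize(doc_embs, p=2, dim=-1)
    scores = d @ q                                   # [num_docs]
    top = torch.topk(scores, k=min(top_k, d.shape[0]))
    return top.indices.tolist(), top.values.tolist()

query = torch.tensor([1.0, 0.0, 0.0])
docs = torch.tensor([[0.0, 1.0, 0.0],    # orthogonal to the query
                     [2.0, 0.0, 0.0],    # same direction as the query
                     [1.0, 1.0, 0.0]])   # 45 degrees from the query

indices, scores = rank_documents(query, docs)
print(indices)  # [1, 2]
```

The same similarity scores can feed a clustering algorithm of your choice.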

Quality validation (recommended before production)

Compare embedding quality against upstream BF16 on your dataset:

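# Pseudo-eval: encode_with_bf16 and encode_with_awq are placeholders you must
# implement; each should return a 1-D embedding for the given text.
from sklearn.metrics.pairwise import cosine_similarity

upstream_emb = encode_with_bf16("query")  # upstream BF16 model
awq_emb = encode_with_awq("query")        # this bundle
sim = cosine_similarity([upstream_emb], [awq_emb])[0, 0]
print(f"cosine similarity: {sim:.4f}")
# General expectation (verify on your own data): AWQ INT4 typically retains 95-98% embedding fidelity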
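Per-text cosine similarity does not show whether retrieval rankings are preserved, so a ranking-level metric may be worth adding. The sketch below computes top-k overlap between a reference and a test embedding set in plain Python. The 2-D vectors are synthetic stand-ins for BF16 and AWQ embeddings, and the function names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, docs, k):
    """Indices of the k documents most similar to the query."""
    order = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return order[:k]

def overlap_at_k(ref_query, ref_docs, test_query, test_docs, k=2):
    """Fraction of the reference top-k that the test embeddings also retrieve."""
    ref = set(top_k(ref_query, ref_docs, k))
    test = set(top_k(test_query, test_docs, k))
    return len(ref & test) / k

# Synthetic data: the "test" vectors are the reference vectors plus small noise.
ref_docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
test_docs = [[0.98, 0.05], [0.75, 0.62], [0.03, 1.0], [-1.0, 0.02]]
ref_query, test_query = [1.0, 0.1], [0.99, 0.12]

overlap = overlap_at_k(ref_query, ref_docs, test_query, test_docs, k=2)
print(overlap)  # 1.0
```

Averaging this overlap across a representative query set gives a single number to track against the BF16 baseline.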

NOESIS provenance

This bundle was produced as a community contribution during the NOESIS DHCF-FNO development cycle. It is not used directly in the NOESIS dubbing pipeline; NOESIS instead uses BidirLM-Omni-2.5B-NF4 (2.62 GB, cross-modal text+image+audio), which is more compact and supports multi-modal queries.

Sister AWQ-INT4 bundles in the same chain (autoawq recipe, 64 samples × 384 seq calibration):

License

Apache License 2.0 (inherited from upstream Qwen/Qwen3-Embedding-8B).

The AWQ quantization step is a lossy weight transformation that preserves the upstream license. NOESIS storage layer © AMAImedia 2026 (DHCF-FNO project).

Citation

@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}

Produced 2026-05-18 by NOESIS DHCF-FNO v15.7 — AMAImedia.com
