Qwen3-Embedding-8B — NOESIS AWQ INT4 (backbone-only derivative)

AWQ INT4 quantization of Qwen/Qwen3-Embedding-8B. ⚠️ This is a backbone-only derivative: lm_head.weight was MISSING from the source checkpoint (embedding models pool hidden states instead of using a language-modeling head), so the AWQ tooling re-initialized it with random weights. For embedding inference you must apply a pooling step to the hidden states yourself. Apache 2.0 community contribution from AMAImedia.

⚠️ Critical caveat — embedding head required

The upstream Qwen/Qwen3-Embedding-8B produces text embeddings via mean pooling over the last hidden state, NOT via the lm_head (which doesn't exist in the source — see load report below).

lm_head.weight | MISSING | <-- re-initialized by AWQ tooling

The AWQ runner expected a Qwen3ForCausalLM-style architecture, which requires lm_head. It found no such tensor in the checkpoint and initialized one with random weights. Generation through this lm_head produces gibberish (confirmed by smoke test).

For embedding inference, ignore lm_head entirely — use mean pooling over last_hidden_state:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "AMAImedia/Qwen3-Embedding-8B-NOESIS-AWQ-INT4"
model = AutoModelForCausalLM.from_pretrained(
    repo, device_map={"": 0}, torch_dtype=torch.float16,
).eval()
tokenizer = AutoTokenizer.from_pretrained(repo)

inp = tokenizer(["Hello world"], return_tensors="pt", padding=True).to(0)
with torch.no_grad():
    # Request hidden states; the logits from the random lm_head are ignored
    out = model(**inp, output_hidden_states=True)

# Mean pooling over non-padded tokens (accumulate in float32 for stability)
mask = inp.attention_mask.unsqueeze(-1).float()  # [B, T, 1]
hidden = out.hidden_states[-1].float()           # [B, T, 4096]
embedding = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(embedding.shape)  # [1, 4096]

# Normalize for cosine similarity
embedding = F.normalize(embedding, p=2, dim=-1)

Implications:

  • ❌ Do NOT call .generate() — output is gibberish (random lm_head)
  • ✅ The Qwen3 backbone is validly INT4-quantized
  • ✅ Mean-pooled embeddings should be close to upstream quality
  • ⚠️ For production retrieval, validate AWQ embeddings vs BF16 baseline
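The masking arithmetic behind the pooling step can be sketched on toy arrays. This is a minimal illustration, not an official API of this bundle: `mean_pool`, `last_token_pool`, and `l2_normalize` are hypothetical helper names, NumPy arrays stand in for the torch tensors used in the example above, and right padding is assumed. The last-token variant is included because the upstream Qwen/Qwen3-Embedding-8B model card appears to use last-token pooling in its own usage example, so comparing both strategies against the BF16 baseline is a reasonable part of validation.

```python
import numpy as np

def mean_pool(hidden, attention_mask):
    """Average hidden states over real (non-padded) tokens. hidden: [B, T, H]."""
    mask = attention_mask[..., None].astype(hidden.dtype)      # [B, T, 1]
    return (hidden * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1e-9)

def last_token_pool(hidden, attention_mask):
    """Pick the hidden state of the last real token (assumes right padding)."""
    last = attention_mask.sum(axis=1) - 1                      # index of last real token
    return hidden[np.arange(hidden.shape[0]), last]

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)

# Toy batch: 2 sequences, 3 positions, hidden size 2.
# The second sequence has one padding position at the end.
hidden = np.array([[[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]],
                   [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 1],
                           [1, 1, 0]])

mean_emb = l2_normalize(mean_pool(hidden, attention_mask))
last_emb = l2_normalize(last_token_pool(hidden, attention_mask))
print(mean_emb.shape, last_emb.shape)
```

Both helpers ignore the padded position in the second sequence. With left padding the last-token index would simply be the final position, so the padding side of the tokenizer matters when switching strategies.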

Specifications

| Field | Value |
|---|---|
| Base model | Qwen/Qwen3-Embedding-8B |
| Architecture | Qwen3ForCausalLM (forced; original was Qwen3 backbone + mean pool) |
| Hidden size | 4096 |
| Layers | 36 |
| Attention heads | 32 |
| KV heads | 8 |
| Vocab | 151 936 |
| Embedding dim | 4096 (matches hidden_size; mean-pooled) |
| Context length | 32 768 |
| Format | AWQ INT4 group-128 (GEMM) |
| Bundle size on disk | 5.69 GB (2 shards) |
| Estimated VRAM (inference) | ~5.3 GB ✅ RTX 3060 6 GB |
| License | Apache 2.0 (inherited from upstream) |

Quantization details

| Parameter | Value |
|---|---|
| Library | autoawq |
| Tool | gptqmodel 7.0.0 |
| Method | AWQ (Activation-aware Weight Quantization) |
| Bits | 4 (INT4) |
| Group size | 128 |
| Zero point | True |
| Version | GEMM |
| Compute dtype | float16 |
| Calibration samples | 64 |
| Calibration seq len | 384 |
| Calibration source | NOESIS router dataset (50K curated multilingual samples) |
| Wall clock | 62.6 min |
| RNG seed | 1729 |
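To make the Bits, Group size, and Zero point settings concrete, here is a minimal sketch of asymmetric group-wise INT4 quantization on a single weight row. It illustrates only the storage scheme (one scale and one zero point per group of 128 weights), not AWQ's activation-aware scaling search and not the exact code path used to produce this bundle. The function names and the random weight row are hypothetical.

```python
import numpy as np

def quantize_groupwise(w, group_size=128, bits=4):
    """Asymmetric quantization of a 1-D weight row: one scale and one
    zero point per group of `group_size` consecutive weights."""
    qmax = 2**bits - 1                                  # 15 for INT4
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax      # step size per group
    zero = np.round(-w_min / scale)                     # integer zero point per group
    q = np.clip(np.round(groups / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero):
    """Map 4-bit codes back to approximate float weights."""
    return ((q.astype(np.float32) - zero) * scale).reshape(-1)

# Hypothetical weight row with 4096 entries (32 groups of 128)
rng = np.random.default_rng(1729)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

q, scale, zero = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, zero)
max_err = float(np.abs(w - w_hat).max())
print(f"groups: {scale.shape[0]}, max abs error: {max_err:.6f}")
```

The reconstruction error is governed by the per-group step size, which is why smaller groups trade extra scale and zero-point storage for accuracy.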

Quantized layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. NOT quantized: lm_head (random — see caveat), embed_tokens, all *norm layers.
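As a rough cross-check of the weight footprint, the split between quantized and unquantized layers can be turned into arithmetic. This is a back-of-the-envelope sketch under stated assumptions: `head_dim = 128` and an MLP intermediate size of 12 288 are not given in this card and are assumed here, scales are assumed to be 16-bit, zero points are assumed to be packed at 4 bits per group, and norm weights and file metadata are ignored.

```python
def quantized_bytes(n_params, bits=4, group_size=128, scale_bits=16):
    """Bytes for group-quantized weights: packed codes plus one scale
    and one zero point per group."""
    per_weight_bits = bits + (scale_bits + bits) / group_size
    return n_params * per_weight_bits / 8

# Values from the Specifications table
hidden, layers, heads, kv_heads, vocab = 4096, 36, 32, 8, 151_936
# Assumptions (not stated in this card)
head_dim, intermediate = 128, 12_288

# q_proj and o_proj use all heads; k_proj and v_proj use the KV heads
attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
mlp = 3 * hidden * intermediate                 # gate_proj, up_proj, down_proj
linear_params = layers * (attn + mlp)           # the INT4-quantized part

fp16_params = 2 * vocab * hidden                # embed_tokens + lm_head, 16-bit
total = quantized_bytes(linear_params) + fp16_params * 2

print(f"quantized linear params: {linear_params / 1e9:.2f}B")
print(f"estimated weight bytes:  {total / 2**30:.2f} GiB")
```

Treat the result as a sanity check against the bundle size in the Specifications table, not as a VRAM prediction: activations, the KV cache, and framework overhead come on top of the weights.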

Smoke test (post-quant validation)

Load:    11.0 s
Gen:     2.1 s (20 tokens, via random lm_head)
VRAM:    8.00 GB peak
Output:  'Encode this for retrieval: NOESIS multilingual dubbing (" ("(" ...'
Result:  PASS load (degenerate gen expected — lm_head random)

The "PASS" status reflects only that the AWQ INT4 model loaded successfully. The generated text is gibberish because lm_head is randomly initialized. Embedding inference (mean-pooled hidden states) is what this model is for; see the mean-pooling example above.

Use cases

  • Semantic search / retrieval — text → 4096-dim embedding via mean pooling
  • Sentence similarity — cosine distance between normalized embeddings
  • Clustering — group documents by embedding similarity
  • Text generation — DO NOT use, lm_head is random
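For the retrieval and similarity use cases, ranking reduces to a dot product once the embeddings are L2-normalized. The sketch below assumes that normalization has already been applied. `top_k` is a hypothetical helper, and the 3-dimensional vectors are toy stand-ins for the 4096-dimensional embeddings.

```python
import numpy as np

def top_k(query_emb, doc_embs, k=3):
    """Return (index, score) pairs for the k most similar documents.
    Assumes all embeddings are L2-normalized, so dot product = cosine."""
    scores = doc_embs @ query_emb
    order = np.argsort(-scores)[:k]          # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy unit-length vectors standing in for real embeddings
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.6, 0.8, 0.0]])
query = np.array([0.0, 1.0, 0.0])

hits = top_k(query, docs, k=2)
for idx, score in hits:
    print(f"doc {idx}: cosine {score:.2f}")
```

For large corpora an approximate nearest-neighbour index would replace the brute-force matrix product, but the scoring rule stays the same.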

Quality validation (recommended before production)

Compare embedding quality against upstream BF16 on your dataset:

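# Pseudo-eval: encode_with_bf16 and encode_with_awq are placeholders that you
# must implement; each should return a 1-D NumPy vector for one input string.
from sklearn.metrics.pairwise import cosine_similarity

upstream_emb = encode_with_bf16("query")  # upstream BF16 model
awq_emb = encode_with_awq("query")        # this bundle
sim = cosine_similarity([upstream_emb], [awq_emb])[0, 0]
print(f"cosine similarity: {sim:.4f}")
# Rough expectation, to be verified on your data:
# AWQ INT4 typically retains 95-98% embedding fidelity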
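A single query pair says little about retrieval behaviour, so a batch-level comparison is usually more informative. The following sketch assumes you have already encoded the same texts with both models into two matrices of shape [N, D]. Here random data and a noisy copy stand in for the BF16 and AWQ embeddings, and `paired_cosine` and `topk_overlap` are hypothetical helper names.

```python
import numpy as np

def l2_normalize(x):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)

def paired_cosine(a, b):
    """Cosine similarity between row i of a and row i of b."""
    return (l2_normalize(a) * l2_normalize(b)).sum(axis=-1)

def topk_overlap(a, b, k=5):
    """Mean fraction of shared top-k neighbours per row (self excluded)."""
    def neighbours(x):
        xn = l2_normalize(x)
        sims = xn @ xn.T
        np.fill_diagonal(sims, -np.inf)      # never count a row as its own neighbour
        return np.argsort(-sims, axis=1)[:, :k]
    na, nb = neighbours(a), neighbours(b)
    return float(np.mean([len(set(r) & set(s)) / k for r, s in zip(na, nb)]))

# Stand-in data: random "BF16" embeddings and a noisy copy in the AWQ role
rng = np.random.default_rng(0)
bf16_emb = rng.normal(size=(50, 64))
awq_emb = bf16_emb + 0.05 * rng.normal(size=bf16_emb.shape)

cos = paired_cosine(bf16_emb, awq_emb)
overlap = topk_overlap(bf16_emb, awq_emb)
print(f"mean cosine: {cos.mean():.4f}, min cosine: {cos.min():.4f}")
print(f"top-5 neighbour overlap: {overlap:.2f}")
```

Paired cosine measures how far each vector moved, while neighbour overlap measures whether rankings changed, which is closer to what a retrieval system depends on.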

NOESIS provenance

This bundle was produced as a community contribution during the NOESIS DHCF-FNO development cycle. It is not used directly in the NOESIS dubbing pipeline; NOESIS instead uses BidirLM-Omni-2.5B-NF4 (2.62 GB, cross-modal text+image+audio), which is more compact and supports multimodal queries.

Sister AWQ-INT4 bundles in the same chain use the same recipe (autoawq, 64 calibration samples × 384-token sequences).

License

Apache License 2.0 (inherited from upstream Qwen/Qwen3-Embedding-8B).

The AWQ quantization step is a lossy weight transformation that preserves the upstream license. NOESIS storage layer © AMAImedia 2026 (DHCF-FNO project).

Citation

@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}

Produced 2026-05-18 by NOESIS DHCF-FNO v15.7 — AMAImedia.com
