POG-E4B — Project Omni-Gemma dense retrieval adapters

35M-parameter Matryoshka retrieval adapters on top of a frozen 4-bit Gemma-4 E4B QAT backbone: trained from cached features so the backbone is never held in training memory, evaluated head-to-head against zembed-1, and compressible to 320-byte binary vectors that retain 98.9% of fp16 NDCG@10 on MS MARCO dev.

POG reframes a quantized instruction LLM as a frozen nonlinear basis function: the unsloth/gemma-4-E4B-it-qat-GGUF (UD-Q4_K_XL) backbone is never trained. Residual-stream hidden states are intercepted mid-network at layers {28, 32, 36} of 42 — the semantic plateau, before terminal-layer anisotropy — via a custom ggml scheduler eval-callback inside llama.cpp (extractor/pog-extract.cpp, included, with a persistent --serve mode). All task optimization lives in the adapter, trained from NVMe-cached features.

Two checkpoints ship here:

Checkpoint	training data	use it when
`adapter.safetensors` (v1, flagship)	80k MS MARCO queries, cross-encoder-ensemble margin distillation	best measured quality, in- and out-of-domain
`v21/` (v2.1 variant)	261k queries / 5 datasets (MS MARCO + NQ + HotpotQA + Quora + TriviaQA), POG-self-mined negatives on MS MARCO	multi-domain training base for further fine-tuning

Architecture

text ── llama.cpp prefill (frozen UD-Q4_K_XL, LM head skipped, layers 38-41 trimmed)
          │  eval-callback intercepts l_out-{28,32,36}   [3 × n_tok × 2560]
          ▼  cached pooling slots per layer: prefix-mean · 4 segment-means · last-token
feats [3, 6, 2560]
          │ 1. learned per-slot softmax layer mix (γ)
          │ 2. RMSNorm  (Q4 outlier stabilization)
          │ 3. instruction-aware attention pooling  (prefix queries content slots)
          │ 4. X-GLU fusion of attention-pooled vs mean-pooled views
          │ 5. residual SwiGLU bottleneck  2560 → 3072 → 2560
          │ 6. semi-orthogonal Matryoshka head  → 2560
          ▼
embedding ∈ R^2560   (truncate to 1024 / 512 / 256 and re-normalize)
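
Steps 1 and 2 and the final truncation are simple enough to sketch in NumPy. This is a minimal illustration, not the code in pog_adapter.py: the function names, the [6, 3] shape of the γ logits, and the RMSNorm epsilon are assumptions, and random arrays stand in for cached features and final embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_mix(feats, gamma):
    """feats: [3 layers, 6 slots, d]; gamma: [6 slots, 3 layers] logits (assumed shape).
    Returns [6, d]: a per-slot convex combination of the three layers."""
    w = softmax(gamma, axis=-1)                # each row sums to 1
    return np.einsum("sl,lsd->sd", w, feats)

def rms_norm(x, weight, eps=1e-6):
    """Scale each row to unit root-mean-square, then apply a learned gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def truncate_and_renormalize(emb, dim):
    """emb: [n, 2560]. Keep the first `dim` coordinates and L2-normalize."""
    head = emb[:, :dim]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 6, 2560)).astype(np.float32)  # stand-in for cached features
gamma = np.zeros((6, 3), dtype=np.float32)                    # zero logits -> uniform mix
mixed = rms_norm(layer_mix(feats, gamma), np.ones(2560, dtype=np.float32))

emb = rng.standard_normal((2, 2560))                          # stand-in for final embeddings
e256 = truncate_and_renormalize(emb, 256)
print(mixed.shape, e256.shape)
```

Truncation keeps a prefix of the coordinates, so the 256-dim vector is the first 256 entries of the 2560-dim one, rescaled to unit length.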

True early exit: extractor/trim_gguf.py rewrites the GGUF without layers 38-41, slicing the fused per-layer-embedding tensors and patching the shared-KV metadata. Hidden states at all three interception layers are bit-identical to the untrimmed model (max abs diff 0.0). The result is a 3.8 GB model with ~10% faster prefill, and the 262k-vocab LM head is skipped by requesting zero outputs. Sustained throughput: 9,000 tok/s on one RTX 5080.
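
A bit-exactness check of this kind reduces to a max-absolute-difference comparison between features extracted with the full and the trimmed GGUF. The sketch below shows only that comparison; the arrays are synthetic stand-ins, `max_abs_diff` is a hypothetical helper, and nothing is assumed about how pog-extract writes features to disk.

```python
import numpy as np

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two feature arrays."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return float(np.max(np.abs(a - b)))

rng = np.random.default_rng(0)
full = rng.standard_normal((3, 6, 2560)).astype(np.float32)  # stand-in: full-model features
trimmed = full.copy()                                        # stand-in: trimmed-model features
diff = max_abs_diff(full, trimmed)
print("bit-identical" if diff == 0.0 else f"max abs diff {diff}")
```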

Training (v1 flagship)

Data: MS MARCO, 80k queries × (1 positive + 4 hard negatives) with continuous cross-encoder ensemble margins (sentence-transformers/msmarco, bert-ensemble-margin-mse).
Loss: MarginMSE distillation + Matryoshka Nested Contrastive (in-batch InfoNCE, learned temperature) at every truncation m ∈ {256, 512, 1024, 2560}, with the lowest dimensions weighted most heavily.
Phase 1 (feature caching): pure inference, ~4.5 GB VRAM. Phase 2 (adapter): 22 min, < 7 GB VRAM, 10,240 in-batch documents per step — the backbone is not in memory at all.
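
The loss described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code in training/: the per-dimension weights are invented (the text only says low dimensions count most), the temperature is fixed rather than learned, MarginMSE is applied once at full width, cosine similarity is the score, and each query has one hard negative instead of four.

```python
import numpy as np

DIMS = (256, 512, 1024, 2560)
# Hypothetical weights: the source only states that low dims are weighted hardest.
WEIGHTS = {256: 2.0, 512: 1.5, 1024: 1.25, 2560: 1.0}

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def margin_mse(q, pos, neg, teacher_margin, scale=1.0):
    """Match the student's score margin to the cross-encoder margin.
    `scale` is a hypothetical knob to bring cosine margins onto the teacher's scale."""
    student = scale * (np.sum(q * pos, axis=-1) - np.sum(q * neg, axis=-1))
    return float(np.mean((student - teacher_margin) ** 2))

def info_nce(q, docs, temperature):
    """In-batch InfoNCE: query i's positive is docs[i]; every other row is a negative."""
    logits = (q @ docs.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(q.shape[0])
    return float(-np.mean(log_prob[idx, idx]))

def nested_loss(q, pos, neg, teacher_margin, temperature=0.05):
    """MarginMSE at full width plus weighted InfoNCE at every Matryoshka truncation."""
    total = margin_mse(l2_normalize(q), l2_normalize(pos), l2_normalize(neg), teacher_margin)
    for m in DIMS:
        qm = l2_normalize(q[:, :m])
        docs = l2_normalize(np.concatenate([pos[:, :m], neg[:, :m]], axis=0))
        total += WEIGHTS[m] * info_nce(qm, docs, temperature)
    return total

rng = np.random.default_rng(0)
pos = rng.standard_normal((8, 2560))
neg = rng.standard_normal((8, 2560))
teacher = np.ones(8)                                          # stand-in teacher margins
loss_aligned = nested_loss(pos + 0.01 * rng.standard_normal((8, 2560)), pos, neg, teacher)
loss_random = nested_loss(rng.standard_normal((8, 2560)), pos, neg, teacher)
print(loss_aligned < loss_random)
```

Because every truncation is re-normalized before scoring, each prefix is trained as a usable embedding in its own right rather than as a slice of the full vector.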

Evaluation

All numbers are NDCG@10 unless a column says otherwise (the in-domain table also reports R@10), measured locally with the same harness, corpora, and metric code (eval/, raw results JSON included). The harness is anchored: all-MiniLM-L6-v2 reproduces its canonical MTEB scores in it (measured vs. published: NFCorpus 0.317 vs. 0.32, SciFact 0.645 vs. 0.645, SciDocs 0.216 vs. 0.216), and zembed-1 reproduces its model-card example and is batch-consistent.
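
For readers unfamiliar with the two metrics, here is a small reference sketch of NDCG@10 and R@10 for a single query. It uses the common linear-gain form of DCG; it is not the metric code in eval/, and the function names and toy document ids are illustrative only.

```python
import math

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """ranked_doc_ids: best-first list; qrels: dict doc_id -> relevance grade (> 0 = relevant)."""
    dcg = sum(qrels.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_doc_ids[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_doc_ids, qrels, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

ranking = ["d3", "d1", "d7"]          # toy system output for one query
qrels = {"d1": 1, "d9": 1}            # toy judgments: two relevant documents
ndcg = ndcg_at_k(ranking, qrels)
recall = recall_at_k(ranking, qrels)
print(round(ndcg, 4), recall)
```

The reported table values would be the mean of these per-query scores over all evaluation queries.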

In-domain: MS MARCO dev (held-out; 2,000 queries, 60,335-passage corpus)

Model	dim	NDCG@10	R@10
raw backbone L32 mean-pool (zero-shot)	2560	0.097	0.160
zembed-1 (zero-shot)	2560	0.228	0.361
all-MiniLM-L6-v2 (trained on MS MARCO — contaminated anchor)	384	0.926	0.994
POG-v1	256	0.746	0.877
POG-v1	2560	0.770	0.900
POG-v2.1	256	0.746	0.889
POG-v2.1	2560	0.770	0.906

7.9× NDCG over the raw frozen features; the 256-dim prefix keeps 96.9% of full quality.

Out-of-domain: BEIR zero-shot

Task	POG-v1@2560	POG-v2.1@2560	zembed-1@2560	MiniLM
NFCorpus	0.231	0.216	0.180	0.317
SciFact	0.494	0.489	0.404	0.645
FiQA-2018	0.182	0.171	0.075	0.369
ArguAna	0.334	0.312	0.058	0.370
SciDocs	0.129	0.128	0.045	0.216

POG beats zembed-1 on all five BEIR tasks and in-domain with 1/114th the trainable parameters. zembed-1's public-BEIR weakness mirrors its own README (it trails its base model on Web Search; it targets ZeroEntropy's private distributions). A 22M MiniLM trained on 1B+ pairs still wins zero-shot transfer, which points to data scale rather than architecture as the main gap.

The v2 ablation: what we learned scaling to 5 datasets

v2 (naive): 261k queries, hard negatives self-mined by POG-v1 across the full pool → transfer regressed (NFCorpus 0.192, SciFact 0.464, FiQA 0.157). Wiki-QA questions have many valid answer passages; top-k self-mining labels true positives as negatives, and the 0.95-relative false-negative guard cannot catch them all.

v2.1 (fixed): mined negatives kept only for MS MARCO (shallow-passage regime where mining is sound), curated seed negatives for HotpotQA/Quora, in-batch only for NQ/TriviaQA → recovered to parity in-domain and near-v1 transfer, with better recall. Takeaway: at this scale, one high-quality margin-distillation source transfers better than diluted multi-dataset training, and self-mining needs a corpus whose label structure supports it. Both checkpoints + the full pipeline (training/) are included so you can push past this.
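
To make the false-negative problem concrete, here is one plausible reading of a "0.95-relative" guard: discard any mined candidate whose retrieval score reaches 95% of the labeled positive's score, then keep the top remaining candidates as hard negatives. This interpretation, the function name, and the toy scores are all assumptions; the actual rule lives in training/.

```python
import numpy as np

def filter_mined_negatives(pos_score, cand_ids, cand_scores, rel_threshold=0.95, k=4):
    """Keep the k best-scoring candidates that score below rel_threshold * pos_score."""
    cand_scores = np.asarray(cand_scores, dtype=np.float64)
    order = np.argsort(-cand_scores)               # best-scoring candidates first
    cutoff = rel_threshold * pos_score
    kept = [cand_ids[i] for i in order if cand_scores[i] < cutoff]
    return kept[:k]

cand_ids = ["d0", "d1", "d2", "d3", "d4", "d5"]
cand_scores = [0.79, 0.70, 0.77, 0.50, 0.30, 0.20]  # toy similarities to the query
negatives = filter_mined_negatives(0.80, cand_ids, cand_scores)
print(negatives)
```

With a positive score of 0.80 the cutoff is 0.76, so d0 and d2 are dropped as likely false negatives. If d1 is in fact a second valid answer passage, it still passes the guard at 0.70 and is trained on as a negative, which is the failure mode described above for Wiki-QA data.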

Vector compression (POG-v1 flagship, same harness)

Encoding	bytes/vector	MS MARCO dev	% of fp16
fp16 @2560	5,120	0.7702	100%
int8 @2560	2,560	0.7701	100.0%
binary @2560, fp32 queries	320	0.762	98.9%
binary @2560 symmetric	320	0.752	97.6%
int8 @256	256	0.745	96.7%
binary @256, fp32 queries	32	0.674	87.4%
binary @256 symmetric	32	0.592	76.9%

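int8 is effectively free at both measured dimensions (roughly 0.001 NDCG@10 or less). A 320-byte binary document vector scored against fp32 queries retains 98.9% of fp16 retrieval quality while being 16× smaller than fp16 and 32× smaller than fp32. Full per-task tables are in eval/results_quant.json.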
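
The encodings in the table can be sketched with NumPy as below. This is a generic illustration rather than the release's quantization code: sign binarization, per-vector max-abs int8 scaling, and all function names are assumptions. "fp32 queries" is modeled as a float query scored against ±1 document bits, and "symmetric" as both sides binarized.

```python
import numpy as np

def binarize(emb):
    """[n, d] float -> [n, d/8] uint8, one sign bit per dimension (2560 dims -> 320 bytes)."""
    return np.packbits(emb > 0, axis=-1)

def unpack_pm1(codes, d):
    """Expand packed bits back to a [n, d] matrix of +1/-1 values."""
    bits = np.unpackbits(codes, axis=-1)[:, :d]
    return bits.astype(np.float32) * 2.0 - 1.0

def score_asymmetric(q_fp32, doc_codes):
    """Float queries against binary documents."""
    return q_fp32 @ unpack_pm1(doc_codes, q_fp32.shape[-1]).T

def score_symmetric(q_codes, doc_codes, d):
    """Binary queries against binary documents via Hamming distance (d - 2 * hamming)."""
    xor = np.bitwise_xor(q_codes[:, None, :], doc_codes[None, :, :])
    hamming = np.unpackbits(xor, axis=-1).sum(axis=-1, dtype=np.int64)
    return d - 2 * hamming

def quantize_int8(emb):
    """Per-vector max-abs scaling to int8 (assumed scheme); returns codes and scales."""
    scale = np.abs(emb).max(axis=-1, keepdims=True) / 127.0
    return np.round(emb / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
docs = rng.standard_normal((4, 2560)).astype(np.float32)     # stand-in document embeddings
queries = rng.standard_normal((2, 2560)).astype(np.float32)  # stand-in query embeddings
doc_codes, q_codes = binarize(docs), binarize(queries)
asym = score_asymmetric(queries, doc_codes)
sym = score_symmetric(q_codes, doc_codes, 2560)
docs_i8, docs_scale = quantize_int8(docs)
print(doc_codes.shape[1], "bytes per binary vector")
```

Keeping the query in floating point costs nothing in index size, since only document vectors are stored, which is consistent with the asymmetric rows scoring higher than the symmetric ones at the same byte budget.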

adapter.int8.safetensors: per-channel INT8 adapter weights (35 MB vs 142 MB), embedding cosine to fp32 ≥ 0.9998.
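
Per-channel INT8 weight quantization and the cosine check quoted above follow a standard pattern, sketched here on a single random linear layer. The layer shape, the choice of output rows as "channels", and the helper names are assumptions; the real export script is in training/.

```python
import numpy as np

def quantize_per_channel(w):
    """w: [out, in] fp32 -> (int8 weights, one fp32 scale per output row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale).astype(np.float32)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def min_cosine(a, b):
    """Worst-case row-wise cosine similarity between two output matrices."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float(np.min(num / den))

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)   # stand-in layer weights
x = rng.standard_normal((16, 128)).astype(np.float32)   # stand-in inputs
w_q, w_scale = quantize_per_channel(w)
y_fp32 = x @ w.T
y_int8 = x @ (w_q.astype(np.float32) * w_scale).T       # dequantize, then apply
worst = min_cosine(y_fp32, y_int8)
print(w.nbytes // w_q.nbytes, "x smaller weights")
```

One byte per weight instead of four matches the roughly 4× size reduction between the two adapter files.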

Usage

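# Build the extractor against a llama.cpp checkout at $LLAMA_CPP (libraries in build/bin)
g++ -O2 -std=c++17 extractor/pog-extract.cpp \
    -I$LLAMA_CPP/include -I$LLAMA_CPP/ggml/include \
    -L$LLAMA_CPP/build/bin -lllama -lggml -lggml-base \
    -Wl,-rpath,$LLAMA_CPP/build/bin -o pog-extract
# Download the backbone into the current directory (without --local-dir it lands in the HF cache)
hf download unsloth/gemma-4-E4B-it-qat-GGUF gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf --local-dir .
python extractor/trim_gguf.py gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf gemma-4-E4B-L38.gguf 38  # optional, ~10% faster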

from pog_embed import POGEmbedder

emb = POGEmbedder(gguf_path="gemma-4-E4B-L38.gguf", adapter_path=".",
                  extractor_bin="./pog-extract")   # persistent server mode by default
q = emb.encode_query(["what is backpropagation"], dim=256)
d = emb.encode_document(["Backpropagation computes gradients...",
                         "The Eiffel tower is in Paris."], dim=256)
print(q @ d.T)

Honest deviations from the original POG blueprint

Blueprint	This release	Why
5M synthetic multimodal triplets	80k-261k text queries	single-GPU budget; multimodal slots reserved in the format
`<think>`-trace dual-phase pooling	X-GLU fuses attention-pooled vs mean-pooled views	embedding inputs are raw text, not generated traces
>70B Elo cross-encoder ensemble	BERT cross-encoder ensemble margins	strongest public continuous-margin set for MS MARCO
intercept `ffn_down`	intercept `l_out` (full residual stream)	strictly more information, same cost
early-exit after layer 36	GGUF trimmed to 38 layers (bit-exact) + LM head skip	graph is static per ubatch; layer 37 kept so l_out-36 isn't row-filtered
d=3072	d=2560 (actual E4B width)	blueprint figure didn't match the shipped model
"8 KB → <128 B" compression	5,120 B → 320 B at 99% (32 B at 87.4%)	measured, not projected

Files

adapter.safetensors / adapter.int8.safetensors / adapter_config.json — v1 flagship
v21/ — v2.1 multi-dataset variant (fp32 + int8 + config)
extractor/ — pog-extract.cpp (eval-callback interception + --serve), trim_gguf.py
pog_adapter.py, pog_embed.py — model + end-to-end embedder
training/ — full reproducible pipeline: data prep (v1+v2), mining, training, int8 export
eval/ — harness + all raw results (results.json, results_indomain.json, results_quant.json)

Model tree for LordAce9/POG-E4B-v1: google/gemma-4-E4B (base model) → google/gemma-4-E4B-it (fine-tuned) → this model (fine-tuned)