POG-E4B: Project Omni-Gemma dense retrieval adapters
35M-parameter Matryoshka retrieval adapters riding on a frozen 4-bit Gemma-4 E4B QAT backbone: trained without ever holding the backbone in training memory, evaluated head-to-head against zembed-1, and compressed down to 320-byte binary vectors at 99% quality.
POG reframes a quantized instruction LLM as a frozen nonlinear basis function: the
unsloth/gemma-4-E4B-it-qat-GGUF (UD-Q4_K_XL) backbone is never trained. Residual-stream hidden
states are intercepted mid-network at layers {28, 32, 36} of 42 (the semantic plateau, before
terminal-layer anisotropy) via a custom ggml scheduler eval-callback inside llama.cpp
(extractor/pog-extract.cpp, included, with a persistent --serve mode). All task optimization
lives in the adapter, trained from NVMe-cached features.
Two checkpoints ship here:
| checkpoint | training data | use it when |
|---|---|---|
| adapter.safetensors (v1, flagship) | 80k MS MARCO queries, cross-encoder-ensemble margin distillation | best measured quality, in- and out-of-domain |
| v21/ (v2.1 variant) | 261k queries / 5 datasets (MS MARCO + NQ + HotpotQA + Quora + TriviaQA), POG-self-mined negatives on MS MARCO | multi-domain training base for further fine-tuning |
Architecture
text ─▶ llama.cpp prefill (frozen UD-Q4_K_XL, LM head skipped, layers 38-41 trimmed)
  │   eval-callback intercepts l_out-{28,32,36}   [3 × n_tok × 2560]
  ▼   cached pooling slots per layer: prefix-mean · 4 segment-means · last-token
feats [3, 6, 2560]
  │   1. learned per-slot softmax layer mix (γ)
  │   2. RMSNorm (Q4 outlier stabilization)
  │   3. instruction-aware attention pooling (prefix queries content slots)
  │   4. X-GLU fusion of attention-pooled vs mean-pooled views
  │   5. residual SwiGLU bottleneck 2560 → 3072 → 2560
  │   6. semi-orthogonal Matryoshka head → 2560
  ▼
embedding ∈ R^2560   (truncate to 1024 / 512 / 256 and re-normalize)
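Step 1 of the adapter can be illustrated with a small sketch. It assumes that the learned mix γ holds one logit per (slot, layer) pair and that the softmax runs over the layer axis, so each of the 6 slots gets its own convex combination of the 3 intercepted layers. The function names and the toy shapes are hypothetical and are not taken from pog_adapter.py.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def layer_mix(feats, gamma):
    """Mix intercepted layers per pooling slot.

    feats: [n_layers][n_slots][dim] cached features (3 x 6 x 2560 in the README).
    gamma: [n_slots][n_layers] learned logits (assumed layout).
    Returns [n_slots][dim].
    """
    n_layers = len(feats)
    n_slots = len(feats[0])
    dim = len(feats[0][0])
    mixed = []
    for s in range(n_slots):
        w = softmax(gamma[s])  # one weight per layer for this slot
        mixed.append([
            sum(w[l] * feats[l][s][d] for l in range(n_layers))
            for d in range(dim)
        ])
    return mixed

# Toy example: 3 layers, 2 slots, dim 4. Layer l holds the constant l + 1.
feats = [[[float(l + 1)] * 4 for _ in range(2)] for l in range(3)]
gamma = [[0.0, 0.0, 0.0],    # uniform mix of the three layers
         [0.0, 0.0, 50.0]]   # almost all weight on the last layer
mixed = layer_mix(feats, gamma)
```

With uniform logits a slot reduces to the plain average of the three layers, which is a reasonable starting point before training moves the weights.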
True early exit: extractor/trim_gguf.py rewrites the GGUF without layers 38-41, slicing the
fused per-layer-embedding tensors and patching the shared-KV metadata. The result is verified bit-identical
(max abs diff 0.0) at all interception layers. The trimmed model is 3.8 GB with ~10% faster prefill, and the 262k-vocab
LM head is skipped by requesting zero outputs. Sustained throughput: 9,000 tok/s on one RTX 5080.
Training (v1 flagship)
- Data: MS MARCO, 80k queries × (1 positive + 4 hard negatives) with continuous cross-encoder ensemble margins (sentence-transformers/msmarco, bert-ensemble-margin-mse).
- Loss: MarginMSE distillation + Matryoshka Nested Contrastive (in-batch InfoNCE, learned temperature) at every truncation m ∈ {256, 512, 1024, 2560}, low dims weighted hardest.
- Phase 1 (feature caching): pure inference, ~4.5 GB VRAM.
- Phase 2 (adapter): 22 min, < 7 GB VRAM, 10,240 in-batch documents per step. The backbone is not in memory at all.
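The two loss terms listed above can be sketched in plain Python. This is a minimal illustration, not the code in training/: the per-dimension weights, the fixed temperature (the README says it is learned), and all function names are assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def margin_mse(q, pos, neg, teacher_margin):
    """Squared error between the student margin s(q,pos) - s(q,neg)
    and the cross-encoder (teacher) margin for the same triple."""
    student_margin = dot(q, pos) - dot(q, neg)
    return (student_margin - teacher_margin) ** 2

def info_nce(queries, docs, m, temperature):
    """In-batch InfoNCE on the first m dims; docs[i] is the positive of queries[i]."""
    qs = [normalize(q[:m]) for q in queries]
    ds = [normalize(d[:m]) for d in docs]
    loss = 0.0
    for i, q in enumerate(qs):
        logits = [dot(q, d) / temperature for d in ds]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss += log_z - logits[i]  # -log softmax probability of the positive
    return loss / len(qs)

def nested_contrastive(queries, docs, dim_weights, temperature=0.05):
    """Weighted average of InfoNCE over Matryoshka truncations.
    dim_weights maps truncation -> weight (hypothetical values)."""
    total = sum(dim_weights.values())
    return sum(w * info_nce(queries, docs, m, temperature)
               for m, w in dim_weights.items()) / total

# Toy batch with 4-dim embeddings; the low dimension gets the larger weight.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
docs_aligned = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
docs_swapped = [docs_aligned[1], docs_aligned[0]]
weights = {2: 2.0, 4: 1.0}
loss_aligned = nested_contrastive(queries, docs_aligned, weights)
loss_swapped = nested_contrastive(queries, docs_swapped, weights)
```

Applying the contrastive term at every truncation is what pushes the most useful information into the leading dimensions, so that a 256-dim prefix remains usable on its own.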
Evaluation
All numbers are NDCG@10 unless a column says otherwise, measured locally with the identical harness, corpora, and metric code
(eval/, raw results JSON included). The harness is anchored: all-MiniLM-L6-v2 reproduces its
canonical MTEB scores in it (measured vs canonical: NFCorpus 0.317 vs 0.32, SciFact 0.645 vs 0.645, SciDocs 0.216 vs 0.216), and
zembed-1 reproduces its model-card example and is batch-consistent.
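For readers unfamiliar with the metric, here is a minimal NDCG@10 sketch. It assumes linear gain (the relevance label is used directly) and a log2 rank discount, which is the common trec_eval-style definition. Whether the harness in eval/ uses exactly this variant is an assumption, and the function names are illustrative.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """ranked_doc_ids: retrieved ids, best first.
    qrels: dict doc_id -> relevance label for one query."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

qrels = {"d1": 1}
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)   # relevant doc at rank 1
second = ndcg_at_k(["d2", "d1", "d3"], qrels)    # relevant doc at rank 2
missed = ndcg_at_k(["d2", "d3"], qrels)          # relevant doc not retrieved
```

The reported score for a task is then the mean of this per-query value over all evaluated queries.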
In-domain: MS MARCO dev (held-out; 2,000 queries, 60,335-passage corpus)
| Model | dim | NDCG@10 | R@10 |
|---|---|---|---|
| raw backbone L32 mean-pool (zero-shot) | 2560 | 0.097 | 0.160 |
| zembed-1 (zero-shot) | 2560 | 0.228 | 0.361 |
| all-MiniLM-L6-v2 (trained on MS MARCO; contaminated anchor) | 384 | 0.926 | 0.994 |
| POG-v1 | 256 | 0.746 | 0.877 |
| POG-v1 | 2560 | 0.770 | 0.900 |
| POG-v2.1 | 256 | 0.746 | 0.889 |
| POG-v2.1 | 2560 | 0.770 | 0.906 |
7.9× NDCG over the raw frozen features (0.770 vs 0.097); the 256-dim prefix keeps 96.9% of full quality (0.746 vs 0.770).
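Using a shorter Matryoshka prefix amounts to slicing the leading dimensions and re-normalizing, as the architecture diagram notes. A minimal sketch for vectors that were stored at full width follows; the helper name is hypothetical, and in practice the dim argument of the embedder shown under Usage is the simpler route.

```python
import math

def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to rescale
    return [x / norm for x in prefix]

full = [3.0, 4.0, 12.0]            # toy stand-in for a 2560-dim embedding
short = truncate_and_renormalize(full, 2)
```

Re-normalizing matters because dot products between truncated vectors are only comparable as cosine similarities once each prefix has unit length again.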
Out-of-domain: BEIR zero-shot
| Task | POG-v1@2560 | POG-v2.1@2560 | zembed-1@2560 | MiniLM |
|---|---|---|---|---|
| NFCorpus | 0.231 | 0.216 | 0.180 | 0.317 |
| SciFact | 0.494 | 0.489 | 0.404 | 0.645 |
| FiQA-2018 | 0.182 | 0.171 | 0.075 | 0.369 |
| ArguAna | 0.334 | 0.312 | 0.058 | 0.370 |
| SciDocs | 0.129 | 0.128 | 0.045 | 0.216 |
POG beats zembed-1 on all five BEIR tasks and in-domain at 1/114th the trainable parameters. zembed-1's public-BEIR weakness mirrors its own README (it trails its base model on Web Search; it targets ZeroEntropy's private distributions). A 22M MiniLM trained on 1B+ pairs still wins zero-shot transfer: the gap is data scale, not architecture.
The v2 ablation: what we learned scaling to 5 datasets
v2 (naive): 261k queries, hard negatives self-mined by POG-v1 across the full pool. Transfer regressed (NFCorpus 0.192, SciFact 0.464, FiQA 0.157, against v1's 0.231, 0.494, and 0.182). Wiki-QA questions have many valid answer passages; top-k self-mining labels true positives as negatives, and the 0.95-relative false-negative guard cannot catch them all.
v2.1 (fixed): mined negatives kept only for MS MARCO (shallow-passage regime where mining is
sound), curated seed negatives for HotpotQA/Quora, in-batch only for NQ/TriviaQA. This recovered
parity in-domain and near-v1 transfer, with better recall (R@10 0.906 vs 0.900 at 2560 dims). Takeaway: at this scale, one
high-quality margin-distillation source transfers better than diluted multi-dataset training, and
self-mining needs a corpus whose label structure supports it. Both checkpoints + the full
pipeline (training/) are included so you can push past this.
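The false-negative guard mentioned in the v2 paragraph can be illustrated as follows. The sketch assumes that "0.95-relative" means a mined candidate is discarded when its retrieval score reaches 95% of the labeled positive's score, and that scores are positive similarities. The function name, the tuple layout, and k=4 are hypothetical and not taken from training/.

```python
def filter_mined_negatives(pos_score, candidates, rel_threshold=0.95, k=4):
    """Keep the k hardest candidates that stay below the relative threshold.

    pos_score: similarity of the labeled positive (assumed > 0).
    candidates: iterable of (doc_id, score) pairs from self-mining.
    Candidates scoring >= rel_threshold * pos_score are treated as likely
    unlabeled positives and dropped.
    """
    cutoff = rel_threshold * pos_score
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept = [(doc_id, score) for doc_id, score in ranked if score < cutoff]
    return kept[:k]

mined = [("a", 0.79), ("b", 0.77), ("c", 0.70), ("d", 0.50)]
negatives = filter_mined_negatives(pos_score=0.80, candidates=mined)
```

A guard of this kind only removes candidates that score close to the labeled positive. A second valid answer passage that scores somewhat lower still passes, which is consistent with the regression described above for Wiki-QA style data.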
Vector compression (POG-v1 flagship, same harness)
| Encoding | bytes/vector | MS MARCO dev | % of fp16 |
|---|---|---|---|
| fp16 @2560 | 5,120 | 0.7702 | 100% |
| int8 @2560 | 2,560 | 0.7701 | 100.0% |
| binary @2560, fp32 queries | 320 | 0.762 | 98.9% |
| binary @2560 symmetric | 320 | 0.752 | 97.6% |
| int8 @256 | 256 | 0.745 | 96.7% |
| binary @256, fp32 queries | 32 | 0.674 | 87.4% |
| binary @256 symmetric | 32 | 0.592 | 76.9% |
int8 is effectively free at both measured dimensions. A 320-byte binary document vector retains 98.9% of retrieval quality with fp32 queries,
at 16× smaller than fp16 and 32× smaller than fp32. Full per-task tables are in eval/results_quant.json.
adapter.int8.safetensors: per-channel INT8 adapter weights (35 MB vs 142 MB), embedding cosine
to fp32 ≥ 0.9998.
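The byte counts in the table follow from packing one bit per dimension (2560 / 8 = 320 bytes). Here is a minimal sketch of the two binary scoring modes, assuming sign-based binarization and dimensions that are a multiple of 8. The exact scheme used for the published numbers is not stated here, so the function names and the thresholding rule are assumptions.

```python
def binarize(vec):
    """Pack sign bits (1 where x > 0) into bytes, 8 dims per byte, MSB first."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (7 - i % 8)
    return bytes(out)

def symmetric_score(a_bits, b_bits):
    """Binary query vs binary doc: equals the dot product of the +/-1 vectors,
    computed as n_bits - 2 * hamming_distance."""
    n_bits = 8 * len(a_bits)
    hamming = sum(bin(x ^ y).count("1") for x, y in zip(a_bits, b_bits))
    return n_bits - 2 * hamming

def asymmetric_score(query, doc_bits):
    """Float query vs binary doc: sum of q_i * (+1 if bit set else -1)."""
    score = 0.0
    for i, q in enumerate(query):
        bit = (doc_bits[i // 8] >> (7 - i % 8)) & 1
        score += q if bit else -q
    return score

doc_a = binarize([1, -1, 1, -1, 1, -1, 1, -1])
doc_b = binarize([1, 1, 1, 1, -1, -1, -1, -1])
query = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
n_bytes_2560 = len(binarize([0.1] * 2560))
```

Keeping the query in floating point preserves its magnitude information while documents stay at one bit per dimension, which fits the table's gap between the "fp32 queries" and "symmetric" rows.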
Usage
g++ -O2 -std=c++17 extractor/pog-extract.cpp \
    -I$LLAMA_CPP/include -I$LLAMA_CPP/ggml/include \
    -L$LLAMA_CPP/build/bin -lllama -lggml -lggml-base \
    -Wl,-rpath,$LLAMA_CPP/build/bin -o pog-extract
hf download unsloth/gemma-4-E4B-it-qat-GGUF gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf --local-dir .
python extractor/trim_gguf.py gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf gemma-4-E4B-L38.gguf 38  # optional, ~10% faster

from pog_embed import POGEmbedder
emb = POGEmbedder(gguf_path="gemma-4-E4B-L38.gguf", adapter_path=".",
                  extractor_bin="./pog-extract")  # persistent server mode by default
q = emb.encode_query(["what is backpropagation"], dim=256)
d = emb.encode_document(["Backpropagation computes gradients...",
                         "The Eiffel tower is in Paris."], dim=256)
print(q @ d.T)  # query-document similarity matrix, shape (1, 2)
Honest deviations from the original POG blueprint
| Blueprint | This release | Why |
|---|---|---|
| 5M synthetic multimodal triplets | 80k-261k text queries | single-GPU budget; multimodal slots reserved in the format |
| <think>-trace dual-phase pooling | X-GLU fuses attention-pooled vs mean-pooled views | embedding inputs are raw text, not generated traces |
| >70B Elo cross-encoder ensemble | BERT cross-encoder ensemble margins | strongest public continuous-margin set for MS MARCO |
| intercept ffn_down | intercept l_out (full residual stream) | strictly more information, same cost |
| early-exit after layer 36 | GGUF trimmed to 38 layers (bit-exact) + LM head skip | graph is static per ubatch; layer 37 kept so l_out-36 isn't row-filtered |
| d=3072 | d=2560 (actual E4B width) | blueprint figure didn't match the shipped model |
| "8 KB → <128 B" compression | 5,120 B → 320 B at 99% (32 B at 87.4%) | measured, not projected |
Files
- adapter.safetensors / adapter.int8.safetensors / adapter_config.json: v1 flagship
- v21/: v2.1 multi-dataset variant (fp32 + int8 + config)
- extractor/: pog-extract.cpp (eval-callback interception + --serve), trim_gguf.py
- pog_adapter.py, pog_embed.py: model + end-to-end embedder
- training/: full reproducible pipeline (data prep for v1 and v2, mining, training, int8 export)
- eval/: harness + all raw results (results.json, results_indomain.json, results_quant.json)