POG-E4B: Project Omni-Gemma dense retrieval adapters

35M-parameter Matryoshka retrieval adapters riding on a frozen 4-bit Gemma-4 E4B QAT backbone: trained without ever holding the backbone in training memory, evaluated head-to-head against zembed-1, and compressed down to 320-byte binary vectors at 98.9% of fp16 quality.

POG reframes a quantized instruction LLM as a frozen nonlinear basis function: the unsloth/gemma-4-E4B-it-qat-GGUF (UD-Q4_K_XL) backbone is never trained. Residual-stream hidden states are intercepted mid-network at layers {28, 32, 36} of 42 (the semantic plateau, before terminal-layer anisotropy) via a custom ggml scheduler eval-callback inside llama.cpp (extractor/pog-extract.cpp, included, with a persistent --serve mode). All task optimization lives in the adapter, trained from NVMe-cached features.

Two checkpoints ship here:

| Checkpoint | Training data | Use it when |
|---|---|---|
| adapter.safetensors (v1, flagship) | 80k MS MARCO queries, cross-encoder-ensemble margin distillation | best measured quality, in- and out-of-domain |
| v21/ (v2.1 variant) | 261k queries / 5 datasets (MS MARCO + NQ + HotpotQA + Quora + TriviaQA), POG-self-mined negatives on MS MARCO | multi-domain training base for further fine-tuning |

Architecture

text ── llama.cpp prefill (frozen UD-Q4_K_XL, LM head skipped, layers 38-41 trimmed)
          │  eval-callback intercepts l_out-{28,32,36}   [3 × n_tok × 2560]
          ▼  cached pooling slots per layer: prefix-mean · 4 segment-means · last-token
feats [3, 6, 2560]
          │ 1. learned per-slot softmax layer mix (γ)
          │ 2. RMSNorm  (Q4 outlier stabilization)
          │ 3. instruction-aware attention pooling  (prefix queries content slots)
          │ 4. X-GLU fusion of attention-pooled vs mean-pooled views
          │ 5. residual SwiGLU bottleneck  2560 → 3072 → 2560
          │ 6. semi-orthogonal Matryoshka head  → 2560
          ▼
embedding ∈ R^2560   (truncate to 1024 / 512 / 256 and re-normalize)
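
The final truncation step in the diagram can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the code in pog_embed.py: the helper names `truncate_and_normalize` and `cosine` are hypothetical, and the 8-dim vector stands in for a real 2560-dim embedding.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale them to unit L2 norm."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)  # a zero vector has no direction to normalize
    return [x / norm for x in prefix]

def cosine(a, b):
    """For unit-norm inputs the dot product is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dim stand-in for a 2560-dim embedding (hypothetical values).
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.1, -0.1, 0.1]
short = truncate_and_normalize(full, 4)
print(short, cosine(short, short))
```

Re-normalizing after the cut matters because a prefix of a unit vector is generally shorter than unit length, so dot products between raw prefixes would no longer be cosines.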

True early exit: extractor/trim_gguf.py rewrites the GGUF without layers 38-41, slicing the fused per-layer-embedding tensors and patching the shared-KV metadata. The trimmed model is verified bit-identical (max abs diff 0.0) at all interception layers. The result is a 3.8 GB model with ~10% faster prefill, and the 262k-vocab LM head is skipped by requesting zero outputs. Sustained throughput: 9,000 tok/s on one RTX 5080.

Training (v1 flagship)

  • Data: MS MARCO, 80k queries × (1 positive + 4 hard negatives) with continuous cross-encoder ensemble margins (sentence-transformers/msmarco, bert-ensemble-margin-mse).
  • Loss: MarginMSE distillation + Matryoshka Nested Contrastive (in-batch InfoNCE, learned temperature) at every truncation m ∈ {256, 512, 1024, 2560}, low dims weighted hardest.
  • Phase 1 (feature caching): pure inference, ~4.5 GB VRAM. Phase 2 (adapter): 22 min, < 7 GB VRAM, 10,240 in-batch documents per step; the backbone is not in memory at all.
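
The two loss terms named above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in training/: the per-dimension weights in `dim_weights`, the fixed temperature (the release learns it), the use of raw dot products without per-truncation normalization, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def margin_mse(s_pos, s_neg, t_pos, t_neg):
    """MSE between student margins (s_pos - s_neg) and teacher margins (t_pos - t_neg)."""
    diffs = [(sp - sn) - (tp - tn)
             for sp, sn, tp, tn in zip(s_pos, s_neg, t_pos, t_neg)]
    return sum(d * d for d in diffs) / len(diffs)

def info_nce(queries, docs, dim, temperature):
    """In-batch InfoNCE on the first `dim` coordinates; docs[i] is the positive of queries[i]."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q[:dim], d[:dim]) / temperature for d in docs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_norm - logits[i]
    return loss / len(queries)

def nested_info_nce(queries, docs, dim_weights, temperature):
    """Weighted sum of InfoNCE over every Matryoshka truncation."""
    return sum(w * info_nce(queries, docs, m, temperature)
               for m, w in dim_weights.items())

# Toy batch: 4-dim embeddings, truncations {2, 4}, smaller dim weighted more (assumed weights).
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
docs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
dim_weights = {2: 2.0, 4: 1.0}
aligned = nested_info_nce(queries, docs, dim_weights, temperature=0.05)
print(aligned, margin_mse([1.0], [0.5], [3.0], [1.0]))
```

In the real setup the truncation set would be {256, 512, 1024, 2560} and each query would see 10,240 in-batch documents instead of two.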

Evaluation

All numbers NDCG@10, measured locally with the identical harness, corpora, and metric code (eval/, raw results JSON included). The harness is anchored: all-MiniLM-L6-v2 reproduces its canonical MTEB scores in it (measured ≈ published: NFCorpus 0.317 ≈ 0.32, SciFact 0.645 ≈ 0.645, SciDocs 0.216 ≈ 0.216), and zembed-1 reproduces its model-card example and is batch-consistent.
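
For readers unfamiliar with the metric, here is a minimal NDCG@10 sketch. It uses the linear-gain formulation (gain = relevance label, discount = log2(rank + 1)); whether the harness in eval/ uses exactly this variant is an assumption, and the function names are illustrative.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (ranks are 1-based)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """relevance maps doc id -> graded relevance; ids not in the map count as 0."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(gains, k) / ideal_dcg

# One query with a single relevant passage "a" (hypothetical ids).
print(ndcg_at_k(["a", "b", "c"], {"a": 1}))  # relevant passage ranked first
print(ndcg_at_k(["b", "a", "c"], {"a": 1}))  # relevant passage ranked second
```

The reported score is the mean of this per-query value over all evaluated queries.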

In-domain: MS MARCO dev (held-out; 2,000 queries, 60,335-passage corpus)

| Model | dim | NDCG@10 | R@10 |
|---|---|---|---|
| raw backbone L32 mean-pool (zero-shot) | 2560 | 0.097 | 0.160 |
| zembed-1 (zero-shot) | 2560 | 0.228 | 0.361 |
| all-MiniLM-L6-v2 (trained on MS MARCO; contaminated anchor) | 384 | 0.926 | 0.994 |
| POG-v1 | 256 | 0.746 | 0.877 |
| POG-v1 | 2560 | 0.770 | 0.900 |
| POG-v2.1 | 256 | 0.746 | 0.889 |
| POG-v2.1 | 2560 | 0.770 | 0.906 |

POG reaches 7.9× the NDCG@10 of the raw frozen features (0.770 vs 0.097); the 256-dim prefix keeps 96.9% of full-dimension quality (0.746 vs 0.770).

Out-of-domain: BEIR zero-shot

| Task | POG-v1@2560 | POG-v2.1@2560 | zembed-1@2560 | MiniLM |
|---|---|---|---|---|
| NFCorpus | 0.231 | 0.216 | 0.180 | 0.317 |
| SciFact | 0.494 | 0.489 | 0.404 | 0.645 |
| FiQA-2018 | 0.182 | 0.171 | 0.075 | 0.369 |
| ArguAna | 0.334 | 0.312 | 0.058 | 0.370 |
| SciDocs | 0.129 | 0.128 | 0.045 | 0.216 |

POG beats zembed-1 on all five BEIR tasks and in-domain at 1/114th the trainable parameters. zembed-1's public-BEIR weakness mirrors its own README (it trails its base model on Web Search; it targets ZeroEntropy's private distributions). A 22M MiniLM trained on 1B+ pairs still wins zero-shot transfer: the gap is data scale, not architecture.

The v2 ablation: what we learned scaling to 5 datasets

v2 (naive): 261k queries, hard negatives self-mined by POG-v1 across the full pool → transfer regressed (NFCorpus 0.192, SciFact 0.464, FiQA 0.157, versus 0.231, 0.494, 0.182 for v1). Wiki-QA questions have many valid answer passages; top-k self-mining labels true positives as negatives, and the 0.95-relative false-negative guard cannot catch them all.
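
A sketch of what a 0.95-relative guard could look like, and why it leaks. The exact rule used in training/ is not spelled out here, so this assumes the common form: discard any mined candidate whose miner score reaches 95% of the labeled positive's score. The function name, the `max_negatives=4` default, and the toy scores are hypothetical.

```python
def filter_mined_negatives(pos_score, candidates, rel_threshold=0.95, max_negatives=4):
    """Keep the hardest candidates scoring below rel_threshold * pos_score.

    candidates: list of (doc_id, miner_score) pairs in any order.
    Assumes positive similarity scores.
    """
    ceiling = rel_threshold * pos_score
    kept = [(doc_id, score) for doc_id, score in candidates if score < ceiling]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # hardest survivors first
    return [doc_id for doc_id, _ in kept[:max_negatives]]

# Labeled positive scores 0.80, so the ceiling is 0.76.
mined = [("d1", 0.79), ("d2", 0.75), ("d3", 0.60), ("d4", 0.77)]
print(filter_mined_negatives(0.80, mined))
```

Here d1 and d4 are dropped as likely false negatives, but d2 passes. If d2 is a second valid answer passage that simply scores a little lower than the labeled one, which is common for Wiki-QA questions, it ends up in training as a negative.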

v2.1 (fixed): mined negatives kept only for MS MARCO (shallow-passage regime where mining is sound), curated seed negatives for HotpotQA/Quora, in-batch only for NQ/TriviaQA → recovered to parity in-domain and near-v1 transfer, with better recall. Takeaway: at this scale, one high-quality margin-distillation source transfers better than diluted multi-dataset training, and self-mining needs a corpus whose label structure supports it. Both checkpoints and the full pipeline (training/) are included so you can push past this.

Vector compression (POG-v1 flagship, same harness)

| Encoding | bytes/vector | MS MARCO dev NDCG@10 | % of fp16 |
|---|---|---|---|
| fp16 @2560 | 5,120 | 0.7702 | 100% |
| int8 @2560 | 2,560 | 0.7701 | 100.0% |
| binary @2560, fp32 queries | 320 | 0.762 | 98.9% |
| binary @2560 symmetric | 320 | 0.752 | 97.6% |
| int8 @256 | 256 | 0.745 | 96.7% |
| binary @256, fp32 queries | 32 | 0.674 | 87.4% |
| binary @256 symmetric | 32 | 0.592 | 76.9% |

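int8 is effectively free at both measured dimensions (0.7701 vs 0.7702 at 2560; 0.745 vs 0.746 at 256). A 320-byte binary document vector retains 98.9% of retrieval quality when queries stay in fp32, while being 16× smaller than fp16 and 32× smaller than fp32. Full per-task tables in eval/results_quant.json.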
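
The byte counts in the table follow from one bit per dimension (2560 / 8 = 320). The sketch below shows one plausible scheme, assuming sign-based binarization with a zero threshold; the actual thresholding and scoring code in this release may differ, and all function names are illustrative.

```python
def binarize(vec):
    """Pack sign bits (1 where x > 0) into bytes, 8 dimensions per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def hamming_similarity(a, b):
    """Symmetric scoring: count of matching bits between two packed codes."""
    diff = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return 8 * len(a) - bin(diff).count("1")

def asymmetric_score(query, code):
    """fp32 query vs binary document: add q_i where the bit is set, subtract otherwise."""
    return sum(q if (code[i // 8] >> (i % 8)) & 1 else -q
               for i, q in enumerate(query))

# Toy 8-dim document (hypothetical values); a 2560-dim one packs into 320 bytes.
doc = [0.3, -0.2, 0.5, -0.1, 0.2, 0.4, -0.6, 0.1]
code = binarize(doc)
print(len(binarize([0.1] * 2560)), hamming_similarity(code, code), asymmetric_score(doc, code))
```

The asymmetric variant keeps the query's magnitudes, which is consistent with the "fp32 queries" rows scoring above the symmetric rows at the same storage cost.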

adapter.int8.safetensors: per-channel INT8 adapter weights (35 MB vs 142 MB), embedding cosine to fp32 ≥ 0.9998.
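
Per-channel INT8 stores one byte per weight plus one scale per output channel, which matches the roughly 4× size reduction above. The sketch assumes symmetric absmax quantization per output row; the export script in training/ may use a different scheme, and the function names are hypothetical.

```python
def quantize_per_channel(weight):
    """weight: list of rows, one row per output channel. Returns (int8 rows, scales)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest step and clamp to the symmetric int8 range.
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate float weights: q * scale of its channel."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

# Two channels with very different magnitudes (toy values).
W = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
q, s = quantize_per_channel(W)
W_hat = dequantize(q, s)
print(q, s)
```

A separate scale per channel keeps the small-magnitude second row from being crushed by the first row's range, which a single tensor-wide scale would do.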

Usage

# Build the extractor against a local llama.cpp build ($LLAMA_CPP is the checkout root)
g++ -O2 -std=c++17 extractor/pog-extract.cpp \
    -I$LLAMA_CPP/include -I$LLAMA_CPP/ggml/include \
    -L$LLAMA_CPP/build/bin -lllama -lggml -lggml-base \
    -Wl,-rpath,$LLAMA_CPP/build/bin -o pog-extract
# Fetch the quantized backbone
hf download unsloth/gemma-4-E4B-it-qat-GGUF gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf
# Optional: drop layers 38-41 for ~10% faster prefill
python extractor/trim_gguf.py gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf gemma-4-E4B-L38.gguf 38

# Python: embed one query and two documents at 256 dims
from pog_embed import POGEmbedder

emb = POGEmbedder(gguf_path="gemma-4-E4B-L38.gguf", adapter_path=".",
                  extractor_bin="./pog-extract")   # persistent server mode by default
q = emb.encode_query(["what is backpropagation"], dim=256)
d = emb.encode_document(["Backpropagation computes gradients...",
                         "The Eiffel tower is in Paris."], dim=256)
print(q @ d.T)

Honest deviations from the original POG blueprint

| Blueprint | This release | Why |
|---|---|---|
| 5M synthetic multimodal triplets | 80k-261k text queries | single-GPU budget; multimodal slots reserved in the format |
| <think>-trace dual-phase pooling | X-GLU fuses attention-pooled vs mean-pooled views | embedding inputs are raw text, not generated traces |
| >70B Elo cross-encoder ensemble | BERT cross-encoder ensemble margins | strongest public continuous-margin set for MS MARCO |
| intercept ffn_down | intercept l_out (full residual stream) | strictly more information, same cost |
| early-exit after layer 36 | GGUF trimmed to 38 layers (bit-exact) + LM head skip | graph is static per ubatch; layer 37 kept so l_out-36 isn't row-filtered |
| d=3072 | d=2560 (actual E4B width) | blueprint figure didn't match the shipped model |
| "8 KB → <128 B" compression | 5,120 B → 320 B at 99% (32 B at 87.4%) | measured, not projected |

Files

  • adapter.safetensors / adapter.int8.safetensors / adapter_config.json: v1 flagship
  • v21/: v2.1 multi-dataset variant (fp32 + int8 + config)
  • extractor/: pog-extract.cpp (eval-callback interception + --serve), trim_gguf.py
  • pog_adapter.py, pog_embed.py: model + end-to-end embedder
  • training/: full reproducible pipeline: data prep (v1+v2), mining, training, int8 export
  • eval/: harness + all raw results (results.json, results_indomain.json, results_quant.json)