kompress-v2-base

Extractive prompt compressor for LLM proxies. Predicts a keep/drop label per token; the surviving tokens form a compressed version of the input that preserves meaning while reducing token count.

Based on ModernBERT-base (149M params) with a LoRA adapter (3.4M trainable params, 2.2%) plus a custom dual head (token classifier + 1-D span conv). Trained on 126,617 accepted Pipeline A+B labels (compressor + faithfulness judge) drawn from 17 source datasets spanning narrative, dialog, code, agent traces, healthcare, finance, government, scientific, web, summary, and tool-calling text.
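As a rough sanity check on the 3.4M trainable-parameter figure, the sketch below counts LoRA weights using the settings listed under Training details (r=16 on Wqkv/Wi/Wo). The layer dimensions (hidden size 768, MLP intermediate size 1152, 22 layers, gated MLP input projection) are assumptions about ModernBERT-base that this README does not state, and the small token and span heads are left out of the count.

```python
# Assumed ModernBERT-base dimensions (not stated in this README).
HIDDEN = 768
INTERMEDIATE = 1152
LAYERS = 22
RANK = 16  # LoRA r, from the Training details section


def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    """A LoRA pair adds A (rank x in) and B (out x rank) weights."""
    return rank * (in_features + out_features)


per_layer = (
    lora_params(HIDDEN, 3 * HIDDEN)          # attention Wqkv
    + lora_params(HIDDEN, HIDDEN)            # attention Wo
    + lora_params(HIDDEN, 2 * INTERMEDIATE)  # MLP Wi (gated, so twice the width)
    + lora_params(INTERMEDIATE, HIDDEN)      # MLP Wo
)
total_lora = per_layer * LAYERS
share = 100 * total_lora / (149e6 + total_lora)

print(total_lora)       # 3379200
print(round(share, 1))  # 2.2
```

Under these assumptions the adapter alone accounts for roughly 3.38M parameters, which is in line with the 3.4M (2.2%) reported above.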

Quick start

import json

import torch
from transformers import AutoTokenizer

from kompress.model.architecture import HeadroomCompressorV2
from kompress.model.config import V2_BASE

# Fall back to CPU when no CUDA device is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Option A: load the merged checkpoint as a raw state dict (no LoRA needed)
state = torch.load("merged.pt", map_location="cpu")

# Option B: build the model via the kompress package and load the same weights
with open("config.json") as f:
    cfg_dict = json.load(f)
cfg = V2_BASE  # or rebuild from cfg_dict
model = HeadroomCompressorV2(cfg)
# strict=False tolerates key mismatches; inspect the returned missing/unexpected keys
model.load_state_dict(state, strict=False)
model.eval().to(device)

tokenizer = AutoTokenizer.from_pretrained("chopratejas/kompress-v2-base")

# Compress
text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    out = model(**enc)
scores = out["final_scores"][0]    # P(keep) per subword
keep = (scores >= 0.5)             # boolean mask at the default threshold
kept_tokens = enc["input_ids"][0][keep]
print(tokenizer.decode(kept_tokens, skip_special_tokens=True))

Threshold tuning

The model emits final_scores ∈ [0, 1] per subword. Adjust the threshold to trade compression aggressiveness for must-keep recall.

Threshold	keep_rate	must_keep_recall	F1	best for
0.30	0.917 (8% drop)	0.994	0.904	Conservative
0.40	0.867 (13% drop)	0.987	0.913	Safe
0.50 (default)	0.815 (18% drop)	0.974	0.918	Balanced
0.60	0.765 (23% drop)	0.950	0.915	Aggressive
0.70	0.705 (30% drop)	0.908	0.898	Very aggressive

Evaluated on the held-out test split (n=7,037 examples, stratified by domain).
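To make the trade-off concrete, here is a minimal sketch of applying different thresholds to a list of per-token scores and measuring the resulting keep rate. The helper name `apply_threshold` and the example scores are made up for illustration; they are not model outputs and not part of the kompress package.

```python
from typing import List, Sequence, Tuple


def apply_threshold(
    tokens: Sequence[str], scores: Sequence[float], threshold: float = 0.5
) -> Tuple[List[str], float]:
    """Keep tokens whose P(keep) is at or above the threshold.

    Returns the surviving tokens and the keep rate (kept / total).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    kept = [tok for tok, score in zip(tokens, scores) if score >= threshold]
    keep_rate = len(kept) / len(tokens) if tokens else 0.0
    return kept, keep_rate


# Hypothetical scores, chosen only to show how the threshold moves the keep rate.
tokens = ["The", "quick", "brown", "fox", "jumps"]
scores = [0.35, 0.55, 0.65, 0.95, 0.90]

for threshold in (0.3, 0.5, 0.7):
    kept, rate = apply_threshold(tokens, scores, threshold)
    print(threshold, kept, rate)
```

Raising the threshold can only shrink the kept set, so keep_rate falls monotonically as the threshold rises, matching the direction of the table above.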

Training data

126,617 labeled examples after min_drop_ratio=0.05 filtering and same-conversation packing.
Sources: arxiv, pubmed-scientific, govreport, swe-smith, swe-gym-openhands, toolmind, xlam-fc, fineweb-edu, cnn-dailymail, xsum, glaive-fc, lmsys-chat, claude-code-sessions, meetingbank, the-stack-smol-md, samsum, swe-bench-verified.
Labeler: DeepSeek-V4-Flash (compressor) + DeepSeek-V4-Pro (judge) with Pipeline A + B faithfulness loop. Hard-keep overlay enforces names, dates, numbers, URLs, code identifiers via GLiNER + regex + lexicons.
Bucket split: short=48%, mid=31%, long=21% (max_length 8,192 native ModernBERT context).
Split: train=126,617 / val=7,037 / test=7,037.
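The hard-keep overlay mentioned above can be pictured as a mask that forces certain tokens to the keep label regardless of what the labeler chose. The sketch below covers only a regex part, with illustrative patterns for URLs, ISO dates, and numbers. The patterns, function names, and word-level granularity are assumptions made for this example; the actual overlay also relies on GLiNER and lexicons, which are not reproduced here.

```python
import re
from typing import List

# Illustrative patterns only; not the patterns used to build the training labels.
HARD_KEEP_PATTERNS = [
    re.compile(r"https?://\S+"),       # URLs
    re.compile(r"\d{4}-\d{2}-\d{2}"),  # ISO-style dates
    re.compile(r"\d+(?:[.,]\d+)*%?"),  # integers, decimals, percentages
]


def hard_keep_mask(words: List[str]) -> List[bool]:
    """True for every word that matches at least one hard-keep pattern."""
    return [any(p.search(w) for p in HARD_KEEP_PATTERNS) for w in words]


def apply_overlay(labels: List[int], words: List[str]) -> List[int]:
    """Force label 1 (keep) wherever the hard-keep mask fires."""
    mask = hard_keep_mask(words)
    return [1 if forced else label for label, forced in zip(labels, mask)]


words = "Revenue grew 12.5% by 2024-03-31 , see https://example.com for details".split()
labels = [0] * len(words)  # pretend the labeler wanted to drop everything
overlaid = apply_overlay(labels, words)
print(overlaid)  # [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
```

The overlay only ever turns drops into keeps, so it cannot lower must-keep recall on the entities it covers.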

Training details

Base: ModernBERT-base (149M params)
Encoder fine-tuning: LoRA (r=16, alpha=32, target_modules=Wqkv/Wi/Wo)
Heads: per-token CE (must-keep loss weight = 3.0) + 1-D span conv (BCE, weight 0.3 on total loss)
Trainable params: 3.4M (2.2% of total)
Loss: weighted cross-entropy on token head + BCE-with-logits on span head
Optim: AdamW (lr=2e-4 cosine, warmup_ratio=0.06, weight_decay=0.01)
Effective batch: 48 (12 × 4 grad-accum)
Epochs: 3
Precision: bf16 with FlashAttention-2 + gradient checkpointing
Hardware: 1×H100 80GB, ~39 min wall-clock
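One plausible reading of how the two head losses listed above combine is sketched below in plain Python. Several details are assumptions, not confirmed by this README: the 3.0 weight is applied to tokens whose gold label is keep, the token loss is normalized by the sum of weights (as PyTorch's weighted cross-entropy does with mean reduction), the two-class cross-entropy is written in its equivalent binary form, and the 0.3 factor multiplies the span term before it is added to the token term.

```python
import math
from typing import Sequence

MUST_KEEP_WEIGHT = 3.0  # from the Heads line above
SPAN_WEIGHT = 0.3       # from the Heads line above


def token_ce(keep_probs: Sequence[float], labels: Sequence[int],
             keep_weight: float = MUST_KEEP_WEIGHT) -> float:
    """Class-weighted cross-entropy over tokens; label 1 means keep."""
    total, norm = 0.0, 0.0
    for p, y in zip(keep_probs, labels):
        w = keep_weight if y == 1 else 1.0
        total += -w * math.log(p if y == 1 else 1.0 - p)
        norm += w
    return total / norm


def span_bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy on raw logits, in the numerically stable form."""
    losses = [
        max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        for z, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def total_loss(keep_probs, labels, span_logits, span_targets) -> float:
    """Token loss plus the down-weighted span loss (assumed combination)."""
    return token_ce(keep_probs, labels) + SPAN_WEIGHT * span_bce_with_logits(
        span_logits, span_targets
    )


# Maximally uncertain predictions: both terms equal ln 2.
loss = total_loss([0.5, 0.5], [1, 0], [0.0], [1.0])
print(loss)
```

Weighting the keep class more heavily penalizes dropping a must-keep token more than keeping a droppable one, which is consistent with the high must_keep_recall reported at the default threshold.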

Final metrics (test split, threshold=0.5)

eval_f1: 0.918
eval_must_keep_recall: 0.974
eval_keep_rate: 0.815 (18% compression)
eval_loss: 0.34
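For readers who want to reproduce these numbers on their own data, the sketch below shows one way such metrics could be computed from binary keep/drop predictions. It assumes that F1 is measured on the keep class and that must_keep_recall is the recall over tokens labeled keep; the evaluation code itself is not part of this README, so treat the definitions and the function name `compression_metrics` as assumptions.

```python
from typing import Dict, Sequence


def compression_metrics(pred_keep: Sequence[int], gold_keep: Sequence[int]) -> Dict[str, float]:
    """Keep-class precision/recall/F1 plus the fraction of tokens kept."""
    if len(pred_keep) != len(gold_keep):
        raise ValueError("predictions and labels must have the same length")
    tp = sum(1 for p, g in zip(pred_keep, gold_keep) if p and g)
    fp = sum(1 for p, g in zip(pred_keep, gold_keep) if p and not g)
    fn = sum(1 for p, g in zip(pred_keep, gold_keep) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    keep_rate = sum(1 for p in pred_keep if p) / len(pred_keep) if pred_keep else 0.0
    return {
        "f1": f1,
        "must_keep_recall": recall,
        "keep_rate": keep_rate,
    }


# Toy example: one droppable token is kept, no must-keep token is lost.
metrics = compression_metrics([1, 1, 1, 0, 1], [1, 1, 0, 0, 1])
print(metrics)
```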

Files in this repo

config.json           # KompressV2Config + arch metadata
model.safetensors     # ~600 MB — best checkpoint, LoRA merged into the encoder
merged.pt             # ~600 MB — full state dict, alias for safetensors load
tokenizer.json        # ModernBERT-base tokenizer
tokenizer_config.json
special_tokens_map.json
adapter/              # LoRA adapter ONLY (~30 MB), for stacking per-org adapters
  adapter_config.json
  adapter_model.safetensors
  token_head.pt
  span_conv.pt
export_coreml.py      # CoreML conversion script (added for Apple Silicon optimization)
coreml/               # Compiled CoreML packages
  kompress.mlpackage  # Optimized ANE-ready CoreML model
README.md             # this file

CoreML Support (Apple Neural Engine / GPU)

This repository includes support for compiling and running kompress-v2-base natively on Apple Silicon hardware (M-series Macs, iOS 17+, macOS 14+) via CoreML.

Exporting the Model

You can re-export or customize the CoreML conversion by running the export script:

pip install coremltools torch transformers numpy
python3 export_coreml.py

Swift Integration

Load and run the compiled .mlpackage in your Swift application using the MLMultiArray interface:

import CoreML

// inputIds: [Int32], attentionMask: [Int32] (same length, produced by the tokenizer)
let idsArray = try MLMultiArray(shape: [1, NSNumber(value: inputIds.count)], dataType: .int32)
let maskArray = try MLMultiArray(shape: [1, NSNumber(value: attentionMask.count)], dataType: .int32)

// Copy the token IDs and attention mask into the arrays
for i in 0..<inputIds.count {
    idsArray[i] = NSNumber(value: inputIds[i])
    maskArray[i] = NSNumber(value: attentionMask[i])
}

// Load the model and run prediction
let model = try kompress(configuration: MLModelConfiguration())
let output = try model.prediction(input_ids: idsArray, attention_mask: maskArray)
let scores = output.final_scores // per-token Float32 P(keep)

License

Apache 2.0. Free for commercial use. ModernBERT base is also Apache 2.0.
