kompress-v2-base

Extractive prompt compressor for LLM proxies. Predicts a keep/drop label per token; the surviving tokens form a compressed version of the input that preserves meaning while reducing token count.

Based on ModernBERT-base (149M params) with a LoRA adapter (3.4M trainable params, 2.2%) plus a custom dual head (token classifier + 1-D span conv). Trained on 126,617 accepted Pipeline A+B labels (compressor + faithfulness judge) drawn from 17 source datasets that span narrative, dialog, code, agent traces, healthcare, finance, government, scientific, web, summary, and tool-calling domains.

Quick start

import torch
from transformers import AutoTokenizer

# Option A: load the merged checkpoint (no LoRA needed)
state = torch.load("merged.pt", map_location="cpu")

# Option B: load via the kompress package
from kompress.model.architecture import HeadroomCompressorV2
from kompress.model.config import V2_BASE
import json

with open("config.json") as f:
    cfg_dict = json.load(f)
cfg = V2_BASE  # or rebuild from cfg_dict
model = HeadroomCompressorV2(cfg)
model.load_state_dict(torch.load("merged.pt", map_location="cpu"), strict=False)
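# Fall back to CPU when no CUDA device is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval().to(device)

tokenizer = AutoTokenizer.from_pretrained("chopratejas/kompress-v2-base")

# Compress
text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt").to(device)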
with torch.no_grad():
    out = model(**enc)
scores = out["final_scores"][0]    # P(keep) per subword
keep = (scores >= 0.5)
kept_tokens = enc["input_ids"][0][keep]
print(tokenizer.decode(kept_tokens, skip_special_tokens=True))

Threshold tuning

The model emits final_scores ∈ [0, 1] per subword. Adjust the threshold to trade compression aggressiveness for must-keep recall: a higher threshold drops more tokens but recovers fewer of the must-keep tokens.

Threshold       keep_rate          must_keep_recall  F1     Best for
0.30            0.917 (8% drop)    0.994             0.904  Conservative
0.40            0.867 (13% drop)   0.987             0.913  Safe
0.50 (default)  0.815 (18% drop)   0.974             0.918  Balanced
0.60            0.765 (23% drop)   0.950             0.915  Aggressive
0.70            0.705 (30% drop)   0.908             0.898  Very aggressive

Evaluated on the held-out test split (n=7,037 examples, stratified by domain).
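
To recompute the table's columns on your own labeled data, a sweep along these lines may help. It is a minimal sketch, not the evaluation script behind the numbers above. It assumes flat per-token lists of scores, binary keep labels, and a must-keep mask, and it assumes F1 is measured on the keep class. The function name sweep_thresholds and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def sweep_thresholds(
    scores: Sequence[float],
    keep_labels: Sequence[int],
    must_keep: Sequence[int],
    thresholds: Sequence[float] = (0.3, 0.4, 0.5, 0.6, 0.7),
) -> List[Dict[str, float]]:
    """Return keep_rate, must_keep_recall, and keep-class F1 per threshold."""
    rows = []
    for t in thresholds:
        pred = [1 if s >= t else 0 for s in scores]
        pairs = list(zip(pred, keep_labels))
        tp = sum(1 for p, y in pairs if p == 1 and y == 1)
        fp = sum(1 for p, y in pairs if p == 1 and y == 0)
        fn = sum(1 for p, y in pairs if p == 0 and y == 1)
        # Recall restricted to tokens flagged as must-keep
        mk_total = sum(must_keep)
        mk_kept = sum(1 for p, m in zip(pred, must_keep) if p == 1 and m == 1)
        denom = 2 * tp + fp + fn
        rows.append({
            "threshold": t,
            "keep_rate": sum(pred) / len(pred),
            "must_keep_recall": mk_kept / mk_total if mk_total else 1.0,
            "f1": 2 * tp / denom if denom else 0.0,
        })
    return rows


# Toy data (assumed): six tokens, three of them flagged must-keep
scores = [0.95, 0.20, 0.65, 0.45, 0.80, 0.35]
keep_labels = [1, 0, 1, 1, 1, 0]
must_keep = [1, 0, 1, 0, 1, 0]
rows = sweep_thresholds(scores, keep_labels, must_keep)
```

Each row is a dict keyed by threshold, keep_rate, must_keep_recall, and f1, so the result can be printed or loaded into a DataFrame to pick an operating point.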

Training data

  • 126,617 labeled training examples after min_drop_ratio=0.05 filtering and same-conversation packing.
  • Sources: arxiv, pubmed-scientific, govreport, swe-smith, swe-gym-openhands, toolmind, xlam-fc, fineweb-edu, cnn-dailymail, xsum, glaive-fc, lmsys-chat, claude-code-sessions, meetingbank, the-stack-smol-md, samsum, swe-bench-verified.
  • Labeler: DeepSeek-V4-Flash (compressor) + DeepSeek-V4-Pro (judge) with Pipeline A + B faithfulness loop. Hard-keep overlay enforces names, dates, numbers, URLs, code identifiers via GLiNER + regex + lexicons.
  • Bucket split: short=48%, mid=31%, long=21% (max_length 8,192 native ModernBERT context).
  • Split: train=126,617 / val=7,037 / test=7,037.

Training details

  • Base: ModernBERT-base (149M params)
  • Encoder fine-tuning: LoRA (r=16, alpha=32, target_modules=Wqkv/Wi/Wo)
  • Heads: per-token CE (must-keep loss weight = 3.0) + 1-D span conv (BCE, weight 0.3 on total loss)
  • Trainable params: 3.4M (2.2% of total)
  • Loss: weighted cross-entropy on token head + BCE-with-logits on span head
  • Optim: AdamW (lr=2e-4 cosine, warmup_ratio=0.06, weight_decay=0.01)
  • Effective batch: 48 (12 × 4 grad-accum)
  • Epochs: 3
  • Precision: bf16 with FlashAttention-2 + gradient checkpointing
  • Hardware: 1×H100 80GB, ~39 min wall-clock
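
One way to read the loss described in the list above is a weighted token cross-entropy plus a span BCE term scaled by 0.3. The framework-free sketch below follows that reading. It assumes the 3.0 weight applies per token to positions flagged must-keep, that the token loss is a weighted mean, and that the span term is simply added after scaling. None of these details are confirmed by this README, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

MUST_KEEP_WEIGHT = 3.0  # value taken from the list above
SPAN_LOSS_WEIGHT = 0.3  # value taken from the list above


def token_ce(logits: Tuple[float, float], label: int) -> float:
    """Two-class cross-entropy for one token; logits are (drop, keep)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def combined_loss(
    token_logits: Sequence[Tuple[float, float]],
    labels: Sequence[int],
    must_keep: Sequence[int],
    span_logits: Sequence[float],
    span_targets: Sequence[float],
) -> float:
    """Weighted-mean token CE plus scaled mean span BCE (assumed combination)."""
    weights = [MUST_KEEP_WEIGHT if m else 1.0 for m in must_keep]
    ce = sum(
        w * token_ce(l, y) for w, l, y in zip(weights, token_logits, labels)
    ) / sum(weights)
    bce = sum(
        bce_with_logits(z, t) for z, t in zip(span_logits, span_targets)
    ) / len(span_logits)
    return ce + SPAN_LOSS_WEIGHT * bce


# Token 0 is wrongly predicted as "drop"; flagging it must-keep raises the loss.
logits = [(2.0, -2.0), (2.0, -2.0)]
plain = combined_loss(logits, [1, 0], [0, 0], [0.0, 0.0], [1.0, 0.0])
heavier = combined_loss(logits, [1, 0], [1, 0], [0.0, 0.0], [1.0, 0.0])
```

Under these assumptions, a mistake on a must-keep token is penalized more than the same mistake on an ordinary token, which would push the model toward higher must-keep recall.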

Final metrics (test split, threshold=0.5)

  • eval_f1: 0.918
  • eval_must_keep_recall: 0.974
  • eval_keep_rate: 0.815 (18% compression)
  • eval_loss: 0.34

Files in this repo

config.json           # KompressV2Config + arch metadata
model.safetensors     # ~600 MB — best checkpoint, LoRA merged into the encoder
merged.pt             # ~600 MB — full state dict, alias for safetensors load
tokenizer.json        # ModernBERT-base tokenizer
tokenizer_config.json
special_tokens_map.json
adapter/              # LoRA adapter ONLY (~30 MB), for stacking per-org adapters
  adapter_config.json
  adapter_model.safetensors
  token_head.pt
  span_conv.pt
export_coreml.py      # CoreML conversion script (added for Apple Silicon optimization)
coreml/               # Compiled CoreML packages
  kompress.mlpackage  # Optimized ANE-ready CoreML model
README.md             # this file

CoreML Support (Apple Neural Engine / GPU)

This repository includes support for compiling and running kompress-v2-base natively on Apple Silicon hardware (M-series Macs, iOS 17+, macOS 14+) via CoreML.

Exporting the Model

You can re-export or customize the CoreML conversion by running the export script:

pip install coremltools torch transformers numpy
python3 export_coreml.py

Swift Integration

Load and run the compiled .mlpackage in your Swift application using the MLMultiArray interface:

import CoreML

// inputIds: [Int32], attentionMask: [Int32]
let idsArray = try MLMultiArray(shape: [1, NSNumber(value: inputIds.count)], dataType: .int32)
let maskArray = try MLMultiArray(shape: [1, NSNumber(value: attentionMask.count)], dataType: .int32)

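// Populate the arrays (linear indexing is valid for a [1, N] shape)
for (i, id) in inputIds.enumerated() { idsArray[i] = NSNumber(value: id) }
for (i, m) in attentionMask.enumerated() { maskArray[i] = NSNumber(value: m) }

// `kompress` is the class Xcode generates from kompress.mlpackage
let model = try kompress(configuration: MLModelConfiguration())
let output = try model.prediction(input_ids: idsArray, attention_mask: maskArray)
let scores = output.final_scores // MLMultiArray of per-token Float32 P(keep)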

License

Apache 2.0. Free for commercial use. ModernBERT-base is also licensed under Apache 2.0.
