kompress-v7

Token compression classifier fine-tuned from PeetPedro/kompress-v6 using a sliding-window subtoken override fix. Part of the ultrawhale fine-tuning loop.

What changed from v6

v6 found that self-labeling agent data with compress_with_override collapsed mk_in_ref to 0.652. Root cause: the override checked subtokens one at a time. TokenExpiredError splits into Token + Expired + Error, and none of these matches the CamelCase pattern on its own.

v7 fixes this with a sliding-window approach: the override now decodes windows of 1, 2, and 3 tokens and checks the combined string. TokenExpiredError, /var/log/app.log, and --verbose are now all force-kept correctly.
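The difference between the per-subtoken check and the sliding-window check can be sketched as follows. This is a minimal illustration, not the ultrawhale implementation: the three regexes, the function names matches and force_keep_mask, and the example token splits are all assumptions. Pieces are plain strings that are assumed to carry their own leading whitespace, so joining them stands in for decoding a token window.

```python
import re

# Hypothetical patterns; the regexes used by the real override are not shown in this card.
FORCE_KEEP = [
    re.compile(r"^(?:[A-Z][a-z0-9]+){2,}$"),  # CamelCase identifiers
    re.compile(r"^(?:/[\w.\-]+){2,}$"),       # absolute file paths
    re.compile(r"^--[a-z][\w\-]*$"),          # long CLI flags
]


def matches(text):
    """True if the text fully matches any force-keep pattern."""
    return any(p.match(text) for p in FORCE_KEEP)


def force_keep_mask(pieces, max_window=3):
    """Mark every piece covered by a window of 1..max_window pieces whose joined text matches."""
    keep = [False] * len(pieces)
    for size in range(1, max_window + 1):
        for start in range(len(pieces) - size + 1):
            if matches("".join(pieces[start:start + size])):
                for i in range(start, start + size):
                    keep[i] = True
    return keep


# Illustrative splits; a real tokenizer may split these strings differently.
camel = ["Token", "Expired", "Error"]
path = ["/var", "/log", "/app", ".log"]
flag = ["--", "verbose"]
prose = ["the", " error"]

per_subtoken = [matches(p) for p in camel]  # v6-style check: every piece fails alone
windowed = force_keep_mask(camel)           # v7-style check: windows recover the identifier
print(per_subtoken, windowed)
```

A string that splits into more pieces than the window covers is only kept where some window of at most three pieces happens to match, so the window size limits what this kind of override can recover.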

Results

Metric	v7 base	v7 + override	vs v6
heretic exact_pct	0.949	0.956	regression
keep_rate	0.868	0.869	↑ more conservative
override_delta	—	+0.007	override needed again
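The override_delta row matches the difference between the two heretic exact_pct columns. The short check below assumes that definition (score with override minus score without), which is inferred from the table and not stated in this card.

```python
# Values copied from the Results table above.
base_exact = 0.949      # heretic exact_pct, v7 base
override_exact = 0.956  # heretic exact_pct, v7 + override

# Assumed definition: gain in heretic exact_pct from switching the override on.
override_delta = round(override_exact - base_exact, 3)
print(f"override_delta = {override_delta:+.3f}")
```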

The fix worked mechanically (mk_in_ref recovered), but the resulting training labels, with more tokens force-kept by the sliding window, produced a more conservative model that needs the override again and scores lower on adversarial prompts. SSL bypass regressed from 0.789 (v6) to 0.684 (v7).

Loop conclusion

PeetPedro/kompress-v4 remains the production recommendation (heretic 0.967, override_delta=0). The agent-distribution fine-tuning direction (v5, v6, v7) raised keep_rate (0.823 in v4, 0.854 in v6, 0.868 in v7) while every version scored below v4 on heretic. More agent training → more conservative → worse adversarial accuracy.

CONCLUSION

Sliding-window self-labeling regressed heretic exact_pct to 0.956 (from 0.962 in v6 and 0.967 in v4). Training for tokenization artifacts is the wrong approach.

USECASE

Evidence that a regex override in production beats training for tokenization fixes.

Series

Version	heretic	keep_rate	override_delta	Notes
v4	0.967	0.823	0.000	production
v5	0.961	—	0.000	loop converged
v6	0.962	0.854	0.000	agent-distribution
v7	0.956	0.868	+0.007	sliding-window fix
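To make the trend across the series easier to read, the sketch below recomputes each version's change relative to v4 from the table above. The dictionary layout and the name deltas are illustrative only; the missing v5 keep_rate is kept as None and not filled in.

```python
# Rows copied from the Series table; None marks the value the table leaves blank.
series = {
    "v4": {"heretic": 0.967, "keep_rate": 0.823},
    "v5": {"heretic": 0.961, "keep_rate": None},
    "v6": {"heretic": 0.962, "keep_rate": 0.854},
    "v7": {"heretic": 0.956, "keep_rate": 0.868},
}

baseline = series["v4"]
deltas = {}
for version, row in series.items():
    d_heretic = round(row["heretic"] - baseline["heretic"], 3)
    if row["keep_rate"] is None:
        d_keep = None  # no keep_rate reported for this version
    else:
        d_keep = round(row["keep_rate"] - baseline["keep_rate"], 3)
    deltas[version] = (d_heretic, d_keep)
    print(version, d_heretic, d_keep)
```

Read this way, every version after v4 has a negative heretic delta, and the two versions with a reported keep_rate both sit above v4.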

Training code: ultrawhale

Model tree for PeetPedro/kompress-v7

Base model

answerdotai/ModernBERT-base
