0.6b-4b LCLM, 16× compression
Latent Context Language Model: an encoder–decoder compressor described in End-to-End Context Compression at Scale.
The text to compress should be wrapped between <|memory_start|> and
<|memory_end|>.
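As a minimal sketch of this wrapping convention, the helper below assembles a prompt from a document and an instruction. `build_prompt` is a hypothetical name and is not part of the LCLM codebase. Only the two marker strings come from this card, and the single space after the end marker simply mirrors the prompt shown under Quick load.

```python
MEMORY_START = "<|memory_start|>"
MEMORY_END = "<|memory_end|>"


def build_prompt(document: str, instruction: str) -> str:
    """Wrap `document` in the memory markers and append `instruction`.

    Hypothetical helper for illustration; not part of the LCLM codebase.
    """
    # Refuse documents that already contain a marker, since a nested marker
    # would make the compressed span ambiguous.
    if MEMORY_START in document or MEMORY_END in document:
        raise ValueError("document already contains a memory marker")
    return f"{MEMORY_START}{document}{MEMORY_END} {instruction}"
```

The text between the markers is what gets compressed. The instruction after `<|memory_end|>` stays outside the wrapped span.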
Running these checkpoints requires the LCLM codebase: https://github.com/LeonLixyz/LCLM. Standard
transformers.AutoModel and vllm.LLM will not load this format on their own.
Quick load
```python
from latent_context import LCLM

model = LCLM.from_pretrained("latent-context/0.6b-4b-LCLM-16x")

prompt = (
    "<|memory_start|>"
    "<long document, code, or text to compress>"
    "<|memory_end|> "
    "Summarize the document above."
)
# model.generate(...): see latent_context/inference/hf.py
```
vLLM serving (two-stage CLI)
The vLLM path runs the encoder and the decoder in separate
processes that hand off via a .pt file on disk. Running both in
one process runs out of GPU memory: vLLM preallocates most of the
device at init, leaving too little for the HF encoder.
```bash
# Step 1: HF encoder over a jsonl of prompts → embeds.pt
python -m inference.vllm_inference.encode \
    --checkpoint latent-context/0.6b-4b-LCLM-16x \
    --prompts-jsonl prompts.jsonl \
    --out embeds.pt

# Step 2: vLLM decoder reads embeds.pt → completions.jsonl
python -m inference.vllm_inference.decode \
    --checkpoint latent-context/0.6b-4b-LCLM-16x \
    --embeds-pt embeds.pt \
    --out completions.jsonl
```
See inference/examples/README.md in the codebase for the
prompts.jsonl schema and an end-to-end RULER NIAH eval driver.
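If it helps to drive both stages from one script, the sketch below runs them back to back as separate child processes, so the encoder's GPU memory is released before vLLM starts. The module paths and flags are the ones shown in the commands above. `build_commands` and `run_two_stage` are hypothetical names, and the sketch assumes it is launched from the root of the LCLM checkout so that `inference` is importable.

```python
import subprocess
import sys

CHECKPOINT = "latent-context/0.6b-4b-LCLM-16x"


def build_commands(prompts_jsonl, embeds_pt, out_jsonl, checkpoint=CHECKPOINT):
    """Return the encode and decode command lines as argument lists."""
    encode = [
        sys.executable, "-m", "inference.vllm_inference.encode",
        "--checkpoint", checkpoint,
        "--prompts-jsonl", prompts_jsonl,
        "--out", embeds_pt,
    ]
    decode = [
        sys.executable, "-m", "inference.vllm_inference.decode",
        "--checkpoint", checkpoint,
        "--embeds-pt", embeds_pt,
        "--out", out_jsonl,
    ]
    return encode, decode


def run_two_stage(prompts_jsonl, embeds_pt, out_jsonl):
    """Run the encoder, then the decoder, each in its own process."""
    for cmd in build_commands(prompts_jsonl, embeds_pt, out_jsonl):
        # check=True raises if a stage fails, so the decoder never runs
        # against a missing or partial embeds file.
        subprocess.run(cmd, check=True)
```

Calling `run_two_stage("prompts.jsonl", "embeds.pt", "completions.jsonl")` reproduces the two commands above with the same file names.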
Configuration
| field | value |
|---|---|
| encoder | Qwen/Qwen3-Embedding-0.6B |
| decoder | Qwen/Qwen3-4B-Instruct-2507 |
| compression_ratio | 16 |
| encoder_window_size | 1024 |
| pooling | mean |
| encoder_mask_type | causal |
| boundary_overlap | 0 |
| adapter_type | mlp |
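
To make the table concrete, the sketch below works through the arithmetic these fields suggest. It rests on assumptions this card does not confirm: that the encoder processes the context in non-overlapping windows of `encoder_window_size` tokens (consistent with `boundary_overlap` = 0), that each run of `compression_ratio` consecutive token states is mean-pooled into one latent vector, and that a trailing partial group still yields one vector. `latent_count` and `mean_pool` are hypothetical names, not functions from the LCLM codebase.

```python
import math

COMPRESSION_RATIO = 16
ENCODER_WINDOW_SIZE = 1024


def latent_count(n_tokens, ratio=COMPRESSION_RATIO, window=ENCODER_WINDOW_SIZE):
    """Number of latent vectors for n_tokens, pooling within each window.

    Assumption: a trailing partial group is kept (ceil), not dropped.
    """
    full_windows, remainder = divmod(n_tokens, window)
    return full_windows * math.ceil(window / ratio) + math.ceil(remainder / ratio)


def mean_pool(states, ratio=COMPRESSION_RATIO):
    """Average each run of `ratio` consecutive vectors into one vector."""
    pooled = []
    for start in range(0, len(states), ratio):
        group = states[start:start + ratio]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled
```

Under these assumptions, a full 1,024-token window yields 64 latent vectors and a 16,384-token context yields 1,024.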