PEEK: Picking Essential frames via Efficient Knowledge distillation
This repository hosts the pretrained weights for PEEK: Picking Essential frames via Efficient Knowledge distillation, a query-free frame selector for low-budget video captioning.
- Code (Apache-2.0): https://github.com/momentslab/peek
- Weights (this repo, CC-BY-NC-SA-4.0): peek_base.safetensors
What this is
PEEK receives only video frames at inference time. Given a budget of k frames, it predicts per-frame relevance scores and returns the selected frames in temporal order, ready to be forwarded to a downstream Video-Language Model.
The released peek_base checkpoint is a 2-layer Transformer (~1.7M parameters)
trained on ActivityNet Captions train.json by distilling a frozen,
caption-conditioned SigLIP2 teacher into a query-free scorer over frozen
MobileCLIP2-S0 frame embeddings.
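
To make the budgeted selection step concrete, here is a minimal sketch of one plausible reading of the "stratified argmax" policy named under Training details. It assumes the timeline is split into k contiguous, near-equal segments and the top-scoring frame of each is kept. The function name `stratified_argmax` and the example scores are hypothetical and are not taken from the PEEK codebase.

```python
from typing import List, Sequence


def stratified_argmax(scores: Sequence[float], k: int) -> List[int]:
    """Pick one frame index per contiguous temporal stratum (hypothetical helper).

    Assumption: "stratified argmax" means splitting the frame sequence into
    k near-equal contiguous segments and keeping the best frame of each.
    """
    n = len(scores)
    if k <= 0 or n == 0:
        return []
    k = min(k, n)  # cannot select more frames than exist
    selected = []
    for s in range(k):
        # Integer boundaries of stratum s; each stratum is non-empty when k <= n.
        start = (s * n) // k
        end = ((s + 1) * n) // k
        best = max(range(start, end), key=lambda i: scores[i])
        selected.append(best)
    # Strata are visited in order, so indices are already in temporal order.
    return selected


# Made-up relevance scores for 8 frames, with a budget of 4 frames.
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.7, 0.4, 0.6]
print(stratified_argmax(scores, k=4))  # prints [1, 2, 4, 7]
```

Compared with a plain top-k over all scores, a stratified policy of this kind guarantees temporal coverage: even if the highest scores cluster in one part of the video, every segment contributes one frame.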
Usage
Install the (Apache-licensed) code, then let it pull these weights automatically:
```bash
pip install git+https://github.com/momentslab/peek
```
```python
from pathlib import Path

from peek.inference import load_peek_pipeline, select_frames_from_video

# Downloads peek_base.safetensors from this repo on first use.
encoder, scorer, device = load_peek_pipeline(device="cuda")

output = select_frames_from_video(
    Path("video.mp4"),
    encoder=encoder, scorer=scorer, device=device,
    k=4, fps=2.0,
)
print(output.selected_indices, output.selected_timestamps_sec)
```
Or from the CLI:
```bash
python scripts/infer.py path/to/video.mp4 --k 4
```
Files
| File | Description |
|---|---|
| peek_base.safetensors | PEEK scorer weights (model-only, ~7 MB). |
| config.json | Architecture and training metadata. |
| LICENSE | CC-BY-NC-SA-4.0 legal text. |
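
As a rough consistency check on the figures quoted in this card, the snippet below relates the ~1.7M parameter count to the ~7 MB file size. It assumes the weights are stored as 32-bit floats, which the card does not state, and it ignores the small safetensors header.

```python
# Back-of-the-envelope size estimate; float32 storage is an assumption.
num_params = 1_700_000   # "~1.7M parameters" from the model description
bytes_per_param = 4      # assumed float32
size_mb = num_params * bytes_per_param / 1e6
print(f"{size_mb:.1f} MB")  # prints 6.8 MB
```

Under that assumption the estimate is in line with the ~7 MB listed for peek_base.safetensors.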
Training details
- Train data: ActivityNet Captions train.json segments only.
- Encoder: MobileCLIP2-S0 (frozen), 512-d features per frame, via open_clip.
- Teacher: SigLIP2 SO400M patch14 384 (frozen).
- Loss: ListMLE on min-max normalized teacher cosine similarities.
- Inference selection policy: stratified argmax.
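
The loss entry above can be illustrated with a small sketch of the standard ListMLE objective applied to min-max normalized teacher similarities. This is not the repository's training code. The helper names, the epsilon, and the example numbers are hypothetical, and plain Python is used for readability where a real training loop would presumably be batched in a tensor library.

```python
import math
from typing import List, Sequence


def min_max_normalize(values: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Rescale values to roughly [0, 1]; eps guards against a constant input."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


def list_mle_loss(student_scores: Sequence[float],
                  teacher_targets: Sequence[float]) -> float:
    """Negative Plackett-Luce log-likelihood of the teacher's ranking."""
    # Teacher ranking: frame indices sorted by target value, best first.
    order = sorted(range(len(teacher_targets)),
                   key=lambda i: teacher_targets[i], reverse=True)
    ranked = [student_scores[i] for i in order]
    loss = 0.0
    for i in range(len(ranked)):
        tail = ranked[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_sum_exp - ranked[i]
    return loss


# Made-up teacher cosine similarities for three frames of one segment.
teacher_sims = [0.21, 0.35, 0.28]
targets = min_max_normalize(teacher_sims)

agreeing = list_mle_loss([0.0, 2.0, 1.0], targets)     # same ordering as teacher
disagreeing = list_mle_loss([2.0, 0.0, 1.0], targets)  # reversed ordering
print(agreeing < disagreeing)  # prints True
```

In this textbook form only the ordering of the teacher targets enters the loss, so min-max normalization leaves the value unchanged. How the normalized similarities are used beyond ranking in the actual training recipe is not specified in this card.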
Citation
@inproceedings{steunou2026peek,
title={{PEEK}: Picking Essential frames via Efficient Knowledge distillation},
author={Steunou, Killian and Filali Razzouki, Anas and Guetari, Khalil and El-Yacoubi, Moun\^im A. and Tevissen, Yannis},
year={2026}
}