PEEK: Picking Essential frames via Efficient Knowledge distillation

This repository hosts the pretrained weights for PEEK: Picking Essential frames via Efficient Knowledge distillation, a query-free frame selector for low-budget video captioning.

Code (Apache-2.0): https://github.com/momentslab/peek
Weights (this repo, CC-BY-NC-SA-4.0): peek_base.safetensors

What this is

PEEK receives only video frames at inference time; it needs no text query or caption. Given a budget of k frames, it predicts a relevance score for each frame and returns the selected frames in temporal order, ready to be passed to a downstream video-language model.

The released peek_base checkpoint is a 2-layer Transformer (~1.7M parameters) trained on ActivityNet Captions train.json by distilling a frozen, caption-conditioned SigLIP2 teacher into a query-free scorer over frozen MobileCLIP2-S0 frame embeddings.
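
The selection step described above (score every frame, keep k, return them in temporal order) can be illustrated with a small, dependency-free sketch. It assumes that the "stratified argmax" policy listed under Training details means splitting the timeline into k contiguous strata and keeping the highest-scoring frame in each; the function name `stratified_argmax` is hypothetical and the packaged implementation may differ.

```python
from typing import List, Sequence


def stratified_argmax(scores: Sequence[float], k: int) -> List[int]:
    """Pick one frame index per temporal stratum (the highest-scoring one).

    Assumption: strata are k contiguous, near-equal slices of the timeline.
    """
    n = len(scores)
    if k <= 0 or n == 0:
        return []
    k = min(k, n)  # cannot select more frames than exist
    selected = []
    for s in range(k):
        # Stratum s covers indices [start, end); with k <= n it is never empty.
        start = (s * n) // k
        end = ((s + 1) * n) // k
        best = max(range(start, end), key=lambda i: scores[i])
        selected.append(best)
    # Strata are visited left to right, so indices are already in temporal order.
    return selected


scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.7, 0.05, 0.6]
print(stratified_argmax(scores, k=4))  # [1, 2, 4, 7]
```

Compared with a plain top-k over all scores, a per-stratum choice like this guarantees temporal coverage: no two selected frames come from the same slice of the video, even when the highest scores cluster in one scene.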

Usage

Install the (Apache-licensed) code, then let it pull these weights automatically:

pip install git+https://github.com/momentslab/peek

from pathlib import Path
from peek.inference import load_peek_pipeline, select_frames_from_video

# Downloads peek_base.safetensors from this repo on first use.
encoder, scorer, device = load_peek_pipeline(device="cuda")

output = select_frames_from_video(
    Path("video.mp4"),
    encoder=encoder, scorer=scorer, device=device,
    k=4, fps=2.0,
)
print(output.selected_indices, output.selected_timestamps_sec)

Or from the CLI:

python scripts/infer.py path/to/video.mp4 --k 4

Files

File	Description
`peek_base.safetensors`	PEEK scorer weights (model-only, ~7 MB).
`config.json`	Architecture and training metadata.
`LICENSE`	CC-BY-NC-SA-4.0 legal text.

Training details

Train data: ActivityNet Captions train.json segments only.
Encoder: MobileCLIP2-S0 (frozen), 512-d features per frame, via open_clip.
Teacher: SigLIP2 SO400M patch14 384 (frozen).
Loss: ListMLE on min-max normalized teacher cosine similarities.
Inference selection policy: stratified argmax.
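
As a rough illustration of the loss line above, here is a minimal pure-Python sketch of the standard ListMLE objective applied to min-max normalized teacher similarities. The function names are hypothetical and this is not the training code from the repository. Note that plain ListMLE depends only on the ranking induced by the targets, so in this sketch the normalization does not change the loss value; the actual training code may use the normalized values in a way not shown here.

```python
import math
from typing import List, Sequence


def min_max_normalize(values: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Rescale values to roughly [0, 1]; eps guards against a constant input."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


def list_mle_loss(student_scores: Sequence[float],
                  teacher_targets: Sequence[float]) -> float:
    """Negative log-likelihood of the teacher ranking under a Plackett-Luce
    model parameterized by the student scores (standard ListMLE)."""
    # Frame indices sorted from most to least relevant according to the teacher.
    order = sorted(range(len(teacher_targets)),
                   key=lambda i: teacher_targets[i], reverse=True)
    ranked = [student_scores[i] for i in order]
    loss = 0.0
    for i in range(len(ranked)):
        tail = ranked[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_norm - ranked[i]
    return loss


teacher_cosine = [0.21, 0.35, 0.28]          # made-up teacher similarities
targets = min_max_normalize(teacher_cosine)
agrees = list_mle_loss([0.0, 2.0, 1.0], targets)     # same ordering as teacher
disagrees = list_mle_loss([2.0, 0.0, 1.0], targets)  # reversed ordering
print(agrees < disagrees)  # True
```

The loss is smallest when the student ranks frames in the same order as the teacher, which is the property a query-free scorer needs in order to imitate a caption-conditioned teacher.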

Citation

@inproceedings{steunou2026peek,
  title={{PEEK}: Picking Essential frames via Efficient Knowledge distillation},
  author={Steunou, Killian and Filali Razzouki, Anas and Guetari, Khalil and El-Yacoubi, Moun\^im A. and Tevissen, Yannis},
  year={2026}
}
