EgoBlind-RA — CLIP Urgency Classifier

Binary urgency classifier for egocentric BLV (blind / low-vision) visual assistance queries. Predicts whether a (video, question) pair is urgent (safety-critical, demands a fast concise response) or non-urgent.

Component of the EgoBlind-RA project.

Architecture

Backbone: CLIP ViT-B/32 (OpenAI weights), frozen
Input: 4 frames uniformly sampled from a ±2-second window centered at the query timestamp, plus the question text
Frame embeddings are mean-pooled and concatenated with the CLIP text embedding
Two-layer MLP head outputs a binary urgency score
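
The frame sampling described above can be sketched as follows. This is a minimal illustration rather than the project's code: the function name `sample_frame_times`, the inclusion of both window endpoints, and the clamping to the clip boundaries are all assumptions made for the example.

```python
def sample_frame_times(query_t, num_frames=4, half_window=2.0, duration=None):
    """Return evenly spaced timestamps (seconds) in [query_t - half_window, query_t + half_window].

    Assumptions (not confirmed by the model card): both endpoints are included,
    and timestamps are clamped to [0, duration] when they fall outside the clip.
    """
    start = query_t - half_window
    end = query_t + half_window
    if num_frames == 1:
        times = [query_t]
    else:
        step = (end - start) / (num_frames - 1)
        times = [start + i * step for i in range(num_frames)]
    upper = duration if duration is not None else float("inf")
    return [min(max(t, 0.0), upper) for t in times]


# Query asked 10 seconds into the clip: four timestamps spanning 8 s to 12 s.
times = sample_frame_times(10.0)
print(times)
```

Each timestamp would then be mapped to the nearest video frame before being passed to the CLIP image encoder.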

Only the MLP head is trained; the CLIP backbone is frozen throughout.
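
As a rough sketch of how the fusion and head might look in PyTorch: the class name `UrgencyHead`, the hidden width of 256, the ReLU activation, and the choice to emit a raw logit are assumptions for illustration. The 512-dimensional embedding size is the standard projection size of CLIP ViT-B/32. The authoritative definition is the one in the project notebook, so the released checkpoint is not guaranteed to load into this class.

```python
import torch
import torch.nn as nn


class UrgencyHead(nn.Module):
    """Hypothetical two-layer MLP over [mean-pooled frame embedding ; text embedding]."""

    def __init__(self, embed_dim=512, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),  # fused image + text features
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),              # single urgency logit
        )

    def forward(self, frame_emb, text_emb):
        # frame_emb: (batch, num_frames, embed_dim), text_emb: (batch, embed_dim)
        pooled = frame_emb.mean(dim=1)                  # mean-pool over frames
        fused = torch.cat([pooled, text_emb], dim=-1)   # (batch, 2 * embed_dim)
        return self.mlp(fused).squeeze(-1)              # (batch,) logits


head = UrgencyHead()
# Random tensors stand in for frozen CLIP embeddings of 4 frames and 1 question.
logits = head(torch.randn(3, 4, 512), torch.randn(3, 512))
probs = torch.sigmoid(logits)  # urgency probability per example
```

Because the backbone is frozen, the CLIP embeddings can be computed under `torch.no_grad()` and only `head.parameters()` need to be passed to the optimizer.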

Training

Dataset: EgoBlind, with urgency labels generated by GPT-5.2 on 5 frames per clip
Loss: binary cross-entropy
Optimizer: AdamW, lr = 1e-4
Epochs: 5
Hardware: NVIDIA L40S GPU
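
The settings listed above might translate into a loop like the one below. It is a sketch under several assumptions: the head is a stand-in `nn.Sequential` with hypothetical layer sizes, random tensors replace precomputed CLIP features and urgency labels, the head is assumed to emit logits (hence `BCEWithLogitsLoss`, which applies binary cross-entropy to logits), and each epoch is a single full-batch step for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in head: 1024 = 512 (pooled frames) + 512 (text); hidden size is hypothetical.
head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)  # as listed in the card
criterion = nn.BCEWithLogitsLoss()                         # binary cross-entropy on logits

# Placeholder data: fused CLIP features and binary urgency labels.
features = torch.randn(32, 1024)
labels = torch.randint(0, 2, (32,)).float()

losses = []
for epoch in range(5):  # 5 epochs, as listed in the card
    optimizer.zero_grad()
    logits = head(features).squeeze(-1)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

In a real run the inner step would iterate over mini-batches from a `DataLoader` rather than one fixed batch.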

Performance

Metric	Validation	Test
Accuracy	0.863	0.798
Precision	0.930	0.879
Recall	0.807	0.695
F1	0.864	0.777
ROC-AUC	0.938	0.905
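
As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the rows above. The helper name `f1_score` is just for this example. Because the table values are rounded to three decimals, the recomputed test F1 can differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


val_f1 = f1_score(0.930, 0.807)   # validation column
test_f1 = f1_score(0.879, 0.695)  # test column
print(f"validation F1 = {val_f1:.3f}, test F1 = {test_f1:.3f}")
```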

Usage

import torch
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="julia225/egoblind-ra-clip-urgency",
    filename="final_model.pt",
)
state = torch.load(ckpt, map_location="cpu")
# `state` holds the downloaded checkpoint. Load it into the MLP head defined in
# https://github.com/juliavekim/EgoBlind-RA/blob/main/models/clip_urgency_classifier.ipynb
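
Since the head class lives in a notebook, it can help to inspect parameter names and shapes before calling `load_state_dict`. The helper below is a hypothetical convenience, and it assumes the checkpoint is a flat state dict of tensors, which the model card does not confirm. A dummy layer stands in for the real checkpoint here.

```python
import torch


def summarize_state_dict(state):
    """Map each tensor entry in a state dict to its shape (non-tensor entries are skipped)."""
    return {name: tuple(value.shape) for name, value in state.items() if torch.is_tensor(value)}


# Stand-in for the downloaded checkpoint: the state dict of a single linear layer.
dummy_state = torch.nn.Linear(1024, 1).state_dict()
summary = summarize_state_dict(dummy_state)
print(summary)
```

Running the same helper on the real `state` object would show which layer names and sizes the head definition has to match.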

Citation

If you use this classifier, please cite the EgoBlind-RA project:

@misc{kim2026egoblindra,
  title  = {EgoBlind-RA: Towards Safer Egocentric Assistive AI for Blind Users via Risk-Adaptive Routing},
  author = {Kim, Julia and Backus, Xander},
  year   = {2026},
  url    = {https://github.com/juliavekim/EgoBlind-RA},
}
