EgoBlind-RA: CLIP Urgency Classifier
Binary urgency classifier for egocentric BLV (blind / low-vision) visual assistance queries. Predicts whether a (video, question) pair is urgent (safety-critical, demands a fast concise response) or non-urgent.
Component of the EgoBlind-RA project.
Architecture
- Backbone: CLIP ViT-B/32 (OpenAI weights), frozen
- Input: 4 frames uniformly sampled from a ±2-second window centered at the query timestamp, plus the question text
- Frame embeddings are mean-pooled and concatenated with the CLIP text embedding
- Two-layer MLP head outputs a binary urgency score
Only the MLP head is trained; the CLIP backbone is frozen throughout.
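The frame-sampling step can be sketched as below. This is a minimal illustration, not the project's code: the function name `sample_frame_times`, the choice to include both window endpoints, and the clamping to the video bounds are all assumptions.

```python
def sample_frame_times(query_t, num_frames=4, half_window=2.0, duration=None):
    """Return evenly spaced timestamps (seconds) around a query time.

    Assumption: "uniformly sampled" means evenly spaced with both window
    endpoints included, and the window is clamped to [0, duration].
    """
    start = max(0.0, query_t - half_window)
    end = query_t + half_window
    if duration is not None:
        end = min(end, duration)
    if num_frames == 1:
        return [(start + end) / 2]
    step = (end - start) / (num_frames - 1)
    return [start + i * step for i in range(num_frames)]


# Query at 10 s in a long video: four timestamps spanning 8 s to 12 s.
times = sample_frame_times(10.0)
# Query near the start of the video: the window is clamped at 0 s.
clamped = sample_frame_times(1.0)
```

Each timestamp would then be decoded to a frame and passed through the frozen CLIP image encoder.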
Training
- Dataset: EgoBlind, with urgency labels generated by GPT-5.2 on 5 frames per clip
- Loss: binary cross-entropy
- Optimizer: AdamW, lr = 1e-4
- 5 epochs, NVIDIA L40S GPU
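A rough sketch of how such a head could be trained on precomputed CLIP embeddings follows. The class name `UrgencyHead`, the hidden width of 256, the ReLU activation, and the use of `BCEWithLogitsLoss` on raw logits are assumptions for illustration. The 512-dimensional embedding size is that of CLIP ViT-B/32. Random tensors stand in for real embeddings and labels, and each epoch is a single full-batch step for brevity, where a real run would iterate over mini-batches.

```python
import torch
from torch import nn

EMBED_DIM = 512   # CLIP ViT-B/32 image and text embedding size
HIDDEN_DIM = 256  # assumption: the hidden width is not stated in this card


class UrgencyHead(nn.Module):
    """Hypothetical two-layer MLP over [mean-pooled frames ; text]."""

    def __init__(self, embed_dim=EMBED_DIM, hidden_dim=HIDDEN_DIM):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frame_emb, text_emb):
        # frame_emb: (batch, num_frames, embed_dim); text_emb: (batch, embed_dim)
        pooled = frame_emb.mean(dim=1)                 # mean-pool over frames
        fused = torch.cat([pooled, text_emb], dim=-1)  # (batch, 2 * embed_dim)
        return self.mlp(fused).squeeze(-1)             # one logit per example


def train_head(head, frame_emb, text_emb, labels, epochs=5, lr=1e-4):
    """Train only the head; the embeddings come from the frozen backbone."""
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy on logits
    head.train()
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(head(frame_emb, text_emb), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


torch.manual_seed(0)
frames = torch.randn(8, 4, EMBED_DIM)        # stand-in for CLIP frame embeddings
text = torch.randn(8, EMBED_DIM)             # stand-in for CLIP text embeddings
labels = torch.randint(0, 2, (8,)).float()   # stand-in urgency labels
head = UrgencyHead()
losses = train_head(head, frames, text, labels)
```

Because the backbone is frozen, embeddings can be computed once and cached, so only the small head needs gradients.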
Performance
| Metric | Validation | Test |
|---|---|---|
| Accuracy | 0.863 | 0.798 |
| Precision | 0.930 | 0.879 |
| Recall | 0.807 | 0.695 |
| F1 | 0.864 | 0.777 |
| ROC-AUC | 0.938 | 0.905 |
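As a quick consistency check, F1 can be recomputed from the precision and recall values in the table using F1 = 2PR / (P + R). The helper name `f1_score` is illustrative. The inputs are the rounded values from the table, so small differences in the third decimal are expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "validation": (0.930, 0.807, 0.864),
    "test": (0.879, 0.695, 0.777),
}

for split, (p, r, f1) in reported.items():
    print(f"{split}: recomputed F1 = {f1_score(p, r):.4f}, reported = {f1:.3f}")
```

Both recomputed values agree with the reported F1 to within 0.001, which is consistent with rounding of the inputs.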
Usage
```python
import torch
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="julia225/egoblind-ra-clip-urgency",
    filename="final_model.pt",
)
state = torch.load(ckpt, map_location="cpu")
# Load into the MLP head defined in
# https://github.com/juliavekim/EgoBlind-RA/blob/main/models/clip_urgency_classifier.ipynb
```
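The checkpoint layout is not documented in this card, so it may help to inspect the loaded object before wiring it into a head. The helper below is a hypothetical convenience, not part of the project, and it assumes `final_model.pt` holds a dictionary (for example a `state_dict`) whose tensor entries expose a `.shape` attribute.

```python
def describe_checkpoint(state):
    """Map each entry of a checkpoint dict to its shape, or its type name."""
    summary = {}
    for name, value in state.items():
        shape = getattr(value, "shape", None)
        # Tensor-like entries report their shape; anything else its type.
        summary[name] = tuple(shape) if shape is not None else type(value).__name__
    return summary


class _FakeTensor:
    """Stand-in for a tensor so the example runs without a download."""

    def __init__(self, *shape):
        self.shape = shape


# Hypothetical keys, for illustration only; the real names may differ.
example = {"mlp.0.weight": _FakeTensor(256, 1024), "epoch": 5}
summary = describe_checkpoint(example)
```

With the real file, calling `describe_checkpoint(state)` on the object returned by `torch.load` shows the parameter names and shapes that the head definition in the notebook needs to match.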
Citation
If you use this classifier, please cite the EgoBlind-RA project:

```bibtex
@misc{kim2026egoblindra,
  title  = {EgoBlind-RA: Towards Safer Egocentric Assistive AI for Blind Users via Risk-Adaptive Routing},
  author = {Kim, Julia and Backus, Xander},
  year   = {2026},
  url    = {https://github.com/juliavekim/EgoBlind-RA},
}
```