Enhance model card with metadata, links, and overview
This PR significantly enhances the model card by:
* Adding `pipeline_tag: robotics`, `library_name: lerobot`, and `license: apache-2.0` to the metadata, improving discoverability and providing crucial context.
* Including relevant `tags` such as `robotics`, `vision`, `gaze`, and `foveated-vision`.
* Listing associated Hugging Face `datasets` for a complete overview of the project's ecosystem.
* Providing direct links to the paper ([https://huggingface.co/papers/2507.15833](https://huggingface.co/papers/2507.15833)), project page ([https://ian-chuang.github.io/gaze-av-aloha/](https://ian-chuang.github.io/gaze-av-aloha/)), and GitHub repository ([https://github.com/ian-chuang/gaze-av-aloha](https://github.com/ian-chuang/gaze-av-aloha)).
* Adding a comprehensive overview of the model's capabilities, key highlights, and a visual (`hero.gif`).
* Including a basic Python code snippet for loading the model and guiding users to the GitHub repository for detailed usage and training instructions.
* Adding the academic citation for proper attribution.
This update provides users with a much richer and more informative resource on the Hugging Face Hub.

The diff (`@@ -1,10 +1,84 @@`) replaces the previous minimal card, whose metadata listed only the `model_hub_mixin` and `pytorch_model_hub_mixin` tags, with the following `README.md`:

---
license: apache-2.0
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- robotics
- vision
- gaze
- foveated-vision
pipeline_tag: robotics
library_name: lerobot
datasets:
- iantc104/av_aloha_sim_peg_insertion
- iantc104/av_aloha_sim_cube_transfer
- iantc104/av_aloha_sim_thread_needle
- iantc104/av_aloha_sim_pour_test_tube
- iantc104/av_aloha_sim_hook_package
- iantc104/av_aloha_sim_slot_insertion
---

# Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

This model is part of the work presented in the paper **"Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers"**. The research explores how incorporating human-like active gaze into robotic policies can significantly improve both efficiency and performance. It builds on recent advances in foveated image processing and applies them to an Active Vision robot system that emulates human head and eye movement.

The framework integrates gaze information into Vision Transformers (ViTs) using a novel foveated patch tokenization scheme that allocates fine-grained patches near the gaze point and coarser patches in the periphery. The paper reports a 94% reduction in ViT computation, a 7x training speedup, and a 3x inference speedup without sacrificing visual fidelity near regions of interest, along with improved performance on high-precision tasks and greater robustness to unseen distractors.

<p align="center">
<img src="https://github.com/ian-chuang/gaze-av-aloha/raw/main/media/hero.gif" alt="Hero GIF" width="700">
</p>

* 📚 **Paper:** [Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers](https://huggingface.co/papers/2507.15833)
* 🌐 **Project Page:** [https://ian-chuang.github.io/gaze-av-aloha/](https://ian-chuang.github.io/gaze-av-aloha/)
* 💻 **Code:** [https://github.com/ian-chuang/gaze-av-aloha](https://github.com/ian-chuang/gaze-av-aloha)

## About the Model

This repository contains a pretrained gaze model (e.g., `iantc104/gaze_model_av_aloha_sim_thread_needle`), a core component of the "Look, Focus, Act" framework. These gaze models are trained to predict where a human would look in the scene; the predicted gaze then guides foveation and action in downstream robot policies. The project also provides a framework for simultaneously collecting eye-tracking data and robot demonstrations, along with a simulation benchmark and datasets for training robot policies that incorporate human gaze.
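
To make the foveation idea concrete, below is a minimal, self-contained sketch of gaze-centered multi-resolution patch tokenization. It is illustrative only: the function name, crop sizes, number of pyramid levels, and patch size are assumptions made for this example, and the actual tokenization used in the paper and repository may differ.

```python
import torch
import torch.nn.functional as F

def foveated_patches(image, gaze_xy, crop_sizes=(64, 128, 256), out_size=64, patch_size=16):
    """Tokenize an image into ViT patches sampled densely at the gaze point.

    image:   (C, H, W) float tensor
    gaze_xy: (2,) gaze location in normalized [0, 1] (x, y) image coordinates
    Returns: (num_patches, C * patch_size * patch_size) patch tokens
    """
    c, h, w = image.shape
    cx, cy = int(gaze_xy[0] * w), int(gaze_xy[1] * h)
    tokens = []
    for crop in crop_sizes:
        half = crop // 2
        # Clamp the crop window so it stays inside the image.
        x0 = max(0, min(w - crop, cx - half))
        y0 = max(0, min(h - crop, cy - half))
        window = image[:, y0:y0 + crop, x0:x0 + crop].unsqueeze(0)
        # Wider (peripheral) crops are downsampled to the same output size,
        # so they contribute the same number of tokens at lower resolution.
        window = F.interpolate(window, size=(out_size, out_size),
                               mode="bilinear", align_corners=False)
        # Split into non-overlapping patches and flatten each patch.
        patches = F.unfold(window, kernel_size=patch_size, stride=patch_size)  # (1, C*p*p, N)
        tokens.append(patches.squeeze(0).transpose(0, 1))                      # (N, C*p*p)
    return torch.cat(tokens, dim=0)

# 3 levels of 64x64 crops -> 3 * (64/16)^2 = 48 tokens, versus (256/16)^2 = 256
# tokens for a plain ViT over the full 256x256 image.
image = torch.rand(3, 256, 256)
gaze = torch.tensor([0.5, 0.5])
print(foveated_patches(image, gaze).shape)  # torch.Size([48, 768])
```

Each level contributes the same number of tokens, so the total token count depends only on the number of levels, the output size, and the patch size rather than the full image resolution; fewer tokens per image is what drives the kind of ViT compute savings described above.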

## Associated Datasets

The model leverages the following AV-ALOHA simulation datasets with human eye-tracking annotations, which are available on the Hugging Face Hub:

* [AV ALOHA Sim Peg Insertion](https://huggingface.co/datasets/iantc104/av_aloha_sim_peg_insertion)
* [AV ALOHA Sim Cube Transfer](https://huggingface.co/datasets/iantc104/av_aloha_sim_cube_transfer)
* [AV ALOHA Sim Thread Needle](https://huggingface.co/datasets/iantc104/av_aloha_sim_thread_needle)
* [AV ALOHA Sim Pour Test Tube](https://huggingface.co/datasets/iantc104/av_aloha_sim_pour_test_tube)
* [AV ALOHA Sim Hook Package](https://huggingface.co/datasets/iantc104/av_aloha_sim_hook_package)
* [AV ALOHA Sim Slot Insertion](https://huggingface.co/datasets/iantc104/av_aloha_sim_slot_insertion)
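
Since the project builds on `lerobot` (`library_name: lerobot`), one way to inspect these datasets is with `lerobot`'s `LeRobotDataset`. The sketch below is written under that assumption; the import path has moved between `lerobot` releases, and the per-frame keys (camera images, robot state, gaze annotations) should be checked against each dataset card or the GitHub repository.

```python
# Minimal sketch: browsing one of the AV-ALOHA datasets with lerobot.
# The import path assumes a recent lerobot release and may differ between
# versions; feature names vary per dataset, so inspect them before use.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("iantc104/av_aloha_sim_thread_needle")

print(dataset)           # episode/frame counts and other metadata
print(dataset.features)  # per-frame keys (names vary per dataset)

frame = dataset[0]       # one frame as a dict of tensors
for key, value in frame.items():
    print(key, getattr(value, "shape", value))
```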

## Usage

This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration. Load it by calling `from_pretrained` on the gaze model class from the [gaze-av-aloha](https://github.com/ian-chuang/gaze-av-aloha) repository (which subclasses the mixin) rather than on the mixin itself; the class name below is a placeholder for that class:

```python
# GazeModel is a placeholder for the gaze model class defined in the
# gaze-av-aloha repository (it subclasses PyTorchModelHubMixin); see
# https://github.com/ian-chuang/gaze-av-aloha for the actual class and import.
from gaze_av_aloha import GazeModel

# Load a pretrained gaze model, e.g. for the 'thread_needle' task.
# Replace the repo ID with the checkpoint you want to load.
model = GazeModel.from_pretrained("iantc104/gaze_model_av_aloha_sim_thread_needle")
model.eval()

# Details on integrating this model into a robotics pipeline for training,
# evaluation, or data collection are in the project's GitHub repository:
# https://github.com/ian-chuang/gaze-av-aloha
```
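
Once loaded, inference follows the usual PyTorch pattern. The snippet below is purely illustrative: the expected input (camera keys, resolution, normalization) and the output format of the gaze model are defined by the class in the repository, so the shapes here are assumptions.

```python
import torch

# Illustrative only: real input/output formats are defined by the gaze model
# class in the gaze-av-aloha repository; the shapes below are assumptions.
image = torch.rand(1, 3, 240, 320)  # assumed (B, C, H, W) camera frame in [0, 1]

with torch.no_grad():
    predicted_gaze = model(image)   # assumed (B, 2) normalized (x, y) gaze point

print(predicted_gaze)
```

The predicted gaze point is what centers the foveated processing for the downstream policy, as described above.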

For detailed installation instructions, dataset preparation, training scripts, and policy evaluation on the AV-ALOHA benchmark, please refer to the documentation in the [project's GitHub repository](https://github.com/ian-chuang/gaze-av-aloha).

## Citation

If you find this work helpful or inspiring, please consider citing the paper:

```bibtex
@misc{chuang2025lookfocusactefficient,
      title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
      author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
      year={2025},
      eprint={2507.15833},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.15833},
}
```