Add model card
README.md ADDED
@@ -0,0 +1,90 @@
---
license: mit
library_name: pytorch
tags:
- robotics
- libero
- vision-language-action
- imitation-learning
- manipulation
datasets:
- gate-institute/GATE-VLAP-datasets
---

# GATE-VLAP: Grounded Action Trajectory Embeddings with Vision-Language Action Planning

**Trained on the LIBERO-10 benchmark**

This model is trained for long-horizon robotic manipulation using vision-language-action learning with semantic action chunking, in which demonstrations are segmented into semantically coherent sub-trajectories.

## Model Details

- **Architecture**: CLIP-RT (CLIP-based Robot Transformer)
- **Training Dataset**: [GATE-VLAP LIBERO-10](https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets)
- **Training Epochs**: 90
- **Task Type**: Long-horizon robotic manipulation
- **Input**: RGB images (128×128) + language instructions
- **Output**: 7-DOF actions (xyz, rpy, gripper); see the I/O sketch below
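
The preprocessing pipeline and model interface are not documented in this card, so the snippet below only illustrates the input/output contract listed above. The tensor layout (channels-first, batch of 1) and the action ordering are assumptions made for illustration.

```python
import torch

# Illustrative I/O contract only; the layout and ordering below are
# assumptions, not a documented API of the released checkpoint.
image = torch.zeros(1, 3, 128, 128)              # one 128x128 RGB observation
instruction = "put both moka pots on the stove"  # free-form language goal

# Expected policy output: a single 7-DOF action
action = torch.zeros(1, 7)
delta_xyz = action[0, 0:3]   # end-effector translation (x, y, z)
delta_rpy = action[0, 3:6]   # end-effector rotation (roll, pitch, yaw)
gripper = action[0, 6]       # gripper open/close command
```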

## Training Details

- **Dataset**: LIBERO-10 (29 subtasks, 1,354 demonstrations)
- **Segmentation**: Semantic action chunking using Gemini Vision API
- **Framework**: PyTorch
- **Checkpoint**: Epoch 90

## Usage

```python
import torch

# Load the checkpoint
checkpoint = torch.load(
    "checkpoints/libero_10_fixed_training_v1/epoch_90.pt",
    map_location="cuda",
)

# Extract the model weights
model_state = checkpoint['model_state_dict']

# Inference code is not yet included; see the sketch below for one possible interface.
```
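
This card does not yet ship inference code. The following is a minimal, hypothetical sketch of how the weights above could be wrapped for closed-loop control; `CLIPRTPolicy`, its constructor, and `predict_action` are placeholders invented for illustration and should be replaced with the actual model class from the training code once it is released.

```python
import torch

class CLIPRTPolicy(torch.nn.Module):
    """Placeholder wrapper around the released CLIP-RT weights (assumed interface)."""

    def __init__(self, state_dict):
        super().__init__()
        # Build the CLIP-RT backbone here, then load the released weights:
        # self.load_state_dict(state_dict)

    @torch.no_grad()
    def predict_action(self, image, instruction):
        # Should return a 7-DOF action: [dx, dy, dz, droll, dpitch, dyaw, gripper]
        raise NotImplementedError("replace with the real CLIP-RT forward pass")


policy = CLIPRTPolicy(model_state)

# Hypothetical rollout step (the observation source is environment-specific):
# image = torch.zeros(1, 3, 128, 128)   # current RGB observation
# action = policy.predict_action(image, "put both moka pots on the stove")
```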

## Performance

Training run: `libero_10_fixed_training_v1`

*Evaluation metrics for this checkpoint have not been reported yet.*

## Dataset

This model was trained on the [GATE-VLAP Datasets](https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets), which includes:
- LIBERO-10: 103,650 frames across 29 subtasks
- Semantic action segmentation
- Vision-language annotations
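
As one way to obtain the data locally, the dataset repository linked above can be downloaded with `huggingface_hub`; this assumes only that the files live in that repo, not any particular on-disk layout.

```python
from huggingface_hub import snapshot_download

# Download the raw GATE-VLAP dataset files from the Hugging Face Hub.
# The on-disk layout is not described in this card, so inspect the
# downloaded directory for the LIBERO-10 frames and annotations.
local_dir = snapshot_download(
    repo_id="gate-institute/GATE-VLAP-datasets",
    repo_type="dataset",
)
print(local_dir)
```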

## Citation

```bibtex
@article{gateVLAP2024,
  title={GATE-VLAP: Grounded Action Trajectory Embeddings with Vision-Language Action Planning},
  author={[Your Name]},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
```

## Maintainer

**GATE Institute** - Advanced AI Research Group, Sofia, Bulgaria

## Links

- 🤗 **Dataset**: [gate-institute/GATE-VLAP-datasets](https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets)
- 📄 **Paper**: *Coming soon*
- 💻 **Code**: *Coming soon*

## License

Released under the MIT License.