Upload README.md with huggingface_hub
README.md (ADDED)

---
license: mit
task_categories:
- video-classification
- reinforcement-learning
- robotics
language:
- en
tags:
- games
- maze
- sokoban
- 3d-navigation
- multimodal
- video
- planning
size_categories:
- 10K<n<100K
---

# VR-Bench: A Multimodal Video Reasoning Benchmark

## Dataset Description

VR-Bench is a multimodal dataset of video demonstrations of game-playing across five game types: classic 2D mazes, irregular mazes, 3D mazes, Sokoban puzzles, and trap fields. It is designed for training and evaluating AI models on visual reasoning, planning, and sequential decision-making tasks.

## Dataset Structure

The dataset is organized into three main directories (a traversal sketch follows at the end of this section):

- `train_data/`: Training data with subdirectories for each game type and difficulty level
- `test_data/`: Test data with the same structure as the training data
- `test_data_merge/`: Merged test data organized by game type (without difficulty separation)

### Game Types

1. **Maze**: Classic 2D maze navigation
2. **Irregular Maze**: Non-standard maze layouts
3. **Maze3D**: Three-dimensional maze navigation
4. **Sokoban**: Box-pushing puzzle game
5. **Trapfield**: Navigation with obstacles and traps

### Difficulty Levels

Each game type has three difficulty levels:

- `easy`: Simple layouts with shorter solution paths
- `medium`: Moderate complexity
- `hard`: Complex layouts requiring advanced planning

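As a quick orientation, here is a minimal Python sketch that walks a local copy of `train_data/` and counts samples per subdirectory via their `.json` metadata files. The local path is a placeholder, and no particular game-type/difficulty folder naming is assumed:

```python
from pathlib import Path

# Hypothetical local path to the training split; adjust to where the dataset lives.
root = Path("VR-Bench/train_data")

# Each sample contributes one .json metadata file, so counting them per
# subdirectory gives a quick overview of game type / difficulty coverage.
for subdir in sorted(p for p in root.rglob("*") if p.is_dir()):
    n_samples = len(list(subdir.glob("*.json")))
    if n_samples:
        print(f"{subdir.relative_to(root)}: {n_samples} samples")
```
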
## File Format

Each data sample consists of:

- **Video file** (`.mp4`): Demonstration of gameplay
- **Image file** (`.png`): Initial state screenshot
- **JSON file** (`.json`): Game state metadata including:
  - Grid layout and dimensions
  - Entity positions (player, goal, boxes)
  - Bounding box information
  - Render parameters

### JSON Structure

```json
{
  "version": "1.0",
  "game_type": "maze",
  "entities": {
    "player": {
      "pixel_pos": {"x": 165, "y": 45},
      "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
      "grid_pos": {"row": 1, "col": 5}
    },
    "goal": {
      "pixel_pos": {"x": 105, "y": 165},
      "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
      "grid_pos": {"row": 5, "col": 3}
    }
  },
  "grid": {
    "data": [[1,1,1,...], [1,0,0,...], ...],
    "height": 7,
    "width": 7
  },
  "render": {
    "cell_size": 30,
    "image_width": 210,
    "image_height": 210
  }
}
```

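For illustration, a minimal Python sketch of reading one such metadata file; `sample.json` is a placeholder name, and the fields follow the maze example above (other game types may add entities such as boxes):

```python
import json

# Placeholder file name; any sample's .json metadata is read the same way.
with open("sample.json") as f:
    state = json.load(f)

grid = state["grid"]["data"]                      # 2D list, height x width
player = state["entities"]["player"]["grid_pos"]  # {"row": ..., "col": ...}
goal = state["entities"]["goal"]["grid_pos"]

print(f"game: {state['game_type']}, "
      f"grid: {state['grid']['height']}x{state['grid']['width']}")
print(f"player at ({player['row']}, {player['col']}), "
      f"goal at ({goal['row']}, {goal['col']})")
```
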
### Metadata CSV

Each subdirectory contains a `metadata.csv` file with columns:

- `video`: Video filename
- `prompt`: Associated text prompt (currently empty)
- `input_image`: Initial state image filename

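A minimal sketch of reading one of these index files with Python's standard `csv` module; the subdirectory path is a placeholder and the actual game/difficulty folder names may differ:

```python
import csv
from pathlib import Path

# Hypothetical subdirectory path; point this at any game/difficulty folder.
subdir = Path("VR-Bench/train_data/maze/easy")

with open(subdir / "metadata.csv", newline="") as f:
    for row in csv.DictReader(f):
        video_path = subdir / row["video"]        # gameplay video (.mp4)
        image_path = subdir / row["input_image"]  # initial state image (.png)
        prompt = row["prompt"]                    # currently empty
        print(video_path, image_path, repr(prompt))
```
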
## Usage

This dataset can be used for:

- **Visual Planning**: Learning to plan sequences of actions from visual input
- **Multimodal Learning**: Combining video, image, and structured data
- **Reinforcement Learning**: Training agents on game environments
- **Video Understanding**: Learning temporal patterns in sequential decision-making

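To work with the files locally, the repository can be downloaded with `huggingface_hub`; the repo id below keeps the `[username]` placeholder from the citation:

```python
from huggingface_hub import snapshot_download

# Replace the placeholder repo id with the actual dataset repository.
local_dir = snapshot_download(repo_id="[username]/VR-Bench", repo_type="dataset")
print("Downloaded to:", local_dir)
```
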
## Dataset Statistics

- **Game Types**: 5
- **Difficulty Levels**: 3 per game type
- **Data Splits**: Training and test sets
- **File Types**: Video (`.mp4`), Images (`.png`), Metadata (`.json`), Index (`.csv`)

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{vr_bench_2025,
  title={VR-Bench: A Multimodal Video Reasoning Benchmark},
  author={[Author Name]},
  year={2025},
  url={https://huggingface.co/datasets/[username]/VR-Bench}
}
```

## License

This dataset is released under the MIT License.