Upload README.md with huggingface_hub
README.md (ADDED)

---
license: mit
task_categories:
- video-classification
- reinforcement-learning
- robotics
language:
- en
tags:
- games
- maze
- sokoban
- 3d-navigation
- multimodal
- video
- planning
size_categories:
- 10K<n<100K
---

# VR-Bench: A Multimodal Video Reasoning Benchmark

## Dataset Description

VR-Bench is a multimodal dataset of video demonstrations of game-playing across five game types: classic 2D mazes, irregular mazes, 3D mazes, Sokoban puzzles, and trap fields. It is designed for training and evaluating AI models on visual reasoning, planning, and sequential decision-making tasks.

## Dataset Structure

The dataset is organized into three main directories (a traversal sketch follows at the end of this section):

- `train_data/`: Training data with subdirectories for each game type and difficulty level
- `test_data/`: Test data with the same structure as the training data
- `test_data_merge/`: Merged test data organized by game type (without difficulty separation)

### Game Types

1. **Maze**: Classic 2D maze navigation
2. **Irregular Maze**: Non-standard maze layouts
3. **Maze3D**: Three-dimensional maze navigation
4. **Sokoban**: Box-pushing puzzle game
5. **Trapfield**: Navigation with obstacles and traps

### Difficulty Levels

Each game type has three difficulty levels:

- `easy`: Simple layouts with shorter solution paths
- `medium`: Moderate complexity
- `hard`: Complex layouts requiring advanced planning

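As a quick orientation, here is a minimal Python sketch that walks a local copy of `train_data/` and counts samples per subdirectory via their `.json` metadata files. The local path is a placeholder, and no particular game-type/difficulty folder naming is assumed:

```python
from pathlib import Path

# Hypothetical local path to the training split; adjust to where the dataset lives.
root = Path("VR-Bench/train_data")

# Each sample contributes one .json metadata file, so counting them per
# subdirectory gives a quick overview of game type / difficulty coverage.
for subdir in sorted(p for p in root.rglob("*") if p.is_dir()):
    n_samples = len(list(subdir.glob("*.json")))
    if n_samples:
        print(f"{subdir.relative_to(root)}: {n_samples} samples")
```
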
## File Format

Each data sample consists of:

- **Video file** (`.mp4`): Demonstration of gameplay
- **Image file** (`.png`): Initial state screenshot
- **JSON file** (`.json`): Game state metadata including:
  - Grid layout and dimensions
  - Entity positions (player, goal, boxes)
  - Bounding box information
  - Render parameters

### JSON Structure

```json
{
  "version": "1.0",
  "game_type": "maze",
  "entities": {
    "player": {
      "pixel_pos": {"x": 165, "y": 45},
      "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
      "grid_pos": {"row": 1, "col": 5}
    },
    "goal": {
      "pixel_pos": {"x": 105, "y": 165},
      "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
      "grid_pos": {"row": 5, "col": 3}
    }
  },
  "grid": {
    "data": [[1,1,1,...], [1,0,0,...], ...],
    "height": 7,
    "width": 7
  },
  "render": {
    "cell_size": 30,
    "image_width": 210,
    "image_height": 210
  }
}
```

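For illustration, a minimal Python sketch of reading one such metadata file; `sample.json` is a placeholder name, and the fields follow the maze example above (other game types may add entities such as boxes):

```python
import json

# Placeholder file name; any sample's .json metadata is read the same way.
with open("sample.json") as f:
    state = json.load(f)

grid = state["grid"]["data"]                      # 2D list, height x width
player = state["entities"]["player"]["grid_pos"]  # {"row": ..., "col": ...}
goal = state["entities"]["goal"]["grid_pos"]

print(f"game: {state['game_type']}, "
      f"grid: {state['grid']['height']}x{state['grid']['width']}")
print(f"player at ({player['row']}, {player['col']}), "
      f"goal at ({goal['row']}, {goal['col']})")
```
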
### Metadata CSV

Each subdirectory contains a `metadata.csv` file with columns:

- `video`: Video filename
- `prompt`: Associated text prompt (currently empty)
- `input_image`: Initial state image filename

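A minimal sketch of reading one of these index files with Python's standard `csv` module; the subdirectory path is a placeholder and the actual game/difficulty folder names may differ:

```python
import csv
from pathlib import Path

# Hypothetical subdirectory path; point this at any game/difficulty folder.
subdir = Path("VR-Bench/train_data/maze/easy")

with open(subdir / "metadata.csv", newline="") as f:
    for row in csv.DictReader(f):
        video_path = subdir / row["video"]        # gameplay video (.mp4)
        image_path = subdir / row["input_image"]  # initial state image (.png)
        prompt = row["prompt"]                    # currently empty
        print(video_path, image_path, repr(prompt))
```
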
## Usage

This dataset can be used for:

- **Visual Planning**: Learning to plan sequences of actions from visual input
- **Multimodal Learning**: Combining video, image, and structured data
- **Reinforcement Learning**: Training agents on game environments
- **Video Understanding**: Learning temporal patterns in sequential decision-making

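To work with the files locally, the repository can be downloaded with `huggingface_hub`; the repo id below keeps the `[username]` placeholder from the citation:

```python
from huggingface_hub import snapshot_download

# Replace the placeholder repo id with the actual dataset repository.
local_dir = snapshot_download(repo_id="[username]/VR-Bench", repo_type="dataset")
print("Downloaded to:", local_dir)
```
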
## Dataset Statistics

- **Game Types**: 5
- **Difficulty Levels**: 3 per game type
- **Data Splits**: Training and test sets
- **File Types**: Video (`.mp4`), Images (`.png`), Metadata (`.json`), Index (`.csv`)

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{vr_bench_2025,
  title={VR-Bench: A Multimodal Video Reasoning Benchmark},
  author={[Author Name]},
  year={2025},
  url={https://huggingface.co/datasets/[username]/VR-Bench}
}
```

## License

This dataset is released under the MIT License.