ModalityDance/Omni-R1-Zero

@@ -11,20 +11,71 @@ datasets:
 base_model:
 - GAIR/Anole-7b-v0.1
 license: mit
-pipeline_tag: any-to-any
 ---
 # Omni-R1-Zero
-Omni-R1-Zero is trained without multimodal annotations. It bootstraps step-wise visualizations from text-only CoT seeds, then follows the SFT→RL recipe to learn interleaved multimodal reasoning.
-<p align="center">
-  <a href="https://arxiv.org/abs/2601.09536"><b>Paper</b>👁️</a> ·
-  <a href="https://github.com/ModalityDance/Omni-R1"><b>Code</b>🐙</a> ·
-  <a href="https://huggingface.co/datasets/ModalityDance/Omni-Bench"><b>Omni-Bench</b>🧪</a>
-</p>
 ## Citation
 ```bibtex
 @misc{cheng2026omnir1unifiedgenerativeparadigm,
       title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
@@ -35,4 +86,4 @@ Omni-R1-Zero is trained without multimodal annotations. It bootstraps step-wise
       primaryClass={cs.AI},
       url={https://arxiv.org/abs/2601.09536},
 }
-```

 base_model:
 - GAIR/Anole-7b-v0.1
 license: mit
 ---
 # Omni-R1-Zero
+[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2601.09536)
+[![Code](https://img.shields.io/badge/GitHub-Code-blue?style=for-the-badge&logo=github)](https://github.com/ModalityDance/Omni-R1)
+[![Omni-Bench](https://img.shields.io/badge/Dataset-Omni--Bench-fcc21b?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)
+## Overview
+**Omni-R1-Zero** is trained **without multimodal annotations**. It bootstraps **step-wise visualizations** from **text-only CoT seeds** (e.g., M3CoT), and then follows the same **SFT → RL** recipe as Omni-R1 to learn interleaved multimodal reasoning.
+## Usage
+```python
+import torch
+from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
+# 1) Load the processor and model
+model_id = "ModalityDance/Omni-R1-Zero"  # or a local checkpoint path
+processor = ChameleonProcessor.from_pretrained(model_id)
+model = ChameleonForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+model.eval()
+# 2) Prepare a single input
+prompt = "You are a helpful assistant.\nUser: Which of these would appear shinier when polished? A. Metal spoon B. Wooden spoon\nThink with images first, the image reasoning process and answer are enclosed within <reserved12856> <reserved12857> and <reserved12866> <reserved12867> XML tags, respectively.\nAssistant:"
+inputs = processor(
+    prompt,
+    padding=False,
+    return_for_text_completion=True,
+    return_tensors="pt",
+).to(model.device)
+# 3) Call the model
+outputs = model.generate(
+    **inputs,
+    max_length=4096,
+    do_sample=True,
+    temperature=1.0,
+    top_p=0.9,
+    pad_token_id=1,
+    multimodal_generation_mode="unrestricted",
+)
+# 4) Decode the first sequence (prompt + continuation), keeping special tokens
+text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
+print(text)
+```
+For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
+https://github.com/ModalityDance/Omni-R1
+## License
+This project is licensed under the **MIT License**.
+It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.
 ## Citation
 ```bibtex
 @misc{cheng2026omnir1unifiedgenerativeparadigm,
       title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
       primaryClass={cs.AI},
       url={https://arxiv.org/abs/2601.09536},
 }
+```
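
Note on the Usage snippet in this diff: the prompt asks the model to wrap its image reasoning in `<reserved12856>` ... `<reserved12857>` and its answer in `<reserved12866>` ... `<reserved12867>`. The sketch below shows one way the answer could be pulled out of the decoded string. It assumes the markers survive decoding as literal text (the snippet decodes with `skip_special_tokens=False`) and that the decoded string still contains the prompt, which itself mentions both tag pairs with nothing between them. The helper names `extract_tagged` and `final_answer` and the sample string are hypothetical, made up for illustration; they are not part of the Omni-R1 repository.

```python
import re

# Tag pairs copied from the prompt in the Usage snippet.
REASONING_TAGS = ("<reserved12856>", "<reserved12857>")
ANSWER_TAGS = ("<reserved12866>", "<reserved12867>")


def extract_tagged(text, tags):
    """Return the non-empty, stripped contents of every open/close span."""
    open_tag, close_tag = (re.escape(t) for t in tags)
    pattern = re.compile(open_tag + r"(.*?)" + close_tag, flags=re.DOTALL)
    # The prompt mentions each tag pair with only a space in between;
    # dropping empty spans filters out that echo.
    return [s.strip() for s in pattern.findall(text) if s.strip()]


def final_answer(text):
    """Return the last answer span, or None if no answer was emitted."""
    spans = extract_tagged(text, ANSWER_TAGS)
    return spans[-1] if spans else None


# Hypothetical decoded output (prompt echo followed by a made-up completion).
sample = (
    "Think with images first, the image reasoning process and answer are "
    "enclosed within <reserved12856> <reserved12857> and "
    "<reserved12866> <reserved12867> XML tags, respectively.\nAssistant: "
    "<reserved12856>polished metal reflects light<reserved12857>"
    "<reserved12866>A<reserved12867>"
)

print(final_answer(sample))
print(extract_tagged(sample, REASONING_TAGS))
```

In real outputs the reasoning span would likely also contain image tokens, so the reasoning text may need further cleanup before display; the full interleaved decoding scripts are in the GitHub repository referenced above.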