@@ -10,20 +10,73 @@ datasets:
 - ModalityDance/Omni-Bench
 base_model:
 - GAIR/Anole-7b-v0.1
-pipeline_tag: any-to-any
 ---
 # Omni-R1
-Omni-R1 is trained with multimodal interleaved supervision. It uses PeSFT for stable functional image generation, then PeRPO for RL refinement on unified tasks.
-<p align="center">
-  <a href="https://arxiv.org/abs/2601.09536"><b>Paper</b>👁️</a> ·
-  <a href="https://github.com/ModalityDance/Omni-R1"><b>Code</b>🐙</a> ·
-  <a href="https://huggingface.co/datasets/ModalityDance/Omni-Bench"><b>Omni-Bench</b>🧪</a>
-</p>
 ## Citation
 ```bibtex
 @misc{cheng2026omnir1unifiedgenerativeparadigm,
       title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
@@ -34,4 +87,4 @@ Omni-R1 is trained with multimodal interleaved supervision. It uses PeSFT for st
       primaryClass={cs.AI},
       url={https://arxiv.org/abs/2601.09536},
 }
-```

 - ModalityDance/Omni-Bench
 base_model:
 - GAIR/Anole-7b-v0.1
 ---
 # Omni-R1
+[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2601.09536)
+[![Code](https://img.shields.io/badge/GitHub-Code-blue?style=for-the-badge&logo=github)](https://github.com/ModalityDance/Omni-R1)
+[![Omni-Bench](https://img.shields.io/badge/Dataset-Omni--Bench-fcc21b?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)
+## Overview
+**Omni-R1** is trained with multimodal interleaved supervision. It first uses **PeSFT** for stable functional image generation, then **PeRPO** for RL refinement on unified tasks, which enables interleaved multimodal reasoning trajectories.
+## Usage
+```python
+# 1) Import & load
+import torch
+from PIL import Image
+from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
+model_id = "ModalityDance/Omni-R1"  # or a local checkpoint path
+processor = ChameleonProcessor.from_pretrained(model_id)
+model = ChameleonForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+model.eval()
+# 2) Prepare a single input
+prompt = "What is the smiling man in the image wearing? <image>"
+image = Image.open("image.png").convert("RGB")
+inputs = processor(
+    prompt,
+    images=[image],
+    padding=False,
+    return_for_text_completion=True,
+    return_tensors="pt",
+).to(model.device)
+# 3) Call the model
+outputs = model.generate(
+    **inputs,
+    max_length=4096,
+    do_sample=True,
+    temperature=0.5,
+    top_p=0.9,
+    pad_token_id=1,
+    multimodal_generation_mode="unrestricted",
+)
+# 4) Get results: decode the generated ids (special tokens are kept in the string)
+text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
+print(text)
+```
+For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
+https://github.com/ModalityDance/Omni-R1
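The repository's batch JSONL script is not reproduced here. As a rough sketch only, the inputs for a loop over the single-example call above could be read as shown below. It assumes a hypothetical schema in which each line is a JSON object with `prompt` and `image` fields; those field names, the `Request` type, and the `read_jsonl_requests` helper are illustrative and are not taken from the repository.

```python
import json
from pathlib import Path
from typing import Iterator, NamedTuple


class Request(NamedTuple):
    """One inference request (hypothetical schema, not the repository's)."""
    prompt: str
    image_path: Path


def read_jsonl_requests(path: str) -> Iterator[Request]:
    """Yield one Request per non-empty line of a JSONL file.

    Assumes each line looks like {"prompt": "...", "image": "some/file.png"}.
    Relative image paths are resolved against the JSONL file's directory.
    """
    jsonl_path = Path(path)
    with jsonl_path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
                prompt, image = record["prompt"], record["image"]
            except (json.JSONDecodeError, KeyError, TypeError) as error:
                # Report the offending line instead of failing silently.
                raise ValueError(f"{path}:{line_number}: bad record") from error
            image_path = Path(image)
            if not image_path.is_absolute():
                image_path = jsonl_path.parent / image_path
            yield Request(prompt=prompt, image_path=image_path)
```

Each yielded request could then go through steps 2 to 4 above, opening `image_path` with `Image.open` and reusing the already loaded `processor` and `model`.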
+## License
+This project is licensed under the **MIT License**.
+It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.
 ## Citation
 ```bibtex
 @misc{cheng2026omnir1unifiedgenerativeparadigm,
       title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
       primaryClass={cs.AI},
       url={https://arxiv.org/abs/2601.09536},
 }
+```