charlesdj committed
Commit 13e998e · verified · 1 Parent(s): fbe6ef0

Update README.md

Files changed (1)
  1. README.md +61 -8
README.md CHANGED
@@ -10,20 +10,73 @@ datasets:
  - ModalityDance/Omni-Bench
  base_model:
  - GAIR/Anole-7b-v0.1
- pipeline_tag: any-to-any
  ---

  # Omni-R1

- Omni-R1 is trained with multimodal interleaved supervision. It uses PeSFT for stable functional image generation, then PeRPO for RL refinement on unified tasks.
+ [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2601.09536)
+ [![Code](https://img.shields.io/badge/GitHub-Code-blue?style=for-the-badge&logo=github)](https://github.com/ModalityDance/Omni-R1)
+ [![Omni-Bench](https://img.shields.io/badge/Dataset-Omni--Bench-fcc21b?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)

- <p align="center">
- <a href="https://arxiv.org/abs/2601.09536"><b>Paper</b>👁️</a> ·
- <a href="https://github.com/ModalityDance/Omni-R1"><b>Code</b>🐙</a> ·
- <a href="https://huggingface.co/datasets/ModalityDance/Omni-Bench"><b>Omni-Bench</b>🧪</a>
- </p>
+ ## Overview
+
+ **Omni-R1** is trained with multimodal interleaved supervision. It uses **PeSFT** for stable functional image generation, then **PeRPO** for RL refinement on unified tasks—enabling interleaved multimodal reasoning trajectories.
+
+ ## Usage
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
+
+ # 1) Import & load
+ model_id = "ModalityDance/Omni-R1"  # or a local checkpoint path
+ processor = ChameleonProcessor.from_pretrained(model_id)
+ model = ChameleonForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ model.eval()
+
+ # 2) Prepare a single input
+ prompt = "What is the smiling man in the image wearing? <image>"
+ image = Image.open("image.png").convert("RGB")
+
+ inputs = processor(
+     prompt,
+     images=[image],
+     padding=False,
+     return_for_text_completion=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # 3) Call the model
+ outputs = model.generate(
+     **inputs,
+     max_length=4096,
+     do_sample=True,
+     temperature=0.5,
+     top_p=0.9,
+     pad_token_id=1,
+     multimodal_generation_mode="unrestricted",
+ )
+
+ # 4) Get results
+ text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
+ print(text)
+ ```
+
+ For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
+ https://github.com/ModalityDance/Omni-R1
+
+ ## License
+
+ This project is licensed under the **MIT License**.
+ It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.

  ## Citation
+
  ```bibtex
  @misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
@@ -34,4 +87,4 @@ Omni-R1 is trained with multimodal interleaved supervision. It uses PeSFT for st
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536},
  }
- ```
+ ```
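
The updated README defers batch JSONL inference to the GitHub repository. As a rough sketch only, the single-example snippet above can be wrapped in a loop over a JSONL file; the field names `prompt` and `image`, the file names `inputs.jsonl` / `outputs.jsonl`, and the output format below are assumptions for illustration, not the repository's actual scripts.

```python
import json

import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# Load the model once, as in the README usage snippet.
model_id = "ModalityDance/Omni-R1"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Assumed record schema: {"prompt": "... <image>", "image": "path.png"}.
with open("inputs.jsonl") as f, open("outputs.jsonl", "w") as out:
    for line in f:
        record = json.loads(line)
        image = Image.open(record["image"]).convert("RGB")
        inputs = processor(
            record["prompt"],
            images=[image],
            padding=False,
            return_for_text_completion=True,
            return_tensors="pt",
        ).to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=4096,
                do_sample=True,
                temperature=0.5,
                top_p=0.9,
                pad_token_id=1,
                multimodal_generation_mode="unrestricted",
            )
        text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
        out.write(json.dumps({"prompt": record["prompt"], "output": text}) + "\n")
```

Each line of `inputs.jsonl` would then be a JSON object such as `{"prompt": "Describe the scene. <image>", "image": "example.png"}`.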