Lekr0 committed
Commit b6fb2b0 · verified · 1 Parent(s): 8765573

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.

Files changed (50)
  1. ICL/DAPO/verl-recipe/.pre-commit-config.yaml +8 -0
  2. ICL/DAPO/verl-recipe/CODEOWNERS +0 -0
  3. ICL/DAPO/verl-recipe/LICENSE +202 -0
  4. ICL/DAPO/verl-recipe/README.md +47 -0
  5. ICL/DAPO/verl-recipe/collabllm/collabllm_agent_loop.py +139 -0
  6. ICL/DAPO/verl-recipe/collabllm/collabllm_interation.py +374 -0
  7. ICL/DAPO/verl-recipe/collabllm/metrics/accuracy.py +104 -0
  8. ICL/DAPO/verl-recipe/collabllm/metrics/bleu_score.py +115 -0
  9. ICL/DAPO/verl-recipe/collabllm/metrics/interactivity.py +108 -0
  10. ICL/DAPO/verl-recipe/collabllm/metrics/pass_rate.py +138 -0
  11. ICL/DAPO/verl-recipe/collabllm/process_dataset.py +239 -0
  12. ICL/DAPO/verl-recipe/collabllm/reward_function.py +227 -0
  13. ICL/DAPO/verl-recipe/collabllm/train_rl_collabllm.sh +76 -0
  14. ICL/DAPO/verl-recipe/collabllm/train_sft_collabllm.sh +32 -0
  15. ICL/DAPO/verl-recipe/dapo/README.md +192 -0
  16. ICL/DAPO/verl-recipe/dapo/dapo_ray_trainer.py +418 -0
  17. ICL/DAPO/verl-recipe/dapo/main_dapo.py +185 -0
  18. ICL/DAPO/verl-recipe/dapo/prepare_dapo_data.sh +17 -0
  19. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_vl_32b_fsdp2_npu.sh +151 -0
  20. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_vl_3b_fsdp2_npu.sh +154 -0
  21. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_vl_7b_fsdp2_npu.sh +153 -0
  22. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_vl_30b_fsdp2_npu.sh +152 -0
  23. ICL/DAPO/verl-recipe/dapo/run_dapo_early_qwen2.5_32b.sh +129 -0
  24. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b.sh +131 -0
  25. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b_fsdp2_20k_npu.sh +151 -0
  26. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b_fsdp2_4k_npu.sh +155 -0
  27. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b_npu.sh +140 -0
  28. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b_rollout_corr.sh +176 -0
  29. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_7b_npu.sh +142 -0
  30. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_14b_base_npu.sh +139 -0
  31. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_30b_fsdp_6k_npu.sh +161 -0
  32. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_moe_30b_base_fsdp_npu.sh +143 -0
  33. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_moe_30b_megatron_npu.sh +170 -0
  34. ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_moe_30b_vllm_fp8_rollout.sh +171 -0
  35. ICL/DAPO/verl-recipe/dapo/run_dapo_wo_ds_qwen2.5_32b.sh +126 -0
  36. ICL/DAPO/verl-recipe/dapo/runtime_env.yaml +5 -0
  37. ICL/DAPO/verl-recipe/dapo/test_dapo_7b.sh +131 -0
  38. ICL/DAPO/verl-recipe/dapo/test_dapo_7b_math.sh +131 -0
  39. ICL/DAPO/verl-recipe/dapo/test_dapo_7b_math_lora.sh +131 -0
  40. ICL/DAPO/verl-recipe/dapo/test_dapo_7b_math_megatron.sh +132 -0
  41. ICL/DAPO/verl-recipe/dapo/test_dapo_8b_megatron_fp16.sh +142 -0
  42. ICL/DAPO/verl-recipe/dapo/test_dapo_8b_megatron_fp8train.sh +201 -0
  43. ICL/DAPO/verl-recipe/dapo/test_dapo_dspk_671b_megatron_96gb.sh +143 -0
  44. ICL/DAPO/verl-recipe/dapo/test_dapo_glm_air_megatron.sh +197 -0
  45. ICL/DAPO/verl-recipe/dapo/test_dapo_gptoss_20b_megatron.sh +248 -0
  46. ICL/DAPO/verl-recipe/dapo/test_dapo_qwen3_30b_math.sh +127 -0
  47. ICL/DAPO/verl-recipe/dapo/test_dapo_qwen3_30b_math_single_node.sh +127 -0
  48. ICL/DAPO/verl-recipe/dapo/test_dapo_qwen3_moe_30b_megatron_fp16.sh +148 -0
  49. ICL/DAPO/verl-recipe/dapo/test_dapo_qwen3next_80b_megatron.sh +232 -0
  50. ICL/DAPO/verl-recipe/deepeyes/README.md +49 -0
ICL/DAPO/verl-recipe/.pre-commit-config.yaml ADDED
@@ -0,0 +1,8 @@
+ repos:
+   - repo: https://github.com/astral-sh/ruff-pre-commit
+     rev: "v0.14.10"
+     hooks:
+       - id: ruff
+         args: ["--fix", "--show-fixes", "--output-format=full"]
+         exclude: ^.*\.(ipynb)$
+       - id: ruff-format
ICL/DAPO/verl-recipe/CODEOWNERS ADDED
File without changes
ICL/DAPO/verl-recipe/LICENSE ADDED
@@ -0,0 +1,202 @@
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
ICL/DAPO/verl-recipe/README.md ADDED
@@ -0,0 +1,47 @@
+ # verl-recipe
+
+ `verl-recipe` hosts recipes based on [verl](https://github.com/volcengine/verl) contributed by the community.
+
+ ## Usage
+
+ `verl-recipe` can be used as a submodule of `verl`, mounted as `verl/recipe` for backward compatibility:
+
+ ```bash
+ git clone https://github.com/verl-project/verl.git
+ cd verl
+ git submodule update --init --recursive recipe
+ ```
+
+ ## Available Recipes
+
+ - [retool](https://github.com/verl-project/verl-recipe/tree/main/retool): Reinforcement Learning for Strategic Tool Use in LLMs
+ - [langgraph_agent](https://github.com/verl-project/verl-recipe/tree/main/langgraph_agent): A tiny example demonstrating multi-turn rollout with [LangGraph ReactAgent](https://langchain-ai.github.io/langgraph/agents/overview/) to solve math expressions.
+ - [spo](https://github.com/verl-project/verl-recipe/tree/main/spo): [Single-stream Policy Optimization](https://arxiv.org/abs/2509.13232).
+ - TBA...
+
+ ## Contribution
+
+ ### Version Specification
+
+ Recipes are recommended to specify the verl version they require, e.g.,
+
+ ```
+ # release version
+ verl==0.6.0
+
+ # dev version
+ verl@git+https://github.com/volcengine/verl.git@313dfdb2199124a37189e32e6d4a6c654379f2d4
+ ```
+
+ ### Code Linting and Formatting
+
+ To maximize flexibility while minimizing meaningless changes, we apply `pre-commit` but only enforce code linting and formatting with `ruff`. Use it as follows:
+
+ ```bash
+ pip install pre-commit
+ pre-commit install
+ # for staged changes
+ pre-commit run
+ # for all files in the repo
+ pre-commit run --all-files
+ ```
ICL/DAPO/verl-recipe/collabllm/collabllm_agent_loop.py ADDED
@@ -0,0 +1,139 @@
+ # Copyright 2025 CollabLLM team and/or its affiliates
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates
+
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import logging
+ import os
+ from copy import deepcopy
+ from typing import Any
+ from uuid import uuid4
+
+ from recipe.collabllm.utils import is_valid_messages
+
+ from verl.experimental.agent_loop.agent_loop import AgentLoopOutput
+ from verl.experimental.agent_loop.tool_agent_loop import AgentData, AgentState, ToolAgentLoop
+ from verl.utils.rollout_trace import rollout_trace_op
+ from verl.workers.rollout.schemas import Message
+
+ logger = logging.getLogger(__file__)
+ logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
+
+
+ class CollabLLMAgentLoop(ToolAgentLoop):
+     @rollout_trace_op
+     async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput:
+         messages = list(kwargs["raw_prompt"])
+         image_data = deepcopy(kwargs.get("multi_modal_data", {}).get("image", None))
+         metrics = {}
+         request_id = uuid4().hex
+         tools_kwargs = kwargs.get("tools_kwargs", {})
+
+         # Initialize interaction if needed
+         interaction = None
+         interaction_kwargs = {}
+         if self.interaction_config_file:
+             interaction_kwargs = kwargs["extra_info"]["interaction_kwargs"]
+             if "name" not in interaction_kwargs:
+                 raise ValueError("'name' key is required in interaction_kwargs")
+             interaction_name = interaction_kwargs["name"]
+             if interaction_name not in self.interaction_map:
+                 raise ValueError(
+                     f"Interaction '{interaction_name}' not found in interaction_map. Available interactions: "
+                     f"{list(self.interaction_map.keys())}"
+                 )
+             interaction = self.interaction_map[interaction_name]
+             await interaction.start_interaction(request_id, **interaction_kwargs)
+
+         # Create AgentData instance to encapsulate all state
+         agent_data = AgentData(
+             messages=messages,
+             image_data=image_data,
+             metrics=metrics,
+             request_id=request_id,
+             tools_kwargs=tools_kwargs,
+             interaction=interaction,
+             interaction_kwargs=interaction_kwargs,
+         )
+
+         # For CollabLLM, first generate the model responses
+         await self._handle_pending_state(agent_data, sampling_params)
+         status = await self._handle_generating_state(agent_data, sampling_params)
+
+         if status == AgentState.TERMINATED:
+             # Tell the reward manager to score -1 and skip future interaction
+             # to avoid reward hacking with an incomplete message
+             num_repeats = 0
+         else:
+             # Then collect interaction rollouts
+             num_repeats = self.config.actor_rollout_ref.rollout.multi_turn.num_repeat_rollouts
+
+         interaction_requests = [deepcopy(agent_data) for _ in range(num_repeats)]
+
+         # Messages are only used in the CollabLLM reward manager
+         messages_lst = []
+         for _agent_data in interaction_requests:
+             if not is_valid_messages(_agent_data.messages[-1]):
+                 break
+
+             prev_msg_len = len(_agent_data.messages)
+             await self.run_agent_data_loop(_agent_data, sampling_params, AgentState.INTERACTING)
+             messages_lst.append([Message(**msg) for msg in _agent_data.messages])
+
+             if interaction.config.get("enable_log"):
+                 print(f"Assistant: ...{messages_lst[-1][prev_msg_len - 1].content[-100:]}")
+                 print(f"User: {messages_lst[-1][prev_msg_len].content[:100]}...")
+
+         # Finalize output
+         response_ids = agent_data.prompt_ids[-len(agent_data.response_mask) :]
+         prompt_ids = agent_data.prompt_ids[: len(agent_data.prompt_ids) - len(agent_data.response_mask)]
+         multi_modal_data = {"image": agent_data.image_data} if agent_data.image_data is not None else {}
+
+         output = AgentLoopOutput(
+             prompt_ids=prompt_ids,
+             response_ids=response_ids[: self.response_length],
+             response_mask=agent_data.response_mask[: self.response_length],
+             multi_modal_data=multi_modal_data,
+             response_logprobs=agent_data.response_logprobs[: self.response_length]
+             if agent_data.response_logprobs
+             else None,
+             num_turns=agent_data.user_turns + agent_data.assistant_turns + 1,
+             metrics=agent_data.metrics,
+             extra_fields={
+                 "turn_scores": agent_data.turn_scores,
+                 "messages": {"messages": messages_lst},  # compatible with the sglang interaction
+             },
+         )
+         return output
+
+     async def run_agent_data_loop(self, agent_data: AgentData, sampling_params: dict[str, Any], state: AgentState):
+         """
+         Run the agent data loop to process the agent data.
+
+         Args:
+             agent_data (AgentData): The agent data to process.
+             sampling_params (dict[str, Any]): The sampling parameters.
+             state (AgentState): The initial state of the agent.
+         """
+         while state != AgentState.TERMINATED:
+             if state == AgentState.PENDING:
+                 state = await self._handle_pending_state(agent_data, sampling_params)
+             elif state == AgentState.GENERATING:
+                 state = await self._handle_generating_state(agent_data, sampling_params)
+             elif state == AgentState.PROCESSING_TOOLS:
+                 state = await self._handle_processing_tools_state(agent_data)
+             elif state == AgentState.INTERACTING:
+                 state = await self._handle_interacting_state(agent_data)
+             else:
+                 logger.error(f"Invalid state: {state}")
+                 state = AgentState.TERMINATED
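The `run_agent_data_loop` above is a small finite-state machine: each handler consumes the shared agent data and returns the next `AgentState`, and the loop exits on `TERMINATED` (falling back to termination on any unknown state). A minimal standalone sketch of the same dispatch pattern, with illustrative stand-in states and handlers rather than verl's actual classes:

```python
from enum import Enum, auto


class AgentState(Enum):
    PENDING = auto()
    GENERATING = auto()
    INTERACTING = auto()
    TERMINATED = auto()


def run_loop(state: AgentState, max_steps: int = 10) -> list[str]:
    """Drive a toy state machine, recording each state visited."""
    # Each handler returns the next state; TERMINATED stops the loop.
    handlers = {
        AgentState.PENDING: lambda: AgentState.GENERATING,
        AgentState.GENERATING: lambda: AgentState.INTERACTING,
        AgentState.INTERACTING: lambda: AgentState.TERMINATED,
    }
    trace = []
    for _ in range(max_steps):
        if state == AgentState.TERMINATED:
            break
        trace.append(state.name)
        handler = handlers.get(state)
        if handler is None:
            # Unknown state: fail closed, as the recipe's else branch does
            state = AgentState.TERMINATED
        else:
            state = handler()
    return trace


print(run_loop(AgentState.PENDING))  # ['PENDING', 'GENERATING', 'INTERACTING']
```

Keeping the transitions in per-state handlers makes it straightforward to add a state (e.g. tool processing) without touching the loop body itself.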
ICL/DAPO/verl-recipe/collabllm/collabllm_interation.py ADDED
@@ -0,0 +1,374 @@
+ # Copyright 2024 CollabLLM Ltd. and/or its affiliates
+ # Copyright 2024 Bytedance Ltd. and/or its affiliates
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import asyncio
+ import copy
+ import logging
+ import os
+ from typing import Any, Optional
+ from uuid import uuid4
+
+ from recipe.collabllm.utils import extract_json, remove_think_block
+
+ from verl.interactions.base import BaseInteraction
+ from verl.utils.rollout_trace import rollout_trace_op
+
+ logger = logging.getLogger(__name__)
+ logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
+
+ TERMINATION_SIGNAL = "[[TERMINATE CHAT]]"
+ USER_PROMPT_TEMPLATE = """You are role-playing as a human USER interacting with an AI collaborator to complete a specific task. Your goal is to generate realistic, natural responses that a user might give in this scenario.
+
+ ## Input Information:
+ You will be provided with:
+ - Task Description: The type of task you are trying to accomplish.
+ - Complete Prompt or Reference Goal: This field may include the complete user request/query or a reference answer to the user's request. Use this field to understand the user's intent, requirements, or what would count as a satisfactory outcome.
+ - Chat History: The ongoing conversation between you (as the user) and the AI.
+
+ Inputs:
+ <|The Start of Task Description (Not visible to the AI)|>
+ {task_desc}
+ <|The End of Task Description|>
+
+ <|The Start of Complete Prompt or Reference Goal (Not visible to the AI)|>
+ {single_turn_prompt}
+ <|The End of Complete Prompt or Reference Goal|>
+
+ <|The Start of Chat History|>
+ {chat_history}
+ <|The End of Chat History|>
+
+ ## Guidelines:
+ - Stay in Character: Role-play as a human USER. You are NOT an AI. Maintain a consistent personality throughout the chat.
+ - Minimize Effort: IMPORTANT! As a user, avoid being too detailed in your responses. Provide vague or incomplete demands in the early stages of the conversation to minimize your effort. Let the AI ask for clarification rather than providing everything upfront.
+ - Knowledge Background: Reflect the user's knowledge level in the role-playing. If the user is less knowledgeable about a task, they might not notice incorrect statements. Ask questions that demonstrate your current understanding and areas of confusion.
+ - Occasionally Make Mistakes: Real-world users might misspell words, provide incorrect dates, give wrong information, or ask unclear questions. Simulate this behavior to reflect natural interactions.
+ - Mention Personal Preferences: Include preferences or constraints that might influence your requests or responses. For example, "I prefer short answers," "I need this done quickly," or "I like detailed comments in code."
+ - Goal-Oriented: Keep the chat focused on your intent. Avoid small talk or digressions. Redirect the chat back to the main objective if it starts to stray.
+
+ ## Output Format:
+ You should output a JSON object with three entries:
+ - "current_answer" (str): Briefly summarize the AI's current solution to the task.
+ - "thought" (str): Output your thought process as a user deciding what to say next. Consider:
+   1. Have you obtained a satisfactory solution from the AI? If yes, you can terminate this chat.
+   2. If not, what specific part of the problem or solution are you struggling with?
+   3. Has the AI asked you to perform a task or answer a question? If so, how should you approach it?
+   4. Are you noticing any patterns or potential misunderstandings that need clarification?
+   5. If you're stuck, how can you phrase your question to get the most helpful response while demonstrating your current understanding?
+ - "response" (str): Based on your thought process, respond to the AI as the user you are role-playing. Stop immediately when the user's response is completed.
+
+ ## Important Notes:
+ - Respond Based on Previous Messages: Your responses should be based on the context of the current chat history. Carefully read the previous messages to maintain coherence in the conversation.
+ - Conversation Flow: If "Chat History" is empty, start the conversation from scratch with an initial request. Otherwise, continue based on the existing conversation.
+ - Don't Copy Input Directly: Use the provided information for understanding context only. Avoid copying target queries or any provided information directly in your responses.
+ - Completion Signal: Use "{termination_signal}" as your response when you believe your goal has been solved or if you determine the AI cannot help further.
+ - Double-check that the JSON object is formatted correctly. Ensure that all fields are present and properly structured.
+
+ Remember to stay in character as a user throughout your response, and follow the instructions and guidelines carefully."""  # noqa: E501
+
+
+ class CollabLLMInteraction(BaseInteraction):
+     """A demo interaction for calculating the reward of CollabLLM.
+
+     - `start_interaction`: start an interaction instance for a trajectory.
+     - `generate_response`: generate the simulated user's response.
+     - `calculate_score`: calculate the score of the interaction.
+     - `finalize_interaction`: finalize the interaction instance.
+     """
+
+     def __init__(self, config: dict):
+         super().__init__(config)
+         _config = copy.deepcopy(config)
+
+         _config.pop("enable_log", None)
+
+         self.name = _config.pop("name")
+         self.user_model = _config.pop("user_model")
+
+         self.termination_signal = _config.pop("termination_signal", TERMINATION_SIGNAL)
+         self.num_retries = _config.pop("num_retries", 3)
+
+         # Remaining keys are forwarded to the user model as completion kwargs
+         self.user_model_kwargs = _config
+
+         self._instance_dict = {}
+
+     async def start_interaction(
+         self, instance_id: Optional[str] = None, ground_truth: Optional[str] = None, **kwargs
+     ) -> str:
+         if instance_id is None:
+             instance_id = str(uuid4())
+         self._instance_dict[instance_id] = {
+             "response": "",
+             "ground_truth": ground_truth,
+             "reward": 0.0,
+         }
+         self.interaction_kwargs = kwargs
+         assert "single_turn_prompt" in kwargs, "single_turn_prompt is required in interaction_kwargs"
+         return instance_id
+
+     @rollout_trace_op
+     async def generate_response(
+         self, instance_id: str, messages: list[dict[str, Any]], **kwargs
+     ) -> tuple[bool, str, float, dict]:
+         assert messages[-1]["role"] in ["system", "assistant"], (
+             "Last message input to the user model must be from system or assistant role"
+         )
+
+         import litellm
+
+         chat_history = self._parse_messages(messages, strip_sys_prompt=True)
+         prompt = USER_PROMPT_TEMPLATE.format(
+             task_desc=self.interaction_kwargs.get("task_desc", "general assistance task"),
+             single_turn_prompt=self.interaction_kwargs["single_turn_prompt"],
+             chat_history=chat_history,
+             termination_signal=self.termination_signal,
+         )
+         response = ""
+         for i in range(self.num_retries):
+             try:
+                 full_response = (
+                     (
+                         await litellm.acompletion(
+                             model=self.user_model,
+                             messages=[{"role": "user", "content": prompt}],
+                             **self.user_model_kwargs,
+                         )
+                     )
+                     .choices[0]
+                     .message.content
+                 )
+             except litellm.RateLimitError as e:
+                 logger.warning(f"[CollabLLMInteraction] hit RateLimitError: {e}. Retrying...")
+                 await asyncio.sleep(max(2**i, 60))
+                 continue
+             except Exception as e:
+                 logger.exception(f"An unexpected error occurred in CollabLLMAgentLoop: {e}")
+                 continue
+
+             try:
+                 if isinstance(full_response, str):
+                     full_response = extract_json(full_response)
+             except Exception as e:
+                 logger.warning(f"[CollabLLMInteraction] Error extracting JSON: {e}. Retrying...")
+                 continue
+
+             if isinstance(full_response, dict):
+                 keys = full_response.keys()
+                 if {"current_answer", "thought", "response"}.issubset(keys):
+                     response = full_response.pop("response")
+                     if isinstance(response, str):
+                         break
+                     else:
+                         logger.warning(
+                             f"[CollabLLMInteraction] got an invalid response {response} "
+                             f"full_response {full_response}. Retrying..."
+                         )
+                         continue
+                 else:
+                     logger.warning(f"[CollabLLMInteraction] Keys {keys} do not match expected keys. Retrying...")
+                     continue
+
+         self._instance_dict[instance_id]["response"] = response
+         logger.debug(f"[CollabLLMInteraction] User: {response}")
+         should_terminate_sequence = self.termination_signal in response
187
+ reward = 0.0
188
+
189
+ return should_terminate_sequence, response, reward, {}
190
+
191
+ async def finalize_interaction(self, instance_id: str, **kwargs) -> None:
192
+ del self._instance_dict[instance_id]
193
+
194
+ def _parse_messages(self, messages, strip_sys_prompt=True):
195
+ if messages is None:
196
+ return ""
197
+
198
+ if strip_sys_prompt:
199
+ messages = [msg for msg in messages if msg["role"] != "system"]
200
+
201
+ messages = [remove_think_block(msg) for msg in messages]
202
+
203
+ chat = "\n".join(f"**{m['role'].capitalize()}**: {m['content']}" for m in messages)
204
+
205
+ return chat
206
+
207
+
208
+ def extract_json(s):
209
+ def convert_value(value):
210
+ true_values = {"true": True, "false": False, "null": None}
211
+ value_lower = value.lower()
212
+ if value_lower in true_values:
213
+ return true_values[value_lower]
214
+ try:
215
+ if "." in value or "e" in value.lower():
216
+ return float(value)
217
+ else:
218
+ return int(value)
219
+ except ValueError:
220
+ return value # Return as string if not a number
221
+
222
+ def parse_number(s, pos):
223
+ start = pos
224
+ while pos < len(s) and s[pos] in "-+0123456789.eE":
225
+ pos += 1
226
+ num_str = s[start:pos]
227
+ try:
228
+ if "." in num_str or "e" in num_str.lower():
229
+ return float(num_str), pos
230
+ else:
231
+ return int(num_str), pos
232
+ except ValueError:
233
+ logger.error(f"Invalid number at position {start}: {num_str}")
234
+ raise
235
+
236
+ def skip_whitespace(s, pos):
237
+ while pos < len(s) and s[pos] in " \t\n\r":
238
+ pos += 1
239
+ return pos
240
+
241
+ def parse_string(s, pos):
242
+ quote_char = s[pos]
243
+ assert quote_char in ('"', "'")
244
+ pos += 1
245
+ result = ""
246
+ while pos < len(s):
247
+ c = s[pos]
248
+ if c == "\\":
249
+ pos += 1
250
+ if pos >= len(s):
251
+ raise ValueError("Invalid escape sequence")
252
+ c = s[pos]
253
+ escape_sequences = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", quote_char: quote_char}
254
+ result += escape_sequences.get(c, c)
255
+ elif c == quote_char:
256
+ pos += 1
257
+ # Attempt to convert to a number if possible
258
+ converted_value = convert_value(result)
259
+ return converted_value, pos
260
+ else:
261
+ result += c
262
+ pos += 1
263
+ raise ValueError("Unterminated string")
264
+
265
+ def parse_key(s, pos):
266
+ pos = skip_whitespace(s, pos)
267
+ if s[pos] in ('"', "'"):
268
+ key, pos = parse_string(s, pos)
269
+ return key, pos
270
+ else:
271
+ raise ValueError(f"Expected string for key at position {pos}")
272
+
273
+ def parse_object(s, pos):
274
+ obj = {}
275
+ assert s[pos] == "{"
276
+ pos += 1
277
+ pos = skip_whitespace(s, pos)
278
+ while pos < len(s) and s[pos] != "}":
279
+ pos = skip_whitespace(s, pos)
280
+ key, pos = parse_key(s, pos)
281
+ pos = skip_whitespace(s, pos)
282
+ if pos >= len(s) or s[pos] != ":":
283
+ raise ValueError(f'Expected ":" at position {pos}')
284
+ pos += 1
285
+ pos = skip_whitespace(s, pos)
286
+ value, pos = parse_value(s, pos)
287
+ obj[key] = value
288
+ pos = skip_whitespace(s, pos)
289
+ if pos < len(s) and s[pos] == ",":
290
+ pos += 1
291
+ pos = skip_whitespace(s, pos)
292
+ elif pos < len(s) and s[pos] == "}":
293
+ break
294
+ elif pos < len(s) and s[pos] != "}":
295
+ raise ValueError(f'Expected "," or "}}" at position {pos}')
296
+ if pos >= len(s) or s[pos] != "}":
297
+ raise ValueError(f'Expected "}}" at position {pos}')
298
+ pos += 1
299
+ return obj, pos
300
+
301
+ def parse_array(s, pos):
302
+ lst = []
303
+ assert s[pos] == "["
304
+ pos += 1
305
+ pos = skip_whitespace(s, pos)
306
+ while pos < len(s) and s[pos] != "]":
307
+ value, pos = parse_value(s, pos)
308
+ lst.append(value)
309
+ pos = skip_whitespace(s, pos)
310
+ if pos < len(s) and s[pos] == ",":
311
+ pos += 1
312
+ pos = skip_whitespace(s, pos)
313
+ elif pos < len(s) and s[pos] == "]":
314
+ break
315
+ elif pos < len(s) and s[pos] != "]":
316
+ raise ValueError(f'Expected "," or "]" at position {pos}')
317
+ if pos >= len(s) or s[pos] != "]":
318
+ raise ValueError(f'Expected "]" at position {pos}')
319
+ pos += 1
320
+ return lst, pos
321
+
322
+ def parse_triple_quoted_string(s, pos):
323
+ if s[pos : pos + 3] == "'''":
324
+ quote_str = "'''"
325
+ elif s[pos : pos + 3] == '"""':
326
+ quote_str = '"""'
327
+ else:
328
+ raise ValueError(f"Expected triple quotes at position {pos}")
329
+ pos += 3
330
+ result = ""
331
+ while pos < len(s):
332
+ if s[pos : pos + 3] == quote_str:
333
+ pos += 3
334
+ # Attempt to convert to a number if possible
335
+ converted_value = convert_value(result)
336
+ return converted_value, pos
337
+ else:
338
+ result += s[pos]
339
+ pos += 1
340
+ raise ValueError("Unterminated triple-quoted string")
341
+
342
+ def parse_value(s, pos):
343
+ pos = skip_whitespace(s, pos)
344
+ if pos >= len(s):
345
+ raise ValueError("Unexpected end of input")
346
+ if s[pos] == "{":
347
+ return parse_object(s, pos)
348
+ elif s[pos] == "[":
349
+ return parse_array(s, pos)
350
+ elif s[pos : pos + 3] in ("'''", '"""'):
351
+ return parse_triple_quoted_string(s, pos)
352
+ elif s[pos] in ('"', "'"):
353
+ return parse_string(s, pos)
354
+ elif s[pos : pos + 4].lower() == "true":
355
+ return True, pos + 4
356
+ elif s[pos : pos + 5].lower() == "false":
357
+ return False, pos + 5
358
+ elif s[pos : pos + 4].lower() == "null":
359
+ return None, pos + 4
360
+ elif s[pos] in "-+0123456789.":
361
+ return parse_number(s, pos)
362
+ else:
363
+ raise ValueError(f"Unexpected character at position {pos}: {s[pos]}")
364
+
365
+ json_start = s.index("{")
366
+ json_end = s.rfind("}")
367
+ s = s[json_start : json_end + 1]
368
+
369
+ s = s.strip()
370
+ result, pos = parse_value(s, 0)
371
+ pos = skip_whitespace(s, pos)
372
+ if pos != len(s):
373
+ raise ValueError(f"Unexpected content at position {pos}")
374
+ return result
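For well-formed model output, the lenient parser above reduces to slicing out the outermost braces and parsing the result. The common path can be sketched as follows; `extract_json_simple` is a hypothetical simplification, not the function above, and unlike it does not tolerate single-quoted or triple-quoted strings:

```python
import json

def extract_json_simple(s: str) -> dict:
    # Slice from the first "{" to the last "}" and parse with the stdlib.
    # The full parser above additionally accepts JSON-ish output that
    # json.loads would reject.
    start, end = s.index("{"), s.rfind("}")
    return json.loads(s[start : end + 1])

raw = 'Sure, here you go:\n{"thought": "done", "current_answer": "42", "response": "The answer is 42."}'
parsed = extract_json_simple(raw)
print(parsed["response"])  # → The answer is 42.
```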
ICL/DAPO/verl-recipe/collabllm/metrics/accuracy.py ADDED
@@ -0,0 +1,104 @@
+# Copyright 2025 CollabLLM team and/or its affiliates
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from recipe.collabllm.utils import extract_json, parse_messages
+
+ACCURACY_PROMPT = '''You are a helpful and meticulous evaluator. Your task is to \
+evaluate the *accuracy* of an AI model's answer to a target question. \
+You will be given the target question, the ground truth answer, and the conversation between the AI and the user.
+
+Provided Information:
+
+<|The Start of Target Question and Ground Truth Answer|>
+Target Question: {single_turn_prompt}
+Ground Truth Answer: {ground_truth}
+<|The End of Target Question and Ground Truth Answer|>
+
+<|The Start of The Conversation|>
+{chat_history}
+<|The End of The Conversation|>
+
+You should determine whether the model's final response to the target question is \
+factually correct and consistent with the provided ground truth.
+
+Rating criteria (binary):
+• 1 = Correct — the response matches the ground truth.
+• 0 = Incorrect — the response contradicts or misses the ground truth.
+
+Output format (JSON):
+{{
+    "thought": "<your reasoning here>",
+    "accuracy": <0 or 1>
+}}
+
+Double check if the JSON object is formatted correctly. Ensure that all fields are present and properly structured. \
+Use " or """ to wrap up the thought and use single quotes inside the "thought" field to avoid JSON escape issues.
+
+Your evaluation:
+'''
+
+
+async def compute_score(data_source, messages, ground_truth, extra_info, **kwargs):
+    # Check if litellm is available, fallback to openai if not
+    try:
+        import litellm
+
+        use_litellm = True
+    except ImportError:
+        # litellm not found, falling back to openai
+        import openai
+
+        use_litellm = False
+
+    chat_history = parse_messages(messages, strip_sys_prompt=True)
+    prompt = ACCURACY_PROMPT.format(
+        single_turn_prompt=extra_info["interaction_kwargs"]["single_turn_prompt"],
+        ground_truth=ground_truth,
+        chat_history=chat_history,
+    )
+
+    if use_litellm:
+        full_response = (
+            (
+                await litellm.acompletion(
+                    messages=[{"role": "user", "content": prompt}],
+                    **kwargs,
+                )
+            )
+            .choices[0]
+            .message.content
+        )
+    else:
+        client = openai.AsyncOpenAI()  # Assumes API key is set in environment
+        full_response = (
+            (
+                await client.chat.completions.create(
+                    messages=[{"role": "user", "content": prompt}],
+                    **kwargs,
+                )
+            )
+            .choices[0]
+            .message.content
+        )
+
+    full_response = extract_json(full_response)
+
+    assert isinstance(full_response, dict), f"Expected a dict, got {type(full_response)}"
+    assert {"accuracy", "thought"}.issubset(full_response.keys()), (
+        f"Expected keys not found from {full_response.keys()}"
+    )
+
+    accuracy = full_response.pop("accuracy")
+    return float(accuracy)
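The tail of `compute_score` reduces to validating the judge's JSON and casting the binary verdict to a float. A minimal sketch with a hypothetical judge output (no API call):

```python
import json

# Hypothetical judge reply following the ACCURACY_PROMPT output format.
judge_output = '{"thought": "The final response matches the ground truth.", "accuracy": 1}'

verdict = json.loads(judge_output)
# Mirror the key check performed in compute_score.
assert {"accuracy", "thought"}.issubset(verdict.keys())
score = float(verdict.pop("accuracy"))
print(score)  # → 1.0
```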
ICL/DAPO/verl-recipe/collabllm/metrics/bleu_score.py ADDED
@@ -0,0 +1,115 @@
+# Copyright 2025 CollabLLM team and/or its affiliates
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from nltk.translate.bleu_score import sentence_bleu
+from recipe.collabllm.utils import extract_json, parse_messages
+
+EXTRACT_MULTITURN_COMPLETION_PROMPT = '''You are a thorough and diligent conversation analyzer. \
+Your task is to extract the final and complete version of a document that was generated during \
+a multiturn conversation between a user and a chat assistant. \
+The extracted content should reflect the final and comprehensive response provided by the assistant \
+based on the user’s request.
+
+You will be provided with the conversation:
+
+<|The Start of The Conversation|>
+{chat_history}
+<|The End of The Conversation|>
+
+Instructions for Extraction:
+
+1. Identify the Most Up-to-Date Content: Review the entire conversation to identify the most updated parts \
+of the content provided by the assistant. This may include:
+   - Different sections of text (e.g., an essay, report, or article).
+
+2. Integrate Revisions: If the assistant made revisions, updates, or added sections throughout the conversation, \
+ensure that these changes are fully integrated into the final content. The goal is to extract a single, cohesive \
+output that incorporates all modifications and additions made during the conversation. For example, if the assistant \
+writes an introduction at the beginning and moves on to the conclusion, the final output should include both the \
+introduction and the conclusion.
+
+3. Focus on Completeness:
+   - For text-based documents: Ensure that the extracted content is comprehensive and represents the full document \
+or section as discussed in the conversation.
+
+You should output a JSON object with two entries:
+- "thought" (str): Output your thought process when extracting the final content.
+  1. How do different parts of the conversation contribute to the final output?
+  2. How do you make sure you included the most updated and complete information?
+  3. How do you make sure you did not include any information that is not necessary?
+- "final_completion" (str): The final and complete version of the document extracted from the conversation.
+
+Note:
+1. If there are multiple lines, you should use triple quotes (""") to wrap the content. For example, \
+"final_completion": """first line.
+second line.""" or "thought": """first line;
+second line.""".
+2. In the "final_completion" entry, replace all double quotes (") with single quotes (') to prevent JSON formatting \
+issues. For example, you can output "final_completion": "'Hello World' is a common phrase."
+
+Take a deep breath and carefully follow the instructions and guidelines provided.
+'''
+
+
+async def compute_score(data_source, messages, ground_truth, extra_info, **kwargs):
+    # Check if litellm is available, fallback to openai if not
+    try:
+        import litellm
+
+        use_litellm = True
+    except ImportError:
+        # litellm not found, falling back to openai
+        import openai
+
+        use_litellm = False
+
+    chat_history = parse_messages(messages, strip_sys_prompt=True)
+    prompt = EXTRACT_MULTITURN_COMPLETION_PROMPT.format(chat_history=chat_history)
+
+    if use_litellm:
+        full_response = (
+            (
+                await litellm.acompletion(
+                    messages=[{"role": "user", "content": prompt}],
+                    **kwargs,
+                )
+            )
+            .choices[0]
+            .message.content
+        )
+    else:
+        client = openai.AsyncOpenAI()  # Assumes API key is set in environment
+        full_response = (
+            (
+                await client.chat.completions.create(
+                    messages=[{"role": "user", "content": prompt}],
+                    **kwargs,
+                )
+            )
+            .choices[0]
+            .message.content
+        )
+
+    full_response = extract_json(full_response)
+
+    assert isinstance(full_response, dict), f"Expected a dict, got {type(full_response)}"
+    assert {"final_completion", "thought"}.issubset(full_response.keys()), (
+        f"Expected keys not found from {full_response.keys()}"
+    )
+
+    final_completion = full_response.pop("final_completion")
+
+    # sentence_bleu expects tokenized references and hypothesis, not raw strings
+    bleu = sentence_bleu([ground_truth.split()], final_completion.split())
+    return float(bleu)
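NLTK's `sentence_bleu` expects a list of token-list references and a token-list hypothesis (hence the `.split()` calls). A minimal stand-in computing unigram precision only (real BLEU also uses higher-order n-grams and a brevity penalty) illustrates the expected shapes; `unigram_precision` is a hypothetical helper, not part of the file above:

```python
from collections import Counter

def unigram_precision(references, hypothesis):
    # references: list of token lists; hypothesis: token list,
    # matching the argument shapes sentence_bleu expects.
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(references[0])
    overlap = sum(min(c, ref_counts[tok]) for tok, c in hyp_counts.items())
    return overlap / max(len(hypothesis), 1)

ground_truth = "the quick brown fox"
final_completion = "the fast brown fox"
score = unigram_precision([ground_truth.split()], final_completion.split())
print(score)  # → 0.75
```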
ICL/DAPO/verl-recipe/collabllm/metrics/interactivity.py ADDED
@@ -0,0 +1,108 @@
+# Copyright 2025 CollabLLM team and/or its affiliates
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from recipe.collabllm.utils import extract_json, parse_messages
+
+INTERACTIVITY_PROMPT = '''You are a helpful and meticulous conversation evaluator. \
+Your task is to evaluate the interactivity of the responses provided by an AI assistant \
+to user questions in a given conversation:
+
+<|The Start of the Conversation to be Evaluated|>
+{chat_history}
+<|The End of the Conversation to be Evaluated|>
+
+You should assess the assistant's engagement, clarity, and ability to understand the user's needs. \
+Give a float number between 0 and 1.
+
+Scoring Criteria:
+- Let U = user understanding & response clarity ∈ [0,1]
+  - 1.0 = Fully understands the user's intent and gives a clear answer.
+  - 0.7 = Mostly understands and the answer is generally clear.
+  - 0.3 = Partially misunderstands or the answer is hard to follow.
+  - 0.0 = Misunderstands the intent and gives an unclear or irrelevant answer.
+- Let Q = clarification ∈ [0,1]
+  - 1.0 = Asks precise, necessary clarifying questions when needed.
+  - 0.7 = Asks somewhat helpful but incomplete clarifications.
+  - 0.3 = Only asks generic questions (e.g., “Does that help?”).
+  - 0.0 = Asks no clarifying questions when needed.
+- Let S = suggestion helpfulness ∈ [0,1]
+  - 1.0 = Provides useful, actionable suggestions.
+  - 0.7 = Suggestions are somewhat helpful but limited.
+  - 0.3 = Suggestions are vague or generic.
+  - 0.0 = No suggestions when they would clearly help.
+score = average([U, Q, S])
+
+Output format (JSON):
+{{
+    "thought": "<How interactive is the assistant?>",
+    "interactivity": <score>
+}}
+
+Double check if the JSON object is formatted correctly. Ensure that all fields are present and properly structured. \
+Use " or """ to wrap up the thought. You should not use other triple quotes inside the "thought" field. \
+Instead you should use single quotes to avoid JSON escape issues.
+
+Your evaluation:
+'''
+
+
+async def compute_score(data_source, messages, ground_truth, extra_info, **kwargs):
+    # Check if litellm is available, fallback to openai if not
+    try:
+        import litellm
+
+        use_litellm = True
+    except ImportError:
+        # litellm not found, falling back to openai
+        import openai
+
+        use_litellm = False
+
+    chat_history = parse_messages(messages, strip_sys_prompt=True)
+    prompt = INTERACTIVITY_PROMPT.format(chat_history=chat_history)
+
+    if use_litellm:
+        full_response = (
+            (
+                await litellm.acompletion(
+                    messages=[{"role": "user", "content": prompt}],
+                    **kwargs,
+                )
+            )
+            .choices[0]
+            .message.content
+        )
+    else:
+        client = openai.AsyncOpenAI()  # Assumes API key is set in environment
+        full_response = (
+            (
+                await client.chat.completions.create(
+                    messages=[{"role": "user", "content": prompt}],
+                    **kwargs,
+                )
+            )
+            .choices[0]
+            .message.content
+        )
+
+    full_response = extract_json(full_response)
+
+    assert isinstance(full_response, dict), f"Expected a dict, got {type(full_response)}"
+    assert {"interactivity", "thought"}.issubset(full_response.keys()), (
+        f"Expected keys not found from {full_response.keys()}"
+    )
+
+    interactivity = full_response.pop("interactivity")
+    return float(interactivity)
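Since `float(interactivity)` comes straight from an LLM judge, values can drift slightly outside the rubric's [0, 1] range. A hypothetical guard (not present in the file above) would clamp the score before averaging it into a reward:

```python
def clamp_score(x: float) -> float:
    # Clamp a judge-produced score into the rubric's [0, 1] range.
    return min(max(float(x), 0.0), 1.0)

print(clamp_score(1.2), clamp_score(0.65), clamp_score(-0.1))  # → 1.0 0.65 0.0
```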
ICL/DAPO/verl-recipe/collabllm/metrics/pass_rate.py ADDED
@@ -0,0 +1,138 @@
+# Copyright 2025 CollabLLM team and/or its affiliates
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from bigcodebench.eval import untrusted_check
+from recipe.collabllm.utils import extract_json, parse_messages
+
+EXTRACT_MULTITURN_COMPLETION_PROMPT = '''You are a thorough and diligent conversation analyzer. \
+Your task is to extract the final and complete version of a code function {entry_point} that was generated \
+during a multiturn conversation between a user and a chat assistant. \
+The extracted content should reflect the final and comprehensive response provided by the \
+assistant based on the user’s request.
+
+You will be provided with the task and the conversation:
+
+<|The Start of The Task|>
+{single_turn_prompt}
+<|The End of The Task|>
+
+<|The Start of The Conversation|>
+{chat_history}
+<|The End of The Conversation|>
+
+Instructions for Extraction:
+
+1. Identify the Most Up-to-Date Content: Review the entire conversation to identify the most updated parts of \
+the content provided by the assistant. This may include:
+   - Different parts of the code snippet, function, class, or script.
+
+2. Integrate Revisions: If the assistant made revisions, updates, or added sections throughout the conversation, \
+ensure that these changes are fully integrated into the final content. The goal is to extract a single, cohesive \
+output that incorporates all modifications and additions made during the conversation. For example, if the assistant \
+writes a function at the beginning and changes a part, the final output should take the modification into account.
+
+3. Focus on Completeness:
+   - For code: Extract a complete and functional code snippet, including all necessary components such as imports, \
+functions, classes, and any other essential elements. The code should be runnable, but you do not need to \
+include any testing examples including the contents after `if __name__ == "__main__":`. Only the function code \
+is required.
+
+You should output a JSON object with two entries:
+- "thought" (str): Output your thought process when extracting the final content.
+  1. How do different parts of the conversation contribute to the final output?
+  2. How do you make sure you included the most updated and complete information?
+  3. How do you make sure you did not include any information that is not necessary?
+- "final_completion" (str): The final and complete version of the code extracted from the conversation. \
+Rename the main function for the task to {entry_point} if needed. Remove any comments wrapped by """.
+
+Note:
+1. If there are multiple lines, you should use triple quotes (""") to wrap the content. For example, \
+"final_completion": """first line.
+second line.""" or "thought": """first line;
+second line.""". You should not use other triple quotes inside.
+2. In the "final_completion" entry, replace all double quotes (") with single quotes (') to prevent JSON formatting \
+issues. For example, you can output "final_completion": "'Hello World' is a common phrase."
+
+Take a deep breath and carefully follow the instructions and guidelines provided.
+'''
+
+
+async def compute_score(data_source, messages, ground_truth, extra_info, **kwargs):
+    # Check if litellm is available, fallback to openai if not
+    try:
+        import litellm
+
+        use_litellm = True
+    except ImportError:
+        # litellm not found, falling back to openai
+        import openai
+
+        use_litellm = False
+
+    chat_history = parse_messages(messages, strip_sys_prompt=True)
+
+    prompt = EXTRACT_MULTITURN_COMPLETION_PROMPT.format(
+        chat_history=chat_history,
+        single_turn_prompt=extra_info["interaction_kwargs"]["single_turn_prompt"],
+        entry_point=extra_info["single_turn_metadata"]["entry_point"],
+    )
+
+    if use_litellm:
+        full_response = (
+            (
+                await litellm.acompletion(
+                    messages=[{"role": "user", "content": prompt}],
+                    **kwargs,
+                )
+            )
+            .choices[0]
+            .message.content
+        )
+    else:
+        client = openai.AsyncOpenAI()  # Assumes API key is set in environment
+        full_response = (
+            (
+                await client.chat.completions.create(
+                    messages=[{"role": "user", "content": prompt}],
+                    **kwargs,
+                )
+            )
+            .choices[0]
+            .message.content
+        )
+
+    full_response = extract_json(full_response)
+
+    assert isinstance(full_response, dict), f"Expected a dict, got {type(full_response)}"
+    assert {"final_completion", "thought"}.issubset(full_response.keys()), (
+        f"Expected keys not found from {full_response.keys()}"
+    )
+
+    final_completion = full_response.pop("final_completion")
+    metadata = extra_info["single_turn_metadata"]
+    res = untrusted_check(
+        final_completion,
+        metadata["test"],
+        metadata["entry_point"],
+        max_as_limit=300 * 1024,
+        max_data_limit=300 * 1024,
+        max_stack_limit=300 * 1024,
+        min_time_limit=60,
+        gt_time_limit=60,
+    )
+    passed = res[0] == "pass"
+
+    # info = res[1]  # for printing extra info
+    return float(passed)
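The extraction prompt asks the judge to drop test scaffolding such as `if __name__ == "__main__":` blocks. A hypothetical post-processing helper (not part of the file above) sketches the same cleanup in code, as a safety net for when the judge leaves the guard in:

```python
def strip_main_guard(code: str) -> str:
    # Drop everything from an `if __name__ == ...` guard onward,
    # keeping only the function definitions the checker needs.
    lines = code.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("if __name__"):
            return "\n".join(lines[:i]).rstrip() + "\n"
    return code

sample = 'def add(a, b):\n    return a + b\n\nif __name__ == "__main__":\n    print(add(1, 2))\n'
print(strip_main_guard(sample))
```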
ICL/DAPO/verl-recipe/collabllm/process_dataset.py ADDED
@@ -0,0 +1,239 @@
1
+ # Copyright 2025 CollabLLM team and/or its affiliates
2
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates
3
+
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ #!/usr/bin/env python3
17
+ """
18
+ # available datasets:
19
+ # math-hard(-large), medium(-large), bigcodebench(-large)
20
+ # to create your own dataset, refer to https://github.com/Wuyxin/collabllm
21
+
22
+ DATASET=math-hard-large
23
+
24
+ python recipe/collabllm/process_dataset.py \
25
+ --dataset collabllm/collabllm-multiturn-$DATASET \
26
+ --local_dir $HOME/data/collabllm-$DATASET \
27
+ --dataset_type sft
28
+
29
+ python recipe/collabllm/process_dataset.py \
30
+ --dataset collabllm/collabllm-multiturn-$DATASET \
31
+ --local_dir $HOME/data/collabllm-$DATASET \
32
+ --dataset_type rl
33
+
34
+
35
+ Preprocess collabllm/collabllm-multiturn-math-hard into (ground_truth, extra_info).
36
+
37
+ - ground_truth: picked from --prefer_field (default: single_turn_completion),
38
+ falling back to --fallback_field (default: completion)
39
+ - extra_info: a shallow copy of the original example plus bookkeeping fields
40
+ - reward_model: {"style": "rule", "ground_truth": ground_truth}
41
+
42
+ Saves one parquet per split into --local_dir and a small JSON preview.
43
+ """
44
+
45
+ import argparse
46
+ import json
47
+ import os
48
+ import uuid
49
+ from typing import Any, Optional
50
+
51
+ from datasets import Dataset, concatenate_datasets, load_dataset
52
+
53
+ SYSTEM_PROMPT = """The assistant is designed to be helpful, proactive, and highly interactive.
54
+
55
+ The assistant strives to accurately interpret the user's intent throughout the conversation, acknowledging previous
56
+ interactions to maintain context and continuity. If the user's message is unclear or lacks necessary details, the
57
+ assistant always asks for clarification rather than making assumptions. For example, if the user's request is
58
+ incomplete, the assistant responds with: "Could you provide more details so I can assist you better?"
59
+
60
+ The assistant asks specific follow-up questions and offers suggestions based on the user's needs, avoiding vague or
61
+ generic prompts. It proactively provides guidance and potential next steps, especially in complex tasks such as
62
+ writing, analysis, coding, and question answering.
63
+
64
+ The assistant is mindful of how much content the user needs to read or type, keeping interactions concise and
65
+ efficient. It reduces unnecessary repetition and ensures responses are relevant, well-structured, and free from
66
+ errors. When presenting options or asking for feedback, the assistant simplifies interactions by offering
67
+ multiple-choice answers or specific suggestions to make it easier for the user to respond quickly.
68
+
69
+ The assistant adapts its tone to align with the user's emotional state and style, adjusting its approach as needed.
70
+ If uncertain about something, the assistant honestly says, "I don't know," and suggests ways for the user to find
71
+ the information.
72
+
73
+ The assistant provides factually accurate, coherent, and relevant responses, using proper grammar and structure. It
74
+ remains interactive and proactive across all tasks, continually seeking feedback to refine and improve
75
+ interactions."""
76
+
77
+
78
+ # Required fields: "prompt", "ground_truth", "extra_info"
79
+ # In "extra_info" dict:
80
+ # (1) Required: "single_turn_prompt", which is the specific problem used to inform the user simulator,
81
+ # (2) Optional: "task_desc" (a short task description),
82
+ # (3) Optional: other fields for customized reward computation
83
+ def collapse_example(example: dict[str, Any]) -> dict[str, Any]:
84
+ if "prompt" not in example:
85
+ raise ValueError("Missing required 'prompt' field.")
86
+
87
+ ground_truth = (
88
+ example.get("ground_truth") or example.get("single_turn_completion") or example.get("completion") or ""
89
+ )
90
+
91
+ extra_info = {}
92
+ for k, v in example.items():
93
+ if k in ("prompt", "ground_truth", "extra_info"):
94
+ continue
95
+ extra_info.setdefault(k, v) # keep extra_info values if keys overlap
96
+
97
+ # make sure extra_info has the required fields
98
+ assert "single_turn_prompt" in extra_info, "Missing 'single_turn_prompt' in extra_info."
99
+
100
+ # add system prompt as the beginning of the list
101
+ example["prompt"] = [{"role": "system", "content": SYSTEM_PROMPT}] + example["prompt"]
102
+
103
+ extra_info.setdefault("prompt", example["prompt"]) # save the original prompt
104
+ extra_info.setdefault(
105
+ "interaction_kwargs",
106
+ {
107
+ "name": "collabllm",
108
+ "single_turn_prompt": extra_info.pop("single_turn_prompt"),
109
+ "task_desc": extra_info.pop("task_desc", "general ask-for-assistance task"),
110
+ },
111
+ )
112
+ return {
113
+ "prompt": example["prompt"],
114
+ "ground_truth": ground_truth,
115
+ "raw_prompt": example["prompt"], # save the original prompt
116
+ "extra_info": extra_info,
117
+ "reward_model": {"style": "rule", "ground_truth": ground_truth},
118
+ "data_source": "collabllm",
119
+ "agent_name": "collabllm_agent",
120
+ "index": str(uuid.uuid4()),
121
+ }
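For reference, here is a hypothetical minimal input row matching the schema described in the comments above; all field names follow the requirements stated there, but the values are made up for illustration.

```python
# Hypothetical input row for collapse_example; values are illustrative only.
row = {
    "prompt": [{"role": "user", "content": "Help me solve x^2 = 9."}],
    "ground_truth": "x = 3 or x = -3",
    # Required: moved by collapse_example into extra_info["interaction_kwargs"].
    "single_turn_prompt": "Solve x^2 = 9.",
    # Optional short task description, also moved into interaction_kwargs.
    "task_desc": "solving math problems",
}
```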
122
+
123
+
124
+ # ---------- IO helpers ----------
125
+ def save_parquet(ds_split: Dataset, filename: str, out_dir: str) -> None:
126
+ os.makedirs(out_dir, exist_ok=True)
127
+ path = os.path.join(out_dir, f"{filename}.parquet")
128
+ ds_split.to_parquet(path)
129
+ print(f"[OK] Wrote {filename}.parquet → {path} ({len(ds_split)} rows)")
130
+
131
+
132
+ def maybe_copy_to_hdfs(local_dir: str, hdfs_dir: Optional[str]) -> None:
133
+ if not hdfs_dir:
134
+ return
135
+ try:
136
+ from verl.utils.hdfs_io import copy, makedirs # type: ignore
137
+ except Exception as e:
138
+ print(f"[WARN] Skipping HDFS copy (verl not available): {e}")
139
+ return
140
+ makedirs(hdfs_dir)
141
+ copy(src=local_dir, dst=hdfs_dir)
142
+ print(f"[OK] Copied {local_dir} → {hdfs_dir}")
143
+
144
+
145
+ # ---------- Main ----------
146
+ def main():
147
+ ap = argparse.ArgumentParser()
148
+ ap.add_argument(
149
+ "--dataset", default="collabllm/collabllm-multiturn-math-hard", help="HF dataset path or local dir/file."
150
+ )
151
+ ap.add_argument("--task_desc", default="solving math problems", help="Task description for the dataset.")
152
+ ap.add_argument("--local_dir", default="~/data/collabllm-math-hard", help="Output directory.")
153
+ ap.add_argument("--hdfs_dir", default=None, help="Optional HDFS destination (requires verl).")
154
+ ap.add_argument(
155
+ "--validation_size", type=float, default=0.1, help="Validation split size (fraction or absolute int)."
156
+ )
157
+ ap.add_argument("--seed", type=int, default=42, help="Random seed for splitting.")
158
+ ap.add_argument("--num_proc", type=int, default=1, help="Parallel workers for map().")
159
+ ap.add_argument("--dataset_type", default="rl", choices=["rl", "sft"], help="Type of dataset (e.g., 'rl', 'sft').")
160
+ args = ap.parse_args()
161
+
162
+ out_dir = os.path.expanduser(args.local_dir)
163
+ os.makedirs(out_dir, exist_ok=True)
164
+
165
+ print(f"[INFO] Loading dataset: {args.dataset}")
166
+ ds_dict = load_dataset(args.dataset)
167
+ parts = list(ds_dict.values())
168
+ ds_all: Dataset = parts[0] if len(parts) == 1 else concatenate_datasets(parts)
169
+ # Dataset({
170
+ # features: ['prompt', 'completion', 'conv_id', 'score', 'single_turn_prompt',
171
+ # 'single_turn_completion', 'single_turn_metadata', 'turn_id', 'sessions', 'rewards'],
172
+ # num_rows: xxx
173
+ # })
174
+
175
+ if args.dataset_type == "rl":
176
+ # If multiple splits exist, merge them before collapsing/splitting.
177
+ ds_all = ds_all.map(lambda x: {"task_desc": args.task_desc}, num_proc=args.num_proc)
178
+
179
+ print(f"[INFO] Collapsing to formatted fields on {len(ds_all)} rows…")
180
+ ds_all = ds_all.map(
181
+ function=collapse_example,
182
+ remove_columns=ds_all.column_names,
183
+ num_proc=args.num_proc,
184
+ )
185
+
186
+ def dedup_by_prompt(dataset):
187
+ seen = set()
188
+ unique_rows = []
189
+ for ex in dataset:
190
+ prompt_key = json.dumps(ex["prompt"], sort_keys=True, ensure_ascii=False)
191
+ if prompt_key not in seen:
192
+ seen.add(prompt_key)
193
+ unique_rows.append(ex)
194
+ return Dataset.from_list(unique_rows)
195
+
196
+ ds_all = dedup_by_prompt(ds_all)
197
+
198
+ elif args.dataset_type == "sft":
199
+ df = ds_all.to_pandas()
200
+
201
+ # Sort so that within each conv_id the highest turn_id is first,
202
+ # and if multiple rows share the same turn_id, the highest score comes first
203
+ df = df.sort_values(["conv_id", "turn_id", "score"], ascending=[True, False, False])
204
+
205
+ # Keep only the top row per conv_id
206
+ df = df.drop_duplicates(subset="conv_id", keep="first")
207
+
208
+ # Back to HF Dataset
209
+ ds_all = Dataset.from_pandas(df, preserve_index=False)
210
+
211
+ # Append assistant response into prompt list
212
+ def append_completion(example):
213
+ example["prompt"] = (
214
+ [{"role": "system", "content": SYSTEM_PROMPT}]
215
+ + example["prompt"]
216
+ + [{"role": "assistant", "content": example["completion"]}]
217
+ )
218
+ return example
219
+
220
+ ds_all = ds_all.map(append_completion)
221
+
222
+ # Keep only prompt column
223
+ cols_to_remove = [col for col in ds_all.column_names if col != "prompt"]
224
+ ds_all = ds_all.remove_columns(cols_to_remove)
225
+
226
+ print(f"[INFO] Splitting with validation_size={args.validation_size}, seed={args.seed}")
227
+ split = ds_all.train_test_split(test_size=args.validation_size, seed=args.seed, shuffle=True)
228
+ train_ds, val_ds = split["train"], split["test"]
229
+ print(train_ds, val_ds)
230
+
231
+ save_parquet(train_ds, f"{args.dataset_type}_train", out_dir)
232
+ save_parquet(val_ds, f"{args.dataset_type}_validation", out_dir)
233
+
234
+ maybe_copy_to_hdfs(local_dir=out_dir, hdfs_dir=args.hdfs_dir)
235
+ print(f"[DONE] {args.dataset_type}_train.parquet and {args.dataset_type}_validation.parquet written.")
236
+
237
+
238
+ if __name__ == "__main__":
239
+ main()
ICL/DAPO/verl-recipe/collabllm/reward_function.py ADDED
@@ -0,0 +1,227 @@
1
+ # Copyright 2025 CollabLLM team and/or its affiliates
2
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates
3
+
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import asyncio
17
+ import importlib.util
18
+ import os
19
+ import sys
20
+ from typing import Any, Callable, Optional
21
+
22
+ import litellm
23
+ import torch
24
+ from transformers import PreTrainedTokenizer
25
+
26
+ from verl import DataProto
27
+ from verl.utils.reward_score import default_compute_score
28
+ from verl.workers.reward_manager import register
29
+ from verl.workers.reward_manager.abstract import AbstractRewardManager
30
+
31
+ TERMINATION_SIGNAL = "[[TERMINATE CHAT]]"
32
+
33
+
34
+ async def conversation_level_reward_func(
35
+ data_source, messages, ground_truth, extra_info, metrics, **kwargs
36
+ ) -> dict[str, torch.Tensor]:
37
+ """
38
+ Async version of conversation-level reward function.
39
+
40
+ Apply conversation-level reward function to the future interactions between the user simulator
41
+ and policy model, which are generated from `verl/interactions/collabllm_interation.py`
42
+ """
43
+ num_retries = kwargs.get("num_retries", 6)
44
+
45
+ rewards = {}
46
+ for metric in metrics:
47
+ current_dir = os.path.dirname(os.path.abspath(__file__))
48
+ metric_file_path = os.path.join(current_dir, f"metrics/{metric}.py")
49
+
50
+ if not os.path.exists(metric_file_path):
51
+ print(f"Error: Metric file '{metric_file_path}' not found. Assigning 0 to metric '{metric}'.")
52
+ rewards[metric] = 0.0
53
+ continue
54
+
55
+ spec = importlib.util.spec_from_file_location(f"metric_{metric}", metric_file_path)
56
+ if spec is None:
57
+ print(f"Error: Could not create spec for metric '{metric}'. Assigning 0 to metric '{metric}'.")
58
+ rewards[metric] = 0.0
59
+ continue
60
+
61
+ module = importlib.util.module_from_spec(spec)
62
+
63
+ try:
64
+ sys.modules[f"metric_{metric}"] = module
65
+ assert spec.loader is not None
66
+ spec.loader.exec_module(module)
67
+ except Exception as e:
68
+ print(f"Error loading metric module from '{metric_file_path}': {e}. Assigning 0 to metric '{metric}'.")
69
+ rewards[metric] = 0.0
70
+ continue
71
+
72
+ # Assume each metric file has a compute_score function
73
+ if not hasattr(module, "compute_score"):
74
+ print(
75
+ f"Error: Function 'compute_score' not found in '{metric_file_path}'. Assigning 0 to metric '{metric}'."
76
+ )
77
+ rewards[metric] = 0.0
78
+ continue
79
+
80
+ compute_score_fn = module.compute_score
81
+
82
+ # Retry mechanism for calling the metric function
83
+ for attempt in range(num_retries):
84
+ try:
85
+ # Call the metric function (await if it's async)
86
+ if asyncio.iscoroutinefunction(compute_score_fn):
87
+ rewards[metric] = await compute_score_fn(data_source, messages, ground_truth, extra_info, **kwargs)
88
+ else:
89
+ rewards[metric] = compute_score_fn(data_source, messages, ground_truth, extra_info, **kwargs)
90
+ break # Success, exit retry loop
91
+ except Exception as e:
92
+ if attempt == num_retries - 1: # Last attempt
93
+ print(
94
+ f"Error: Failed to compute metric '{metric}' after {num_retries} attempts. "
95
+ f"Last error: {e}. Assigning 0 to metric '{metric}'."
96
+ )
97
+ rewards[metric] = 0.0
98
+ else:
99
+ print(f"Attempt {attempt + 1} failed for metric '{metric}': {e}. Retrying...")
100
+ if isinstance(e, litellm.RateLimitError):
101
+ await asyncio.sleep(max(2**attempt, 60)) # Exponential backoff
102
+
103
+ # Return dict with metric names as keys
104
+ return {metric: torch.tensor(reward, dtype=torch.float32) for metric, reward in rewards.items()}
105
+
106
+
107
+ @register("collabllm")
108
+ class CollabLLMRewardManager(AbstractRewardManager):
109
+ """
110
+ The Reward Manager used in https://github.com/Wuyxin/collabllm/
111
+ """
112
+
113
+ def __init__(
114
+ self,
115
+ tokenizer: PreTrainedTokenizer,
116
+ num_examine: int,
117
+ metric_weights: dict,
118
+ llm_judge_kwargs: dict,
119
+ reward_fn_key: str = "data_source",
120
+ compute_score: Optional[Callable] = None,
121
+ normalize_by_data_source=False,
122
+ ) -> None:
123
+ self.tokenizer = tokenizer
124
+ self.num_examine = num_examine # the number of batches of decoded responses to print to the console
125
+ self.compute_score = compute_score or default_compute_score
126
+ self.reward_fn_key = reward_fn_key
127
+
128
+ self.metric_weights = metric_weights
129
+ self.llm_judge_kwargs = llm_judge_kwargs
130
+ self.normalize_by_data_source = normalize_by_data_source
131
+
132
+ self.metrics = list(self.metric_weights.keys())
133
+
134
+ def __call__(self, data: DataProto, return_dict: bool = False) -> torch.Tensor | dict[str, Any]:
135
+ # If there is rm score, we directly return rm score. Otherwise, we compute via rm_score_fn
136
+ if "rm_scores" in data.batch.keys():
137
+ if return_dict:
138
+ return {"reward_tensor": data.batch["rm_scores"]}
139
+ else:
140
+ return data.batch["rm_scores"]
141
+ # Use thread-compatible async loop management instead of asyncio.run()
142
+ loop = asyncio.new_event_loop()
143
+ asyncio.set_event_loop(loop)
144
+ try:
145
+ return loop.run_until_complete(self._compute_rewards_async(data, return_dict))
146
+ finally:
147
+ loop.close()
148
+
149
+ async def _compute_rewards_async(self, data: DataProto, return_dict: bool = False) -> torch.Tensor | dict[str, Any]:
150
+ # batched scoring
151
+ prompt_ids = data.batch["prompts"]
152
+ prompt_length = prompt_ids.shape[-1]
153
+ valid_response_length = data.batch["attention_mask"][:, prompt_length:].sum(dim=-1)
154
+
155
+ data_source = data.non_tensor_batch["data_source"]
156
+ ground_truth = data.non_tensor_batch["ground_truth"]
157
+ extra_info = data.non_tensor_batch["extra_info"]
158
+ message_lst = data.non_tensor_batch["messages"]
159
+
160
+ # batch the messages into multiple
161
+ num_repeat_rollouts = len(message_lst[0]["messages"])
162
+ batch_size = len(data_source)
163
+
164
+ grouped_messages = [
165
+ [message_lst[i]["messages"][j] for i in range(len(message_lst))] for j in range(num_repeat_rollouts)
166
+ ]
167
+
168
+ # Flatten lists for all batch items across all rollouts
169
+ flattened_data_sources = [data_source[i] for _ in range(num_repeat_rollouts) for i in range(batch_size)]
170
+ flattened_ground_truths = [ground_truth[i] for _ in range(num_repeat_rollouts) for i in range(batch_size)]
171
+ flattened_extra_infos = [extra_info[i] for _ in range(num_repeat_rollouts) for i in range(batch_size)]
172
+ flattened_messages = [grouped_messages[j][i] for j in range(num_repeat_rollouts) for i in range(batch_size)]
173
+
174
+ if num_repeat_rollouts > 0:
175
+ tasks = [
176
+ self.compute_score(
177
+ flattened_data_sources[i],
178
+ flattened_messages[i],
179
+ flattened_ground_truths[i],
180
+ flattened_extra_infos[i],
181
+ self.metrics,
182
+ **self.llm_judge_kwargs,
183
+ )
184
+ for i in range(len(flattened_data_sources))
185
+ ]
186
+ score_dicts = await asyncio.gather(*tasks)
187
+
188
+ # Aggregate scores for each metric across repeated rollouts
189
+ scores_by_metrics = {
190
+ metric: torch.stack([score_dict[metric] for score_dict in score_dicts])
191
+ .view(num_repeat_rollouts, -1)
192
+ .sum(dim=0)
193
+ for metric in self.metrics
194
+ }
195
+
196
+ # Apply metric-specific weights
197
+ weighted_scores_by_metrics = {
198
+ metric: torch.clamp(
199
+ scores_by_metrics[metric] * self.metric_weights[metric] / num_repeat_rollouts,
200
+ min=-1.0,
201
+ max=1.0,
202
+ )
203
+ for metric in self.metrics
204
+ }
205
+ # Compute mean of weighted scores for each metric
206
+ mean_weighted_scores_by_metrics = {
207
+ metric: weighted_scores_by_metrics[metric].mean(dim=0) for metric in self.metrics
208
+ }
209
+
210
+ # Combine weighted scores from all metrics into a single tensor
211
+ scores = torch.stack([weighted_scores_by_metrics[metric] for metric in self.metrics]).sum(dim=0)
212
+ else:
213
+ score_dicts = []
214
+ scores = torch.full((batch_size,), 0.0, dtype=torch.float32, device=prompt_ids.device)
215
+ mean_weighted_scores_by_metrics = {metric: 0.0 for metric in self.metrics}
216
+
217
+ print("Scores:", scores, mean_weighted_scores_by_metrics)
218
+
219
+ reward_tensor = torch.zeros_like(data.batch["responses"], dtype=torch.float32)
220
+
221
+ for i in range(len(data)):
222
+ reward_tensor[i, valid_response_length[i].item() - 1] = scores[i]
223
+
224
+ if return_dict:
225
+ return {"reward_tensor": reward_tensor}
226
+ else:
227
+ return reward_tensor
ICL/DAPO/verl-recipe/collabllm/train_rl_collabllm.sh ADDED
@@ -0,0 +1,76 @@
1
+ # Usage: sh recipe/collabllm/train_rl_collabllm.sh <optional resume path>
2
+
3
+ set -x
4
+
5
+ PROJECT_DIR="$(pwd)"
6
+ export VLLM_USE_V1=1
7
+
8
+ RESUME_PATH="${1:-}"
9
+
10
+ if [ -z "$RESUME_PATH" ]; then
11
+ RESUME_PATH=null
12
+ fi
13
+
14
+ DATASET=math-hard-large
15
+ PROJECT_DIR="$(pwd)"
16
+ AGENTLOOP_CONFIG_PATH="$PROJECT_DIR/recipe/collabllm/config/agent.yaml"
17
+
18
+
19
+ python3 -m verl.trainer.main_ppo \
20
+ trainer.val_before_train=False \
21
+ algorithm.adv_estimator=grpo \
22
+ data.train_files=$HOME/data/collabllm-$DATASET/rl_train.parquet \
23
+ data.val_files=$HOME/data/collabllm-$DATASET/rl_validation.parquet \
24
+ reward_model.reward_manager=collabllm \
25
+ +reward_model.reward_kwargs.metric_weights.accuracy=1 \
26
+ +reward_model.reward_kwargs.metric_weights.interactivity=1 \
27
+ +reward_model.reward_kwargs.metric_weights.token_amount=-0.0001 \
28
+ +reward_model.reward_kwargs.llm_judge_kwargs.model=gpt-4o-mini \
29
+ +reward_model.reward_kwargs.llm_judge_kwargs.max_tokens=2048 \
30
+ +reward_model.reward_kwargs.llm_judge_kwargs.temperature=0 \
31
+ data.train_batch_size=16 \
32
+ data.max_prompt_length=8196 \
33
+ data.max_response_length=2048 \
34
+ data.filter_overlong_prompts=True \
35
+ data.truncation='error' \
36
+ actor_rollout_ref.model.path="Qwen/Qwen2.5-7B-Instruct" \
37
+ actor_rollout_ref.actor.optim.lr=1e-6 \
38
+ actor_rollout_ref.model.use_remove_padding=True \
39
+ actor_rollout_ref.actor.ppo_mini_batch_size=8 \
40
+ actor_rollout_ref.actor.use_dynamic_bsz=True \
41
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
42
+ actor_rollout_ref.actor.use_kl_loss=True \
43
+ actor_rollout_ref.actor.kl_loss_coef=0.001 \
44
+ actor_rollout_ref.actor.kl_loss_type=low_var_kl \
45
+ actor_rollout_ref.actor.entropy_coeff=0 \
46
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
47
+ actor_rollout_ref.actor.fsdp_config.param_offload=True \
48
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
49
+ actor_rollout_ref.rollout.name=vllm \
50
+ actor_rollout_ref.rollout.mode=async \
51
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
52
+ actor_rollout_ref.rollout.n=8 \
53
+ actor_rollout_ref.rollout.temperature=1.0 \
54
+ actor_rollout_ref.rollout.free_cache_engine=True \
55
+ actor_rollout_ref.rollout.multi_turn.enable=true \
56
+ actor_rollout_ref.rollout.multi_turn.format=hermes \
57
+ actor_rollout_ref.rollout.multi_turn.max_user_turns=2 \
58
+ actor_rollout_ref.rollout.multi_turn.max_assistant_turns=3 \
59
+ actor_rollout_ref.rollout.multi_turn.num_repeat_rollouts=3 \
60
+ actor_rollout_ref.rollout.agent.agent_loop_config_path=$AGENTLOOP_CONFIG_PATH \
61
+ actor_rollout_ref.ref.fsdp_config.param_offload=True \
62
+ algorithm.use_kl_in_reward=False \
63
+ trainer.critic_warmup=0 \
64
+ trainer.logger='["console", "wandb"]' \
65
+ trainer.project_name=verlxcollabllm \
66
+ trainer.experiment_name=collabllm-qwen2.5-7B-$DATASET \
67
+ trainer.nnodes=1 \
68
+ trainer.n_gpus_per_node=8 \
69
+ actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
70
+ trainer.save_freq=100 \
71
+ trainer.test_freq=10 \
72
+ trainer.total_epochs=20 \
73
+ custom_reward_function.path=recipe/collabllm/reward_function.py \
74
+ custom_reward_function.name=conversation_level_reward_func \
75
+ actor_rollout_ref.rollout.multi_turn.interaction_config_path="$PROJECT_DIR/recipe/collabllm/config/collabllm_interaction_config.yaml" \
76
+ trainer.resume_from_path=$RESUME_PATH
ICL/DAPO/verl-recipe/collabllm/train_sft_collabllm.sh ADDED
@@ -0,0 +1,32 @@
1
+ #!/bin/bash
2
+ set -x
3
+
4
+ if [ "$#" -lt 1 ]; then
5
+ echo "Usage: sft_train_collabllm.sh [<nproc_per_node> other_configs...]"
6
+ exit 1
7
+ fi
8
+
9
+ nproc_per_node=$1
10
+
11
+ # Shift the arguments so $@ refers to the rest
12
+ shift 1
13
+
14
+ DATASET=math-hard-large
15
+
16
+ torchrun --nnodes=1 --nproc_per_node=$nproc_per_node \
17
+ -m verl.trainer.fsdp_sft_trainer \
18
+ data.train_files=$HOME/data/collabllm-$DATASET/sft_train.parquet \
19
+ data.val_files=$HOME/data/collabllm-$DATASET/sft_validation.parquet \
20
+ data.multiturn.enable=true \
21
+ data.multiturn.messages_key=prompt \
22
+ optim.lr=1e-6 \
23
+ data.train_batch_size=64 \
24
+ data.micro_batch_size_per_gpu=2 \
25
+ data.max_length=8196 \
26
+ model.partial_pretrain=Qwen/Qwen2.5-7B-Instruct \
27
+ trainer.project_name=collabllm-sft-$DATASET \
28
+ trainer.experiment_name=collabllm-sft-qwen2.5-7B-$DATASET \
29
+ trainer.logger=console \
30
+ trainer.total_epochs=3 \
31
+ ulysses_sequence_parallel_size=1 \
32
+ use_remove_padding=true $@
ICL/DAPO/verl-recipe/dapo/README.md ADDED
@@ -0,0 +1,192 @@
1
+ # Recipe: Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)
2
+
3
+ > Open-Source Algorithm Implementation & Experiment Running: [Yuxuan Tong](https://tongyx361.github.io/), [Guangming Sheng](https://hk.linkedin.com/in/guangming-sheng-b50640211)
4
+
5
+ > [!IMPORTANT]
6
+ >
7
+ > **🔥 News!!!**
8
+ >
9
+ > - [2025/04] We reproduced the results of two versions of DAPO ([Full](./run_dapo_qwen2.5_32b.sh) & [w/o Dynamic Sampling](./run_dapo_wo_ds_qwen2.5_32b.sh)), achieving 52% and 50% on AIME 2024 respectively, based on [the latest codebase on `recipe/dapo`](https://github.com/volcengine/verl/tree/recipe/dapo/recipe/dapo). Please check the details in [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n).
10
+ > - [2025/03] We published the training record of [an early version of DAPO (w/o Token-level PG Loss & Dynamic Sampling)](./run_dapo_early_qwen2.5_32b.sh), achieving 44% on AIME 2024, in [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n).
11
+
12
+ 🏠 [Homepage](https://dapo-sia.github.io/) | 📝 [Paper@arXiv](https://arxiv.org/abs/2503.14476) | 🤗 [Datasets&Models@HF](https://huggingface.co/collections/BytedTsinghua-SIA/dapo-67d7f1517ee33c8aed059da0) | 🐱 [Code@GitHub](https://github.com/volcengine/verl/tree/recipe/dapo/recipe/dapo) | 🐱 [Repo@GitHub](https://github.com/BytedTsinghua-SIA/DAPO)
13
+
14
+ > We propose the **D**ecoupled Clip and Dynamic s**A**mpling **P**olicy **O**ptimization (DAPO) algorithm. By making our work publicly available, we provide the broader research community and society with practical access to scalable reinforcement learning, enabling all to benefit from these advancements. Our system is based on the awesome [verl](https://github.com/volcengine/verl) framework. Thanks for their great work! Applying DAPO training to Qwen2.5-32B base model proves to outperform the previous state-of-the-art DeepSeek-R1-Zero-Qwen-32B on AIME 2024, achieving **50%** accuracy with **50%** less training steps.
15
+ >
16
+ > ![dapo-main-result](https://dapo-sia.github.io/static/images/score.png)
17
+
18
+ ## Quickstart
19
+
20
+ 1. Prepare the datasets **on the Ray cluster**:
21
+
22
+ ```bash
23
+ bash prepare_dapo_data.sh # This downloads the datasets to ${HOME}/verl/data by default
24
+ ```
25
+
26
+ 2. Submit the job to the Ray cluster **from any machine**:
27
+
28
+ ```bash
29
+ cd verl # Repo root
30
+ export RAY_ADDRESS="http://${RAY_IP:-localhost}:8265" # The Ray cluster address to connect to
31
+ export WORKING_DIR="${PWD}" # The local directory to package to the Ray cluster
32
+ # Set the runtime environment like env vars and pip packages for the Ray cluster in yaml
33
+ export RUNTIME_ENV="./recipe/dapo/runtime_env.yaml" # This sets environment variables for the Ray cluster
34
+ bash recipe/dapo/run_dapo_qwen2.5_32b.sh # or other scripts
35
+ ```
36
+
37
+ ## Reproduction Runs
38
+
39
+ | Setup | AIME 2024 Acc. | Hardware | Image | Commit | Environment Variables | Training Script | Training Record |
40
+ | -------------------------------------------- | -------------- | --------- | -------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
41
+ | DAPO | 52% | 16x8xH800 | `hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.3-flashinfer0.2.2-cxx11abi0` | [`4f80e4`](https://github.com/volcengine/verl/tree/4f80e465c2ec79ab9c3c30ec74b9745de61d0490) | [runtime_env.yaml](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/runtime_env.yaml) | [run_dapo_qwen2.5_32b.sh](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/run_dapo_qwen2.5_32b.sh) | [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n) |
42
+ | DAPO w/o Dynamic Sampling | 50% | 16x8xH800 | `hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.3-flashinfer0.2.2-cxx11abi0` | [`4f80e4`](https://github.com/volcengine/verl/tree/4f80e465c2ec79ab9c3c30ec74b9745de61d0490) | [runtime_env.yaml](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/runtime_env.yaml) | [run_dapo_wo_ds_qwen2.5_32b.sh](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/run_dapo_wo_ds_qwen2.5_32b.sh) | [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n) |
43
+ | DAPO w/o Token-level Loss & Dynamic Sampling | 44% | 16x8xH20 | `hiyouga/verl:ngc-th2.5.1-cu120-vllm0.7.4-hotfix` | [`4f80e4`](https://github.com/volcengine/verl/tree/4f80e465c2ec79ab9c3c30ec74b9745de61d0490) | [runtime_env.yaml](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/runtime_env.yaml) | [run_dapo_early_qwen2.5_32b.sh](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/run_dapo_early_qwen2.5_32b.sh) | [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n) |
44
+
45
+ > [!IMPORTANT]
46
+ >
47
+ > **📢 Call for Contribution!**
48
+ >
49
+ > Welcome to submit your reproduction runs and setups!
50
+
51
+ ## Configuration
52
+
53
+ ### Separated Clip Epsilons (-> Clip-Higher)
54
+
55
+ An example configuration:
56
+
57
+ ```yaml
58
+ actor_rollout_ref:
59
+ actor:
60
+ clip_ratio_low: 0.2
61
+ clip_ratio_high: 0.28
62
+ ```
63
+
64
+ `clip_ratio_low` and `clip_ratio_high` specify the $\varepsilon_{\text {low }}$ and $\varepsilon_{\text {high }}$ in the DAPO objective.
65
+
66
+ Core relevant code:
67
+
68
+ ```python
69
+ pg_losses1 = -advantages * ratio
70
+ pg_losses2 = -advantages * torch.clamp(ratio, 1 - cliprange_low, 1 + cliprange_high)
71
+ pg_losses = torch.maximum(pg_losses1, pg_losses2)
72
+ ```
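The asymmetric clipping can be illustrated with a minimal numeric sketch. This assumes scalar ratio/advantage values for readability; the actual implementation above operates on token-level torch tensors.

```python
# Minimal sketch of the decoupled clip (Clip-Higher); default epsilons
# follow the example config above (0.2 low, 0.28 high).
def dapo_pg_loss(ratio, advantage, clip_low=0.2, clip_high=0.28):
    clipped = min(max(ratio, 1 - clip_low), 1 + clip_high)
    return max(-advantage * ratio, -advantage * clipped)

# With a positive advantage, ratios up to 1 + clip_high still receive
# gradient, while the lower clip boundary stays at 1 - clip_low = 0.8.
print(dapo_pg_loss(1.25, 1.0))  # -1.25 (inside the widened upper range)
print(dapo_pg_loss(1.50, 1.0))  # clipped at 1 + clip_high
```

Raising only the upper epsilon lets low-probability ("exploration") tokens be up-weighted further before clipping kicks in, without loosening the lower bound.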
73
+
74
+ ### Dynamic Sampling (with Group Filtering)
75
+
76
+ An example configuration:
77
+
78
+ ```yaml
79
+ data:
80
+ gen_batch_size: 1536
81
+ train_batch_size: 512
82
+ algorithm:
83
+ filter_groups:
84
+ enable: True
85
+ metric: acc # score / seq_reward / seq_final_reward / ...
86
+ max_num_gen_batches: 10 # Non-positive values mean no upper limit
87
+ ```
88
+
89
+ Setting `filter_groups.enable` to `True` will filter out groups whose outputs' `metric` are all the same, e.g., for `acc`, groups whose outputs' accuracies are all 1 or 0.
90
+
91
+ The trainer will repeat sampling with `gen_batch_size` until there are enough qualified groups for `train_batch_size` or reaching the upper limit specified by `max_num_gen_batches`.
92
+
93
+ Core relevant code:
94
+
95
+ ```python
96
+ prompt_bsz = self.config.data.train_batch_size
97
+ if num_prompt_in_batch < prompt_bsz:
98
+ print(f'{num_prompt_in_batch=} < {prompt_bsz=}')
99
+ num_gen_batches += 1
100
+ max_num_gen_batches = self.config.algorithm.filter_groups.max_num_gen_batches
101
+ if max_num_gen_batches <= 0 or num_gen_batches < max_num_gen_batches:
102
+ print(f'{num_gen_batches=} < {max_num_gen_batches=}. Keep generating...')
103
+ continue
104
+ else:
105
+ raise ValueError(
106
+ f'{num_gen_batches=} >= {max_num_gen_batches=}. Generated too many. Please check your data.'
107
+ )
108
+ else:
109
+ # Align the batch
110
+ traj_bsz = self.config.data.train_batch_size * self.config.actor_rollout_ref.rollout.n
111
+ batch = batch[:traj_bsz]
112
+ ```
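The group-filtering criterion itself can be sketched as follows. This is an assumed simplification of the trainer logic above: a group whose rollouts all share the same metric value (e.g., all-correct or all-wrong under `acc`) carries zero advantage signal and is dropped.

```python
# Hedged sketch of group filtering: keep only prompt groups whose
# per-rollout metric values are not all identical.
def filter_groups(groups):
    """groups: dict mapping prompt id -> list of metric values (e.g. acc)."""
    return {pid: vals for pid, vals in groups.items() if len(set(vals)) > 1}

groups = {
    "p0": [1, 1, 1, 1],  # all correct -> filtered out
    "p1": [0, 0, 0, 0],  # all wrong   -> filtered out
    "p2": [0, 1, 1, 0],  # mixed       -> kept
}
print(sorted(filter_groups(groups)))  # ['p2']
```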
113
+
114
+ ### Flexible Loss Aggregation Mode (-> Token-level Loss)
115
+
116
+ An example configuration:
117
+
118
+ ```yaml
119
+ actor_rollout_ref:
120
+ actor:
121
+ loss_agg_mode: "token-mean" # / "seq-mean-token-sum" / "seq-mean-token-mean"
122
+ # NOTE: "token-mean" is the default behavior
123
+ ```
124
+
125
+ Setting `loss_agg_mode` to `token-mean` averages the (policy gradient) loss across all the tokens of all the sequences in a mini-batch.
126
+
127
+ Core relevant code:
128
+
129
+ ```python
130
+ if loss_agg_mode == "token-mean":
131
+ loss = verl_F.masked_mean(loss_mat, loss_mask)
132
+ elif loss_agg_mode == "seq-mean-token-sum":
133
+ seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) # token-sum
134
+ loss = torch.mean(seq_losses) # seq-mean
135
+ elif loss_agg_mode == "seq-mean-token-mean":
136
+ seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) / torch.sum(loss_mask, dim=-1) # token-mean
137
+ loss = torch.mean(seq_losses) # seq-mean
138
+ else:
139
+ raise ValueError(f"Invalid loss_agg_mode: {loss_agg_mode}")
140
+ ```
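The difference between the modes shows up when sequence lengths vary. Here is a toy comparison assuming plain Python lists in place of the torch tensors used above:

```python
# Toy comparison of loss aggregation modes on two sequences of
# different (masked) lengths.
def aggregate(loss_mat, loss_mask, mode):
    if mode == "token-mean":
        total = sum(l * m for rl, rm in zip(loss_mat, loss_mask)
                    for l, m in zip(rl, rm))
        return total / sum(m for row in loss_mask for m in row)
    if mode == "seq-mean-token-mean":
        seq = [sum(l * m for l, m in zip(rl, rm)) / sum(rm)
               for rl, rm in zip(loss_mat, loss_mask)]
        return sum(seq) / len(seq)
    raise ValueError(mode)

loss_mat = [[1.0, 1.0, 1.0, 1.0], [4.0, 0.0, 0.0, 0.0]]
loss_mask = [[1, 1, 1, 1], [1, 0, 0, 0]]
# token-mean weights every token equally: (4*1.0 + 4.0) / 5 = 1.6
# seq-mean-token-mean weights sequences equally: (1.0 + 4.0) / 2 = 2.5
print(aggregate(loss_mat, loss_mask, "token-mean"))           # 1.6
print(aggregate(loss_mat, loss_mask, "seq-mean-token-mean"))  # 2.5
```

Token-level aggregation prevents long sequences from being down-weighted, which matters for long chain-of-thought outputs.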
141
+
142
+ ### Overlong Reward Shaping
143
+
144
+ An example configuration:
145
+
146
+ ```yaml
147
+ data:
148
+ max_response_length: 20480 # 16384 + 4096
149
+ reward_model:
150
+ overlong_buffer:
151
+ enable: True
152
+ len: 4096
153
+ penalty_factor: 1.0
154
+ ```
155
+
156
+ Setting `overlong_buffer.enable` to `True` will penalize the outputs whose lengths are overlong but still within the hard context limit.
157
+
158
+ Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` as the output length exceeds `max_response_length - overlong_buffer.len` by `0` to `overlong_buffer.len` tokens.
159
+
160
+ Core relevant code:
161
+
162
+ ```python
163
+ if self.overlong_buffer_cfg.enable:
164
+ overlong_buffer_len = self.overlong_buffer_cfg.len
165
+ expected_len = self.max_resp_len - overlong_buffer_len
166
+ exceed_len = valid_response_length - expected_len
167
+ overlong_penalty_factor = self.overlong_buffer_cfg.penalty_factor
168
+ overlong_reward = min(-exceed_len / overlong_buffer_len * overlong_penalty_factor, 0)
169
+ reward += overlong_reward
170
+ ```
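A worked sketch of the penalty, plugging in the example config values above (`max_response_length=20480`, buffer of 4096, penalty factor 1.0):

```python
# Sketch of the overlong penalty with the example config values;
# defaults here are assumptions taken from the YAML snippet above.
def overlong_reward(valid_response_length, max_resp_len=20480,
                    buffer_len=4096, penalty_factor=1.0):
    expected_len = max_resp_len - buffer_len  # 16384
    exceed_len = valid_response_length - expected_len
    return min(-exceed_len / buffer_len * penalty_factor, 0)

print(overlong_reward(16000))  # 0    (within the expected length)
print(overlong_reward(18432))  # -0.5 (halfway into the buffer)
print(overlong_reward(20480))  # -1.0 (full penalty at the hard limit)
```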
171
+
172
+ ## FAQ
173
+
174
+ ### Where is the "Overlong Filtering" in the paper?
175
+
176
+ Most experiments in the paper, including the best-performing one, are run without Overlong Filtering because it largely overlaps with Overlong Reward Shaping in terms of properly learning from the longest outputs. So we don't implement it here.
177
+
178
+ ### What's the difference between [the `recipe/dapo` directory in the `main` branch](https://github.com/volcengine/verl-recipe/tree/main/dapo) and the [`recipe/dapo` branch](https://github.com/volcengine/verl/tree/recipe/dapo/recipe/dapo)?
179
+
180
+ [The `recipe/dapo` branch](https://github.com/volcengine/verl/tree/recipe/dapo/recipe/dapo) is for **as-is reproduction** and thus won't be updated with new features.
181
+
182
+ [The `recipe/dapo` directory in the `main` branch](https://github.com/volcengine/verl-recipe/tree/main/dapo) works as an example of how to extend the latest `verl` to implement an algorithm recipe, which will be maintained with new features.
183
+
184
+ ### Why can't I produce similar results after modifications?
185
+
186
+ RL infrastructure today still has inherent robustness issues, which we are working hard to improve.
187
+
188
+ We strongly recommend modifying only one thing at a time.
189
+
190
+ We also list some known problems here:
191
+
192
+ 1. Enabling CUDA graph (`enforce_eager=False`) might cause model performance degradation; the cause is still under investigation.
ICL/DAPO/verl-recipe/dapo/dapo_ray_trainer.py ADDED
@@ -0,0 +1,418 @@
1
+ # Copyright 2024 Bytedance Ltd. and/or its affiliates
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ """
15
+ FSDP PPO Trainer with Ray-based single controller.
16
+ This trainer supports model-agnostic model initialization with HuggingFace.
17
+ """
18
+
19
+ import os
20
+ import uuid
21
+ from collections import defaultdict
22
+ from copy import deepcopy
23
+ from pprint import pprint
24
+
25
+ import numpy as np
26
+ import torch
27
+ from tqdm import tqdm
28
+
29
+ from verl import DataProto
30
+ from verl.trainer.ppo.core_algos import agg_loss
31
+ from verl.trainer.ppo.metric_utils import compute_data_metrics, compute_throughout_metrics, compute_timing_metrics
32
+ from verl.trainer.ppo.ray_trainer import (
33
+ AdvantageEstimator,
34
+ RayPPOTrainer,
35
+ apply_kl_penalty,
36
+ compute_advantage,
37
+ compute_response_mask,
38
+ )
39
+ from verl.trainer.ppo.reward import compute_reward
40
+ from verl.utils.metric import reduce_metrics
41
+ from verl.utils.profiler import marked_timer
42
+ from verl.utils.rollout_skip import RolloutSkip
43
+
44
+
45
+ class RayDAPOTrainer(RayPPOTrainer):
46
+ """
47
+ Note that this trainer runs on the driver process on a single CPU/GPU node.
48
+ """
49
+
50
+ def compute_kl_related_metrics(self, batch: DataProto, metrics: dict, timing_raw: dict):
51
+ batch.batch["response_mask"] = compute_response_mask(batch)
52
+
53
+ # recompute old_log_probs
54
+ with marked_timer("old_log_prob", timing_raw, "blue"):
55
+ old_log_prob = self.actor_rollout_wg.compute_log_prob(batch)
56
+ entropys = old_log_prob.batch["entropys"]
57
+ response_masks = batch.batch["response_mask"]
58
+ loss_agg_mode = self.config.actor_rollout_ref.actor.loss_agg_mode
59
+ entropy_agg = agg_loss(loss_mat=entropys, loss_mask=response_masks, loss_agg_mode=loss_agg_mode)
60
+ old_log_prob_metrics = {"actor/entropy": entropy_agg.detach().item()}
61
+ metrics.update(old_log_prob_metrics)
62
+ old_log_prob.batch.pop("entropys")
63
+ batch = batch.union(old_log_prob)
64
+
65
+ if self.use_reference_policy:
66
+ # compute reference log_prob
67
+ with marked_timer("ref", timing_raw, "olive"):
68
+ if not self.ref_in_actor:
69
+ ref_log_prob = self.ref_policy_wg.compute_ref_log_prob(batch)
70
+ else:
71
+ ref_log_prob = self.actor_rollout_wg.compute_ref_log_prob(batch)
72
+ batch = batch.union(ref_log_prob)
73
+
74
+ return batch
75
+
76
+ def fit(self):
77
+ """
78
+ The training loop of PPO.
79
+ The driver process only needs to call the compute functions of the worker group through RPC
80
+ to construct the PPO dataflow.
81
+ The lightweight advantage computation is done on the driver process.
82
+ """
83
+ from omegaconf import OmegaConf
84
+
85
+ from verl.utils.tracking import Tracking
86
+
87
+ logger = Tracking(
88
+ project_name=self.config.trainer.project_name,
89
+ experiment_name=self.config.trainer.experiment_name,
90
+ default_backend=self.config.trainer.logger,
91
+ config=OmegaConf.to_container(self.config, resolve=True),
92
+ )
93
+
94
+ self.global_steps = 0
95
+ self.gen_steps = 0
96
+
97
+ # load checkpoint before doing anything
98
+ self._load_checkpoint()
99
+
100
+ # perform validation before training
101
+ # currently, we only support validation using the reward_function.
102
+ if self.val_reward_fn is not None and self.config.trainer.get("val_before_train", True):
103
+ val_metrics = self._validate()
104
+ assert val_metrics, f"{val_metrics=}"
105
+ pprint(f"Initial validation metrics: {val_metrics}")
106
+ logger.log(data=val_metrics, step=self.global_steps)
107
+ if self.config.trainer.get("val_only", False):
108
+ return
109
+
110
+ if self.config.actor_rollout_ref.rollout.get("skip_rollout", False):
111
+ rollout_skip = RolloutSkip(self.config, self.actor_rollout_wg)
112
+ rollout_skip.wrap_generate_sequences()
113
+
114
+ # add tqdm
115
+ progress_bar = tqdm(total=self.total_training_steps, initial=self.global_steps, desc="Training Progress")
116
+
117
+ # we start from step 1
118
+ self.global_steps += 1
119
+ self.gen_steps += 1
120
+ last_val_metrics = None
121
+
122
+ prev_step_profile = False
123
+ curr_step_profile = (
124
+ self.global_steps in self.config.global_profiler.steps
125
+ if self.config.global_profiler.steps is not None
126
+ else False
127
+ )
128
+ next_step_profile = False
129
+
130
+ timing_raw = defaultdict(float)
131
+ batch = None
132
+ num_prompt_in_batch = 0
133
+ num_gen_batches = 0
134
+ for epoch in range(self.config.trainer.total_epochs):
135
+ for batch_dict in self.train_dataloader:
136
+ if hasattr(self.actor_rollout_wg, "async_calls_finalize_fn_exec"):
137
+ self.actor_rollout_wg.async_calls_finalize_fn_exec(blocking=False)
138
+ metrics = {}
139
+
140
+ with marked_timer("start_profile", timing_raw):
141
+ self._start_profiling(
142
+ not prev_step_profile and curr_step_profile
143
+ if self.config.global_profiler.profile_continuous_steps
144
+ else curr_step_profile
145
+ )
146
+
147
+ new_batch: DataProto = DataProto.from_single_dict(batch_dict)
148
+ num_gen_batches += 1
149
+ gen_batch = self._get_gen_batch(new_batch)
150
+ gen_batch_output = gen_batch.repeat(
151
+ repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True
152
+ )
153
+
154
+ is_last_step = self.global_steps >= self.total_training_steps
155
+
156
+ with marked_timer("step", timing_raw):
157
+ # generate a batch
158
+ with marked_timer("gen", timing_raw, "red"):
159
+ gen_batch_output = self.async_rollout_manager.generate_sequences(gen_batch_output)
160
+ timing_raw.update(gen_batch_output.meta_info["timing"])
161
+ gen_batch_output.meta_info.pop("timing", None)
162
+
163
+ if self.config.algorithm.adv_estimator == AdvantageEstimator.REMAX:
164
+ with marked_timer("gen_max", timing_raw, "red"):
165
+ gen_baseline_batch = deepcopy(gen_batch)
166
+ gen_baseline_batch.meta_info["do_sample"] = False
167
+ gen_baseline_output = self.async_rollout_manager.generate_sequences(gen_baseline_batch)
168
+
169
+ new_batch = new_batch.union(gen_baseline_output)
170
+ # compute reward model score on new_batch
171
+ rm_scores = None
172
+ if self.use_rm and "rm_scores" not in new_batch.batch.keys():
173
+ rm_scores = self.rm_wg.compute_rm_score(new_batch)
174
+ new_batch = new_batch.union(rm_scores)
175
+ reward_baseline_tensor, _ = compute_reward(new_batch, self.reward_fn)
176
+ reward_baseline_tensor = reward_baseline_tensor.sum(dim=-1)
177
+
178
+ keys_to_pop = set(gen_baseline_output.batch.keys())
179
+ if rm_scores is not None:
180
+ keys_to_pop.update(rm_scores.batch.keys())
181
+ new_batch.pop(batch_keys=list(keys_to_pop))
182
+
183
+ new_batch.batch["reward_baselines"] = reward_baseline_tensor
184
+
185
+ del rm_scores, gen_baseline_batch, gen_baseline_output
186
+
187
+ new_batch.non_tensor_batch["uid"] = np.array(
188
+ [str(uuid.uuid4()) for _ in range(len(new_batch.batch))], dtype=object
189
+ )
190
+ # repeat to align with repeated responses in rollout
191
+ new_batch = new_batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True)
192
+ new_batch = new_batch.union(gen_batch_output)
193
+
194
+ if self.config.algorithm.use_kl_in_reward:
195
+ # We need these metrics for apply_kl_penalty if using kl in reward
196
+ new_batch = self.compute_kl_related_metrics(new_batch, metrics, timing_raw)
197
+ # otherwise, we will compute those after dynamic sampling
198
+
199
+ with marked_timer("reward", timing_raw, "yellow"):
200
+ # compute scores. Support both model and function-based.
201
+ # We first compute the scores using reward model. Then, we call reward_fn to combine
202
+ # the results from reward model and rule-based results.
203
+ if self.use_rm and "rm_scores" not in new_batch.batch.keys():
204
+ # we first compute reward model score
205
+ reward_tensor = self.rm_wg.compute_rm_score(new_batch)
206
+ new_batch = new_batch.union(reward_tensor)
207
+
208
+ # we combine with rule-based rm
209
+ reward_tensor, reward_extra_infos_dict = compute_reward(new_batch, self.reward_fn)
210
+
211
+ new_batch.batch["token_level_scores"] = reward_tensor
212
+
213
+ if reward_extra_infos_dict:
214
+ new_batch.non_tensor_batch.update(
215
+ {k: np.array(v) for k, v in reward_extra_infos_dict.items()}
216
+ )
217
+
218
+ # compute rewards. apply_kl_penalty if available
219
+ if self.config.algorithm.use_kl_in_reward:
220
+ new_batch, kl_metrics = apply_kl_penalty(
221
+ new_batch, kl_ctrl=self.kl_ctrl_in_reward, kl_penalty=self.config.algorithm.kl_penalty
222
+ )
223
+ metrics.update(
224
+ kl_metrics
225
+ ) # TODO: This will be cleared if we use multiple genenration batches
226
+ else:
227
+ new_batch.batch["token_level_rewards"] = new_batch.batch["token_level_scores"]
228
+
229
+ if not self.config.algorithm.filter_groups.enable:
230
+ batch = new_batch
231
+ else: # NOTE: When the number of prompts after filtering is less than the train batch size,
232
+ # we skip to the next generation batch
233
+ metric_name = self.config.algorithm.filter_groups.metric
234
+ if metric_name == "seq_final_reward":
235
+ # Turn to numpy for easier filtering
236
+ new_batch.non_tensor_batch["seq_final_reward"] = (
237
+ new_batch.batch["token_level_rewards"].sum(dim=-1).numpy()
238
+ )
239
+ elif metric_name == "seq_reward":
240
+ new_batch.non_tensor_batch["seq_reward"] = (
241
+ new_batch.batch["token_level_scores"].sum(dim=-1).numpy()
242
+ )
243
+
244
+ # Collect the sequence reward for each trajectory
245
+ prompt_uid2metric_vals = defaultdict(list)
246
+ for uid, metric_val in zip(
247
+ new_batch.non_tensor_batch["uid"], new_batch.non_tensor_batch[metric_name], strict=True
248
+ ):
249
+ prompt_uid2metric_vals[uid].append(metric_val)
250
+
251
+ prompt_uid2metric_std = {}
252
+ for prompt_uid, metric_vals in prompt_uid2metric_vals.items():
253
+ prompt_uid2metric_std[prompt_uid] = np.std(metric_vals)
254
+
255
+ kept_prompt_uids = [
256
+ uid
257
+ for uid, std in prompt_uid2metric_std.items()
258
+ if std > 0 or len(prompt_uid2metric_vals[uid]) == 1
259
+ ]
260
+ num_prompt_in_batch += len(kept_prompt_uids)
261
+
262
+ kept_traj_idxs = []
263
+ for idx, traj_from_prompt_uid in enumerate(new_batch.non_tensor_batch["uid"]):
264
+ if traj_from_prompt_uid in kept_prompt_uids:
265
+ kept_traj_idxs.append(idx)
266
+
267
+ new_batch = new_batch[kept_traj_idxs]
268
+ batch = new_batch if batch is None else DataProto.concat([batch, new_batch])
269
+
270
+ prompt_bsz = self.config.data.train_batch_size
271
+ if num_prompt_in_batch < prompt_bsz:
272
+ print(f"{num_prompt_in_batch=} < {prompt_bsz=}")
273
+ max_num_gen_batches = self.config.algorithm.filter_groups.max_num_gen_batches
274
+ if max_num_gen_batches <= 0 or num_gen_batches < max_num_gen_batches:
275
+ print(f"{num_gen_batches=}. Keep generating...")
276
+ self.gen_steps += 1
277
+ is_last_step = self.global_steps >= self.total_training_steps
278
+ continue
279
+ else:
280
+ raise ValueError(
281
+ f"{num_gen_batches=} >= {max_num_gen_batches=}."
282
+ + " Generated too many. Please check if your data are too difficult."
283
+ + " You could also try setting max_num_gen_batches=0 to enable endless trials."
284
+ )
285
+ else:
286
+ # Align the batch
287
+ traj_bsz = self.config.data.train_batch_size * self.config.actor_rollout_ref.rollout.n
288
+ batch = batch[:traj_bsz]
289
+
290
+ # === Updating ===
291
+ # Balance the number of valid tokens across DP ranks.
292
+ # NOTE: This usually changes the order of data in the `batch`,
293
+ # which won't affect the advantage calculation (since it's based on uid),
294
+ # but might affect the loss calculation (due to the change of mini-batching).
295
+ # TODO: Decouple the DP balancing and mini-batching.
296
+ if self.config.trainer.balance_batch:
297
+ self._balance_batch(batch, metrics=metrics)
298
+
299
+ # compute global_valid tokens
300
+ batch.meta_info["global_token_num"] = torch.sum(batch.batch["attention_mask"], dim=-1).tolist()
301
+
302
+ if not self.config.algorithm.use_kl_in_reward:
303
+ batch = self.compute_kl_related_metrics(batch, metrics, timing_raw)
304
+
305
+ # compute values
306
+ if self.use_critic:
307
+ with marked_timer("values", timing_raw, "cyan"):
308
+ values = self.critic_wg.compute_values(batch)
309
+ batch = batch.union(values)
310
+
311
+ # Compute rollout correction weights and off-policy metrics (inherited from RayPPOTrainer)
312
+ from verl.trainer.ppo.rollout_corr_helper import compute_rollout_correction_and_add_to_batch
313
+
314
+ rollout_corr_config = self.config.algorithm.get("rollout_correction", None)
315
+ if rollout_corr_config is not None and "rollout_log_probs" in batch.batch:
316
+ batch, is_metrics = compute_rollout_correction_and_add_to_batch(batch, rollout_corr_config)
317
+ # IS and off-policy metrics already have rollout_corr/ prefix
318
+ metrics.update(is_metrics)
319
+
320
+ with marked_timer("adv", timing_raw, "brown"):
321
+ # compute advantages, executed on the driver process
322
+ norm_adv_by_std_in_grpo = self.config.algorithm.get("norm_adv_by_std_in_grpo", True)
323
+ batch = compute_advantage(
324
+ batch,
325
+ adv_estimator=self.config.algorithm.adv_estimator,
326
+ gamma=self.config.algorithm.gamma,
327
+ lam=self.config.algorithm.lam,
328
+ num_repeat=self.config.actor_rollout_ref.rollout.n,
329
+ norm_adv_by_std_in_grpo=norm_adv_by_std_in_grpo,
330
+ )
331
+
332
+ # update critic
333
+ if self.use_critic:
334
+ with marked_timer("update_critic", timing_raw, "pink"):
335
+ critic_output = self.critic_wg.update_critic(batch)
336
+ critic_output_metrics = reduce_metrics(critic_output.meta_info["metrics"])
337
+ metrics.update(critic_output_metrics)
338
+
339
+ # implement critic warmup
340
+ if self.config.trainer.critic_warmup <= self.global_steps:
341
+ # update actor
342
+ with marked_timer("update_actor", timing_raw, "red"):
343
+ actor_output = self.actor_rollout_wg.update_actor(batch)
344
+ actor_output_metrics = reduce_metrics(actor_output.meta_info["metrics"])
345
+ metrics.update(actor_output_metrics)
346
+
347
+ # Log rollout generations if enabled
348
+ rollout_data_dir = self.config.trainer.get("rollout_data_dir", None)
349
+ if rollout_data_dir:
350
+ self._log_rollout_data(batch, reward_extra_infos_dict, timing_raw, rollout_data_dir)
351
+
352
+ # validate
353
+ if (
354
+ self.val_reward_fn is not None
355
+ and self.config.trainer.test_freq > 0
356
+ and (is_last_step or self.global_steps % self.config.trainer.test_freq == 0)
357
+ ):
358
+ with marked_timer("testing", timing_raw, "green"):
359
+ val_metrics: dict = self._validate()
360
+ if is_last_step:
361
+ last_val_metrics = val_metrics
362
+ metrics.update(val_metrics)
363
+
364
+ if self.config.trainer.save_freq > 0 and (
365
+ is_last_step or self.global_steps % self.config.trainer.save_freq == 0
366
+ ):
367
+ with marked_timer("save_checkpoint", timing_raw, "green"):
368
+ self._save_checkpoint()
369
+
370
+ with marked_timer("stop_profile", timing_raw):
371
+ next_step_profile = (
372
+ self.global_steps + 1 in self.config.global_profiler.steps
373
+ if self.config.global_profiler.steps is not None
374
+ else False
375
+ )
376
+ self._stop_profiling(
377
+ curr_step_profile and not next_step_profile
378
+ if self.config.global_profiler.profile_continuous_steps
379
+ else curr_step_profile
380
+ )
381
+ prev_step_profile = curr_step_profile
382
+ curr_step_profile = next_step_profile
383
+
384
+ # collect metrics
385
+ metrics.update(compute_data_metrics(batch=batch, use_critic=self.use_critic))
386
+ metrics.update(compute_timing_metrics(batch=batch, timing_raw=timing_raw))
387
+ # TODO: implement actual tflpo and theoretical tflpo
388
+ n_gpus = self.resource_pool_manager.get_n_gpus()
389
+ metrics.update(compute_throughout_metrics(batch=batch, timing_raw=timing_raw, n_gpus=n_gpus))
390
+ timing_raw = defaultdict(float) # clear timing
391
+
392
+ metrics["train/num_gen_batches"] = num_gen_batches
393
+ batch = None
394
+ num_prompt_in_batch = 0
395
+ num_gen_batches = 0
396
+
397
+ # TODO: make a canonical logger that supports various backend
398
+ logger.log(data=metrics, step=self.global_steps)
399
+
400
+ if is_last_step:
401
+ if hasattr(self.actor_rollout_wg, "async_calls_finalize_fn_exec"):
402
+ self.actor_rollout_wg.async_calls_finalize_fn_exec(blocking=True)
403
+ pprint(f"Final validation metrics: {last_val_metrics}")
404
+ progress_bar.close()
405
+ return
406
+
407
+ progress_bar.update(1)
408
+ self.global_steps += 1
409
+ self.gen_steps += 1
410
+ # check if last-step checkpoint exists
411
+ checkpoint_dir = os.path.join(self.config.trainer.default_local_dir, f"global_step_{self.global_steps}")
412
+ if not os.path.exists(checkpoint_dir):
413
+ # save last step checkpoint
414
+ timing_raw = defaultdict(float)
415
+ with marked_timer("save_checkpoint", timing_raw, "green"):
416
+ self._save_checkpoint()
417
+ metrics = {f"timing/{k}": v for k, v in timing_raw.items()}
418
+ logger.log(data=metrics, step=self.global_steps)
ICL/DAPO/verl-recipe/dapo/main_dapo.py ADDED
@@ -0,0 +1,185 @@
1
+ # Copyright 2024 Bytedance Ltd. and/or its affiliates
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ """
15
+ Note that we don't combine the main entry point with ray_trainer, since ray_trainer is shared by other entry points.
16
+ """
17
+
18
+ import os
19
+ import socket
20
+
21
+ import hydra
22
+ import ray
23
+ from omegaconf import OmegaConf
24
+
25
+ from verl.trainer.constants_ppo import get_ppo_ray_runtime_env
26
+ from verl.trainer.ppo.reward import load_reward_manager
27
+ from verl.utils.device import auto_set_device, is_cuda_available
28
+
29
+ from .dapo_ray_trainer import RayDAPOTrainer
30
+
31
+
32
+ @hydra.main(config_path="config", config_name="dapo_trainer", version_base=None)
33
+ def main(config):
34
+ # Automatically set `config.trainer.device = npu` when running on Ascend NPU.
35
+ auto_set_device(config)
36
+
37
+ run_ppo(config)
38
+
39
+
40
+ def run_ppo(config) -> None:
41
+ if not ray.is_initialized():
42
+ # this is for local ray cluster
43
+ default_runtime_env = get_ppo_ray_runtime_env()
44
+ ray_init_kwargs = config.ray_kwargs.get("ray_init", {})
45
+ runtime_env_kwargs = ray_init_kwargs.get("runtime_env", {})
46
+ runtime_env = OmegaConf.merge(default_runtime_env, runtime_env_kwargs)
47
+ ray_init_kwargs = OmegaConf.create({**ray_init_kwargs, "runtime_env": runtime_env})
48
+ print(f"ray init kwargs: {ray_init_kwargs}")
49
+ ray.init(**OmegaConf.to_container(ray_init_kwargs))
50
+
51
+ try:
52
+ if (
53
+ is_cuda_available
54
+ and config.global_profiler.tool == "nsys"
55
+ and OmegaConf.select(config.global_profiler, "steps") is not None
56
+ and len(OmegaConf.select(config.global_profiler, "steps")) > 0
57
+ ):
58
+ nsight_options = OmegaConf.to_container(
59
+ config.global_profiler.global_tool_config.nsys.controller_nsight_options
60
+ )
61
+ runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
62
+ else:
63
+ runner = TaskRunner.remote()
64
+ ray.get(runner.run.remote(config))
65
+ finally:
66
+ if ray.is_initialized():
67
+ ray.shutdown()
68
+
69
+
70
+ @ray.remote(num_cpus=1) # please make sure main_task is not scheduled on head
71
+ class TaskRunner:
72
+ def run(self, config):
73
+ # print initial config
74
+ from pprint import pprint
75
+
76
+ from omegaconf import OmegaConf
77
+
78
+ from verl.utils.fs import copy_to_local
79
+
80
+ print(f"TaskRunner hostname: {socket.gethostname()}, PID: {os.getpid()}")
81
+
82
+ pprint(OmegaConf.to_container(config, resolve=True)) # resolve=True will eval symbol values
83
+ OmegaConf.resolve(config)
84
+
85
+ # download the checkpoint from hdfs
86
+ local_path = copy_to_local(config.actor_rollout_ref.model.path)
87
+
88
+ # instantiate tokenizer
89
+ from verl.utils import hf_processor, hf_tokenizer
90
+
91
+ trust_remote_code = config.data.get("trust_remote_code", False)
92
+ tokenizer = hf_tokenizer(local_path, trust_remote_code=trust_remote_code)
93
+ # used for multimodal LLM, could be none
94
+ processor = hf_processor(local_path, trust_remote_code=trust_remote_code, use_fast=True)
95
+
96
+ from verl.single_controller.ray import RayWorkerGroup
97
+
98
+ # define worker classes
99
+ if config.actor_rollout_ref.actor.strategy in {"fsdp", "fsdp2"}:
100
+ assert config.critic.strategy in {"fsdp", "fsdp2"}
101
+
102
+ from verl.workers.fsdp_workers import AsyncActorRolloutRefWorker, CriticWorker
103
+
104
+ ray_worker_group_cls = RayWorkerGroup
105
+
106
+ elif config.actor_rollout_ref.actor.strategy == "megatron":
107
+ assert config.actor_rollout_ref.actor.strategy == config.critic.strategy
108
+ from verl.workers.megatron_workers import AsyncActorRolloutRefWorker, CriticWorker
109
+
110
+ ray_worker_group_cls = RayWorkerGroup
111
+
112
+ else:
113
+ raise NotImplementedError
114
+
115
+ from verl.trainer.ppo.ray_trainer import ResourcePoolManager, Role
116
+
117
+ role_worker_mapping = {
118
+ Role.ActorRollout: ray.remote(AsyncActorRolloutRefWorker),
119
+ Role.Critic: ray.remote(CriticWorker),
120
+ }
121
+
122
+ global_pool_id = "global_pool"
123
+ resource_pool_spec = {
124
+ global_pool_id: [config.trainer.n_gpus_per_node] * config.trainer.nnodes,
125
+ }
126
+ mapping = {
127
+ Role.ActorRollout: global_pool_id,
128
+ Role.Critic: global_pool_id,
129
+ }
130
+
131
+ # we should adopt a multi-source reward function here
132
+ # - for rule-based rm, we directly call a reward score
133
+ # - for model-based rm, we call a model
134
+ # - for code related prompt, we send to a sandbox if there are test cases
135
+ # - finally, we combine all the rewards together
136
+ # - The reward type depends on the tag of the data
137
+ if config.reward_model.enable:
138
+ if config.reward_model.strategy in {"fsdp", "fsdp2"}:
139
+ from verl.workers.fsdp_workers import RewardModelWorker
140
+ elif config.reward_model.strategy == "megatron":
141
+ from verl.workers.megatron_workers import RewardModelWorker
142
+ else:
143
+ raise NotImplementedError
144
+ role_worker_mapping[Role.RewardModel] = ray.remote(RewardModelWorker)
145
+ mapping[Role.RewardModel] = global_pool_id
146
+
147
+ # reference model
148
+ if config.algorithm.use_kl_in_reward or config.actor_rollout_ref.actor.use_kl_loss:
149
+ role_worker_mapping[Role.RefPolicy] = ray.remote(AsyncActorRolloutRefWorker)
150
+ mapping[Role.RefPolicy] = global_pool_id
151
+
152
+ reward_fn = load_reward_manager(
153
+ config,
154
+ tokenizer,
155
+ 0,
156
+ max_resp_len=config.data.max_response_length,
157
+ overlong_buffer_cfg=config.reward_model.overlong_buffer,
158
+ )
159
+
160
+ # Note that we always use function-based RM for validation
161
+ val_reward_fn = load_reward_manager(
162
+ config,
163
+ tokenizer,
164
+ 1,
165
+ max_resp_len=config.data.max_response_length,
166
+ overlong_buffer_cfg=config.reward_model.overlong_buffer,
167
+ )
168
+ resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping)
169
+
170
+ trainer = RayDAPOTrainer(
171
+ config=config,
172
+ tokenizer=tokenizer,
173
+ processor=processor,
174
+ role_worker_mapping=role_worker_mapping,
175
+ resource_pool_manager=resource_pool_manager,
176
+ ray_worker_group_cls=ray_worker_group_cls,
177
+ reward_fn=reward_fn,
178
+ val_reward_fn=val_reward_fn,
179
+ )
180
+ trainer.init_workers()
181
+ trainer.fit()
182
+
183
+
184
+ if __name__ == "__main__":
185
+ main()
ICL/DAPO/verl-recipe/dapo/prepare_dapo_data.sh ADDED
@@ -0,0 +1,17 @@
1
+ #!/usr/bin/env bash
2
+ set -uxo pipefail
3
+
4
+ export VERL_HOME=${VERL_HOME:-"${HOME}/verl"}
5
+ export TRAIN_FILE=${TRAIN_FILE:-"${VERL_HOME}/data/dapo-math-17k.parquet"}
6
+ export TEST_FILE=${TEST_FILE:-"${VERL_HOME}/data/aime-2024.parquet"}
7
+ export OVERWRITE=${OVERWRITE:-0}
8
+
9
+ mkdir -p "${VERL_HOME}/data"
10
+
11
+ if [ ! -f "${TRAIN_FILE}" ] || [ "${OVERWRITE}" -eq 1 ]; then
12
+ wget -O "${TRAIN_FILE}" "https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k/resolve/main/data/dapo-math-17k.parquet?download=true"
13
+ fi
14
+
15
+ if [ ! -f "${TEST_FILE}" ] || [ "${OVERWRITE}" -eq 1 ]; then
16
+ wget -O "${TEST_FILE}" "https://huggingface.co/datasets/BytedTsinghua-SIA/AIME-2024/resolve/main/data/aime-2024.parquet?download=true"
17
+ fi
ICL/DAPO/verl-recipe/dapo/run dapo_qwen2.5_vl_32b_fsdp2_npu.sh ADDED
@@ -0,0 +1,151 @@
1
+ #!/usr/bin/env bash
2
+ set -xeuo pipefail
3
+
4
+ export VLLM_USE_V1=1
5
+ export HCCL_CONNECT_TIMEOUT=5400
6
+ export VLLM_ASCEND_ENABLE_NZ=0
7
+ export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2
8
+ # Some models are optimized by vLLM Ascend. In some cases, e.g. RLHF training,
9
+ # the optimized model may not be suitable. In this case, set this value to 0 to disable the optimized model.
10
+ export USE_OPTIMIZED_MODEL=0
11
+
12
+ project_name='DAPO'
13
+ exp_name='DAPO-Qwen2.5-vl-32B'
14
+
15
+ adv_estimator=grpo
16
+
17
+ use_kl_in_reward=False
18
+ kl_coef=0.0
19
+ use_kl_loss=False
20
+ kl_loss_coef=0.0
21
+
22
+ clip_ratio_low=0.2
23
+ clip_ratio_high=0.28
24
+
25
+ max_prompt_length=1024
26
+ max_response_length=2048
27
+ enable_overlong_buffer=False
28
+ overlong_buffer_len=$((1024 * 2))
29
+ overlong_penalty_factor=1.0
30
+
31
+ loss_agg_mode="token-mean"
32
+
33
+ enable_filter_groups=True
34
+ filter_groups_metric=acc
35
+ max_num_gen_batches=4
36
+ train_prompt_bsz=64
37
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
38
+ n_resp_per_prompt=8
39
+ train_prompt_mini_bsz=16
40
+
41
+ # Ray
42
+ PWD=./
43
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
44
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
45
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
46
+
47
+ # Paths
48
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
49
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-VL-32B-Instruct"}
50
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
51
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/geo3k/train.parquet"}
52
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/geo3k/test.parquet"}
53
+
54
+ # Algorithm
55
+ temperature=1.0
56
+ top_p=1.0
57
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
58
+ val_top_p=0.7
59
+
60
+ # Performance Related Parameter
61
+ sp_size=4
62
+ use_dynamic_bsz=True
63
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ gen_tp=4
+ fsdp_size=-1
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ --address "${RAY_ADDRESS}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=True \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+ actor_rollout_ref.actor.use_torch_compile=False \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.rollout.enforce_eager=True \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.60 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.name=vllm \
+ +actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True \
+ actor_rollout_ref.actor.strategy=fsdp2 \
+ actor_rollout_ref.ref.strategy=fsdp2 \
+ critic.strategy=fsdp2 \
+ actor_rollout_ref.ref.fsdp_config.param_offload=True \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger=console \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=16 \
+ trainer.nnodes=1 \
+ trainer.val_before_train=True \
+ trainer.test_freq=1 \
+ trainer.save_freq=20 \
+ trainer.resume_mode=auto \
+ trainer.device=npu \
+ trainer.total_epochs=30 \
+ trainer.total_training_steps=100 \
+ trainer.default_local_dir="${CKPTS_DIR}"
ICL/DAPO/verl-recipe/dapo/run dapo_qwen2.5_vl_3b_fsdp2_npu.sh ADDED
@@ -0,0 +1,154 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ export VLLM_USE_V1=1
+ export HCCL_CONNECT_TIMEOUT=5400
+ export VLLM_ASCEND_ENABLE_NZ=0
+ export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2
+ # Some models are optimized by vLLM Ascend. However, in some cases, e.g. RLHF training,
+ # the optimized model may not be suitable. In that case, set this value to 0 to disable it.
+ export USE_OPTIMIZED_MODEL=0
+
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen2.5-vl-3B'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=1024
+ max_response_length=2048
+ enable_overlong_buffer=False
+ overlong_buffer_len=$((1024 * 2))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=4
+ train_prompt_bsz=64
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
+ n_resp_per_prompt=8
+ train_prompt_mini_bsz=16
+
+ # Ray
+ PWD=./
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-VL-3B-Instruct"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/geo3k/train.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/geo3k/test.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance Related Parameter
+ sp_size=1
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ offload=True
+ gen_tp=1
+ fsdp_size=-1
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ --address "${RAY_ADDRESS}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=False \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+ actor_rollout_ref.actor.use_torch_compile=False \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.rollout.enforce_eager=True \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.60 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.name=vllm \
+ +actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True \
+ actor_rollout_ref.actor.strategy=fsdp2 \
+ actor_rollout_ref.ref.strategy=fsdp2 \
+ critic.strategy=fsdp2 \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger=console \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes=1 \
+ trainer.val_before_train=True \
+ trainer.test_freq=1 \
+ trainer.save_freq=20 \
+ trainer.device=npu \
+ trainer.resume_mode=auto \
+ trainer.total_epochs=30 \
+ trainer.total_training_steps=100 \
+ trainer.default_local_dir="${CKPTS_DIR}"
+
ICL/DAPO/verl-recipe/dapo/run dapo_qwen2.5_vl_7b_fsdp2_npu.sh ADDED
@@ -0,0 +1,153 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ export VLLM_USE_V1=1
+ export HCCL_CONNECT_TIMEOUT=5400
+ export VLLM_ASCEND_ENABLE_NZ=0
+ export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2
+
+ # Some models are optimized by vLLM Ascend. However, in some cases, e.g. RLHF training,
+ # the optimized model may not be suitable. In that case, set this value to 0 to disable it.
+ export USE_OPTIMIZED_MODEL=0
+
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen2.5-vl-7B'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=1024
+ max_response_length=2048
+ enable_overlong_buffer=False
+ overlong_buffer_len=$((1024 * 2))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=4
+ train_prompt_bsz=128
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
+ n_resp_per_prompt=8
+ train_prompt_mini_bsz=16
+
+ # Ray
+ PWD=./
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-VL-7B-Instruct"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/geo3k/train.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/geo3k/test.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance Related Parameter
+ sp_size=1
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ gen_tp=1
+ fsdp_size=-1
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ --address "${RAY_ADDRESS}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=False \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+ actor_rollout_ref.actor.use_torch_compile=False \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.rollout.enforce_eager=True \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.60 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.name=vllm \
+ +actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True \
+ actor_rollout_ref.actor.strategy=fsdp2 \
+ actor_rollout_ref.ref.strategy=fsdp2 \
+ critic.strategy=fsdp2 \
+ actor_rollout_ref.ref.fsdp_config.param_offload=True \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger=console \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes=1 \
+ trainer.val_before_train=True \
+ trainer.test_freq=1 \
+ trainer.save_freq=20 \
+ trainer.resume_mode=auto \
+ trainer.device=npu \
+ trainer.total_epochs=30 \
+ trainer.total_training_steps=100 \
+ trainer.default_local_dir="${CKPTS_DIR}"
ICL/DAPO/verl-recipe/dapo/run dapo_qwen3_vl_30b_fsdp2_npu.sh ADDED
@@ -0,0 +1,152 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ export VLLM_USE_V1=1
+ export HCCL_CONNECT_TIMEOUT=5400
+ export VLLM_ASCEND_ENABLE_NZ=0
+ export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2
+ # Some models are optimized by vLLM Ascend. However, in some cases, e.g. RLHF training,
+ # the optimized model may not be suitable. In that case, set this value to 0 to disable it.
+ export USE_OPTIMIZED_MODEL=0
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen3-vl-30B'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=1024
+ max_response_length=2048
+ enable_overlong_buffer=False
+ overlong_buffer_len=$((1024 * 2))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=4
+ train_prompt_bsz=64
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
+ n_resp_per_prompt=8
+ train_prompt_mini_bsz=16
+
+ # Ray
+ PWD=./
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-VL-30B-A3B-Instruct"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/geo3k/train.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/geo3k/test.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance Related Parameter
+ sp_size=8
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ gen_tp=8
+ fsdp_size=16
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ --address "${RAY_ADDRESS}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=False \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
+ actor_rollout_ref.actor.use_torch_compile=False \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.rollout.enforce_eager=True \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.70 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.rollout.expert_parallel_size=8 \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.name=vllm \
+ +actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True \
+ actor_rollout_ref.actor.strategy=fsdp2 \
+ actor_rollout_ref.ref.strategy=fsdp2 \
+ critic.strategy=fsdp2 \
+ actor_rollout_ref.ref.fsdp_config.param_offload=True \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger=console \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes=2 \
+ trainer.val_before_train=True \
+ trainer.test_freq=1 \
+ trainer.save_freq=20 \
+ trainer.resume_mode=auto \
+ trainer.device=npu \
+ trainer.total_epochs=30 \
+ trainer.total_training_steps=100 \
+ trainer.default_local_dir="${CKPTS_DIR}"
ICL/DAPO/verl-recipe/dapo/run_dapo_early_qwen2.5_32b.sh ADDED
@@ -0,0 +1,129 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ project_name='DAPO'
+ exp_name='DAPO-Early-Qwen2.5-32B'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 20))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ # An early version of DAPO
+ loss_agg_mode="seq-mean-token-mean"
+
+ enable_filter_groups=False
+ gen_prompt_bsz=512 # NOTE: no filtering here
+ train_prompt_bsz=512
+ train_prompt_mini_bsz=32
+ n_resp_per_prompt=16
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-16}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-32B"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+
+ # Performance Related Parameter
+ sp_size=8
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
+ infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
+ offload=True
+ gen_tp=4
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger='["console","wandb"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=True \
+ trainer.test_freq=5 \
+ trainer.save_freq=5 \
+ trainer.total_epochs=1 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b.sh ADDED
@@ -0,0 +1,131 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen2.5-32B'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 20))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=10
+ train_prompt_bsz=512
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-16}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-32B"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance Related Parameter
+ sp_size=8
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
+ infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
+ offload=True
+ gen_tp=4
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
90
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
91
+ actor_rollout_ref.actor.optim.lr=1e-6 \
92
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
93
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
94
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
95
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
96
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
97
+ actor_rollout_ref.actor.entropy_coeff=0 \
98
+ actor_rollout_ref.actor.grad_clip=1.0 \
99
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
100
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
101
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
102
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
103
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
104
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
105
+ actor_rollout_ref.rollout.temperature=${temperature} \
106
+ actor_rollout_ref.rollout.top_p=${top_p} \
107
+ actor_rollout_ref.rollout.top_k="${top_k}" \
108
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
109
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
110
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
111
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
112
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
113
+ actor_rollout_ref.rollout.name=vllm \
114
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
115
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
116
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
117
+ reward_model.reward_manager=dapo \
118
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
119
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
120
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
121
+ trainer.logger='["console","wandb"]' \
122
+ trainer.project_name="${project_name}" \
123
+ trainer.experiment_name="${exp_name}" \
124
+ trainer.n_gpus_per_node=8 \
125
+ trainer.nnodes="${NNODES}" \
126
+ trainer.val_before_train=True \
127
+ trainer.test_freq=5 \
128
+ trainer.save_freq=5 \
129
+ trainer.total_epochs=1 \
130
+ trainer.default_local_dir="${CKPTS_DIR}" \
131
+ trainer.resume_mode=auto
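The GPU and NPU variants of this recipe size `ppo_max_token_len_per_gpu` differently: the script above budgets the full prompt-plus-response length per GPU, while the NPU scripts below divide that budget by the Ulysses sequence-parallel size. A minimal Python sketch of the arithmetic (variable names mirror the shell variables; this is illustration only, not verl code):

```python
# Sketch of the per-GPU token-budget arithmetic used by these scripts.
max_prompt_length = 1024 * 2        # 2048
max_response_length = 1024 * 20     # 20480
sp_size = 8                         # Ulysses sequence-parallel size

# run_dapo_qwen2.5_32b.sh budgets the full sequence per GPU:
full_budget = max_prompt_length + max_response_length

# The NPU variants divide by sp_size, since with Ulysses sequence
# parallelism each rank holds only 1/sp_size of every sequence:
per_rank_budget = (max_prompt_length + max_response_length) // sp_size

print(full_budget, per_rank_budget)  # prints "22528 2816"
```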
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b_fsdp2_20k_npu.sh ADDED
@@ -0,0 +1,151 @@
1
+ #!/usr/bin/env bash
2
+ set -xeuo pipefail
3
+
4
+ export VLLM_USE_V1=1
5
+ export HCCL_OP_EXPANSION_MODE="AIV"
6
+ export VLLM_ASCEND_ENABLE_FLASHCOMM=1
7
+ export HCCL_EXEC_TIMEOUT=3600
8
+ export HCCL_CONNECT_TIMEOUT=3600
9
+
10
+ project_name='DAPO'
11
+ exp_name='DAPO-Qwen2.5-32B'
12
+
13
+ adv_estimator=grpo
14
+
15
+ use_kl_in_reward=False
16
+ kl_coef=0.0
17
+ use_kl_loss=False
18
+ kl_loss_coef=0.0
19
+
20
+ clip_ratio_low=0.2
21
+ clip_ratio_high=0.28
22
+
23
+ max_prompt_length=$((1024 * 2))
24
+ max_response_length=$((1024 * 20))
25
+ enable_overlong_buffer=True
26
+ overlong_buffer_len=$((1024 * 4))
27
+ overlong_penalty_factor=1.0
28
+
29
+ loss_agg_mode="token-mean"
30
+
31
+ enable_filter_groups=True
32
+ filter_groups_metric=acc
33
+ max_num_gen_batches=10
34
+ train_prompt_bsz=32
35
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
36
+ n_resp_per_prompt=16
37
+ train_prompt_mini_bsz=32
38
+
39
+ # Ray
40
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
41
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
42
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
43
+ NNODES=${NNODES:-1}
44
+ # Paths
45
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
46
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-32B"}
47
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
48
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
49
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
50
+
51
+ # Algorithm
52
+ temperature=1.0
53
+ top_p=1.0
54
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
55
+ val_top_p=0.7
56
+
57
+ # Performance Related Parameter
58
+ sp_size=8
59
+ use_dynamic_bsz=True
60
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
61
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
62
+ offload=True
63
+ gen_tp=4
64
+ gen_dp=1
65
+ enable_chunked_prefill=True
66
+
67
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
68
+ --working-dir "${WORKING_DIR}" \
69
+ -- python3 -m recipe.dapo.main_dapo \
70
+ data.train_files="${TRAIN_FILE}" \
71
+ data.val_files="${TEST_FILE}" \
72
+ data.prompt_key=prompt \
73
+ data.truncation='left' \
74
+ data.max_prompt_length=${max_prompt_length} \
75
+ data.max_response_length=${max_response_length} \
76
+ data.gen_batch_size=${gen_prompt_bsz} \
77
+ data.train_batch_size=${train_prompt_bsz} \
78
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
79
+ algorithm.adv_estimator=${adv_estimator} \
80
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
81
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
82
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
83
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
84
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
85
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
86
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
87
+ algorithm.filter_groups.enable=${enable_filter_groups} \
88
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
89
+ algorithm.filter_groups.metric=${filter_groups_metric} \
90
+ actor_rollout_ref.model.use_remove_padding=True \
91
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
92
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
93
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
94
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
95
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
96
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
97
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
98
+ +actor_rollout_ref.model.override_config.attention_dropout=0. \
99
+ +actor_rollout_ref.model.override_config.embd_pdrop=0. \
100
+ +actor_rollout_ref.model.override_config.resid_pdrop=0. \
101
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
102
+ actor_rollout_ref.actor.optim.lr=1e-6 \
103
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
104
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
105
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
106
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
107
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
108
+ actor_rollout_ref.actor.entropy_coeff=0 \
109
+ actor_rollout_ref.actor.grad_clip=1.0 \
110
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
111
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
112
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.60 \
113
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
114
+ actor_rollout_ref.rollout.data_parallel_size=${gen_dp} \
115
+ actor_rollout_ref.rollout.enable_chunked_prefill=${enable_chunked_prefill} \
116
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
117
+ actor_rollout_ref.rollout.temperature=${temperature} \
118
+ actor_rollout_ref.rollout.top_p=${top_p} \
119
+ actor_rollout_ref.rollout.top_k="${top_k}" \
120
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
121
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
122
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
123
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
124
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
125
+ actor_rollout_ref.rollout.name=vllm \
126
+ actor_rollout_ref.rollout.enforce_eager=True \
127
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
128
+ actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
129
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
130
+ actor_rollout_ref.actor.strategy=fsdp2 \
131
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
132
+ actor_rollout_ref.rollout.expert_parallel_size=1 \
133
+ reward_model.reward_manager=dapo \
134
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
135
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
136
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
137
+ trainer.logger='["console"]' \
138
+ trainer.project_name="${project_name}" \
139
+ trainer.experiment_name="${exp_name}" \
140
+ trainer.n_gpus_per_node=16 \
141
+ trainer.nnodes="${NNODES}" \
142
+ trainer.val_before_train=False \
143
+ trainer.test_freq=100 \
144
+ trainer.save_freq=100 \
145
+ trainer.total_epochs=100 \
146
+ trainer.default_local_dir="${CKPTS_DIR}" \
147
+ trainer.resume_mode=auto \
148
+ trainer.device='npu' \
149
+ actor_rollout_ref.actor.use_torch_compile=False \
150
+ actor_rollout_ref.ref.use_torch_compile=False "$@"
151
+
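The `overlong_buffer` settings in these scripts correspond to DAPO's soft overlong punishment: responses that land within the last `overlong_buffer_len` tokens before `max_response_length` receive a linearly growing length penalty. A hedged sketch of that shaping, following the DAPO paper's description (`overlong_penalty` is a hypothetical helper, not verl's reward manager):

```python
def overlong_penalty(response_len, max_len, buffer_len, penalty_factor):
    """Soft overlong punishment: zero inside the safe zone, then a linear
    ramp down to -penalty_factor across the last `buffer_len` tokens."""
    safe_len = max_len - buffer_len
    if response_len <= safe_len:
        return 0.0
    if response_len <= max_len:
        return -(response_len - safe_len) / buffer_len * penalty_factor
    return -penalty_factor
```

With the values above (`max_response_length=20480`, `overlong_buffer_len=4096`, `overlong_penalty_factor=1.0`), a response of 18432 tokens is halfway into the buffer and would be penalized by 0.5.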
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b_fsdp2_4k_npu.sh ADDED
@@ -0,0 +1,155 @@
1
+ #!/usr/bin/env bash
2
+ set -xeuo pipefail
3
+
4
+ export VLLM_USE_V1=1
5
+ export HCCL_OP_EXPANSION_MODE="AIV"
6
+ export VLLM_ASCEND_ENABLE_FLASHCOMM=1
7
+ export HCCL_EXEC_TIMEOUT=3600
8
+ export HCCL_CONNECT_TIMEOUT=3600
9
+
10
+ project_name='DAPO'
11
+ exp_name='DAPO-Qwen2.5-32B'
12
+
13
+ adv_estimator=grpo
14
+
15
+ use_kl_in_reward=False
16
+ kl_coef=0.0
17
+ use_kl_loss=False
18
+ kl_loss_coef=0.0
19
+
20
+ clip_ratio_low=0.2
21
+ clip_ratio_high=0.28
22
+
23
+ max_prompt_length=$((1024 * 2))
24
+ max_response_length=$((1024 * 4))
25
+ min_response_length=$((1024 * 4))
26
+ enable_overlong_buffer=True
27
+ overlong_buffer_len=$((1024 * 4))
28
+ overlong_penalty_factor=1.0
29
+
30
+ loss_agg_mode="token-mean"
31
+
32
+ enable_filter_groups=False
33
+ filter_groups_metric=acc
34
+ max_num_gen_batches=10
35
+ train_prompt_bsz=32
36
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
37
+ n_resp_per_prompt=16
38
+ train_prompt_mini_bsz=32
39
+
40
+ # Ray
41
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
42
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
43
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
44
+ NNODES=${NNODES:-1}
45
+ # Paths
46
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
47
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-32B"}
48
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
49
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
50
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
51
+
52
+ # Algorithm
53
+ temperature=1.0
54
+ top_p=1.0
55
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
56
+ val_top_p=0.7
57
+
58
+ # Performance Related Parameter
59
+ sp_size=8
60
+ use_dynamic_bsz=True
61
+ actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
62
+ infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
63
+ offload=True
64
+ gen_tp=4
65
+ gen_dp=1
66
+ enable_chunked_prefill=True
67
+
68
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
69
+ --working-dir "${WORKING_DIR}" \
70
+ --address "${RAY_ADDRESS}" \
71
+ -- python3 -m recipe.dapo.main_dapo \
72
+ data.train_files="${TRAIN_FILE}" \
73
+ data.val_files="${TEST_FILE}" \
74
+ data.prompt_key=prompt \
75
+ data.truncation='left' \
76
+ data.max_prompt_length=${max_prompt_length} \
77
+ data.max_response_length=${max_response_length} \
78
+ +data.min_response_length=${min_response_length} \
79
+ data.gen_batch_size=${gen_prompt_bsz} \
80
+ data.train_batch_size=${train_prompt_bsz} \
81
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
82
+ algorithm.adv_estimator=${adv_estimator} \
83
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
84
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
85
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
86
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
87
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
88
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
89
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
90
+ algorithm.filter_groups.enable=${enable_filter_groups} \
91
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
92
+ algorithm.filter_groups.metric=${filter_groups_metric} \
93
+ actor_rollout_ref.model.use_remove_padding=True \
94
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
95
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
96
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
97
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
98
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
99
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
100
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
101
+ +actor_rollout_ref.model.override_config.attention_dropout=0. \
102
+ +actor_rollout_ref.model.override_config.embd_pdrop=0. \
103
+ +actor_rollout_ref.model.override_config.resid_pdrop=0. \
104
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
105
+ actor_rollout_ref.actor.optim.lr=1e-6 \
106
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
107
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
108
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
109
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
110
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
111
+ actor_rollout_ref.actor.entropy_coeff=0 \
112
+ actor_rollout_ref.actor.grad_clip=1.0 \
113
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
114
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
115
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.60 \
116
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
117
+ actor_rollout_ref.rollout.data_parallel_size=${gen_dp} \
118
+ actor_rollout_ref.rollout.enable_chunked_prefill=${enable_chunked_prefill} \
119
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
120
+ actor_rollout_ref.rollout.temperature=${temperature} \
121
+ actor_rollout_ref.rollout.top_p=${top_p} \
122
+ actor_rollout_ref.rollout.top_k="${top_k}" \
123
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
124
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
125
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
126
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
127
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
128
+ actor_rollout_ref.rollout.name=vllm \
129
+ actor_rollout_ref.rollout.enforce_eager=True \
130
+ actor_rollout_ref.rollout.ignore_eos=True \
131
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
132
+ actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
133
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
134
+ actor_rollout_ref.actor.strategy=fsdp2 \
135
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
136
+ actor_rollout_ref.rollout.expert_parallel_size=1 \
137
+ reward_model.reward_manager=dapo \
138
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
139
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
140
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
141
+ trainer.logger='["console"]' \
142
+ trainer.project_name="${project_name}" \
143
+ trainer.experiment_name="${exp_name}" \
144
+ trainer.n_gpus_per_node=16 \
145
+ trainer.nnodes="${NNODES}" \
146
+ trainer.val_before_train=False \
147
+ trainer.test_freq=100 \
148
+ trainer.save_freq=100 \
149
+ trainer.total_epochs=100 \
150
+ trainer.default_local_dir="${CKPTS_DIR}" \
151
+ trainer.resume_mode=auto \
152
+ trainer.device='npu' \
153
+ actor_rollout_ref.actor.use_torch_compile=False \
154
+ actor_rollout_ref.ref.use_torch_compile=False "$@"
155
+
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b_npu.sh ADDED
@@ -0,0 +1,140 @@
1
+ #!/usr/bin/env bash
2
+ set -xeuo pipefail
3
+
4
+ project_name='DAPO-Qwen2.5-32B'
5
+ exp_name='Qwen2.5-32B-npu-32rank-gbs128'
6
+
7
+ adv_estimator=grpo
8
+
9
+ use_kl_in_reward=False
10
+ kl_coef=0.0
11
+ use_kl_loss=False
12
+ kl_loss_coef=0.0
13
+ clip_ratio_low=0.2
14
+ clip_ratio_high=0.28
15
+ max_prompt_length=$((1024 * 2))
16
+ max_response_length=$((1024 * 20))
17
+ enable_overlong_buffer=True
18
+ overlong_buffer_len=$((1024 * 4))
19
+ overlong_penalty_factor=1.0
20
+ loss_agg_mode="token-mean"
21
+ enable_filter_groups=True
22
+ filter_groups_metric=acc
23
+ max_num_gen_batches=10
24
+
25
+ NNODES=2
26
+
27
+ train_prompt_bsz=128
28
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
29
+ n_resp_per_prompt=16
30
+ train_prompt_mini_bsz=32
31
+
32
+ # Ray
33
+ PWD=${PWD:-"./"}
34
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
35
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
36
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
37
+
38
+ # Paths
39
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
40
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-32B"}
41
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
42
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
43
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
44
+
45
+ # Algorithm
46
+ temperature=1.0
47
+ top_p=1.0
48
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
49
+ val_top_p=0.7
50
+
51
+ # Performance Related Parameter
52
+ sp_size=8
53
+ use_dynamic_bsz=True
54
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
55
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
56
+ offload=True
57
+ gen_tp=4
58
+ enable_chunked_prefill=True
59
+
60
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
61
+ --working-dir "${WORKING_DIR}" \
62
+ --address "${RAY_ADDRESS}" \
63
+ -- python3 -m recipe.dapo.main_dapo \
64
+ data.train_files="${TRAIN_FILE}" \
65
+ data.val_files="${TEST_FILE}" \
66
+ data.prompt_key=prompt \
67
+ data.truncation='left' \
68
+ data.max_prompt_length=${max_prompt_length} \
69
+ data.max_response_length=${max_response_length} \
70
+ data.gen_batch_size=${gen_prompt_bsz} \
71
+ data.train_batch_size=${train_prompt_bsz} \
72
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
73
+ algorithm.adv_estimator=${adv_estimator} \
74
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
75
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
76
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
77
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
78
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
79
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
80
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
81
+ algorithm.filter_groups.enable=${enable_filter_groups} \
82
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
83
+ algorithm.filter_groups.metric=${filter_groups_metric} \
84
+ actor_rollout_ref.actor.use_torch_compile=False \
85
+ actor_rollout_ref.ref.use_torch_compile=False \
86
+ actor_rollout_ref.model.use_remove_padding=True \
87
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
88
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
89
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
90
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
91
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
92
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
93
+ actor_rollout_ref.rollout.name=vllm \
94
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
95
+ +actor_rollout_ref.model.override_config.attention_dropout=0. \
96
+ +actor_rollout_ref.model.override_config.embd_pdrop=0. \
97
+ +actor_rollout_ref.model.override_config.resid_pdrop=0. \
98
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
99
+ actor_rollout_ref.actor.optim.lr=1e-6 \
100
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
101
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
102
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
103
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
104
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
105
+ actor_rollout_ref.actor.entropy_coeff=0 \
106
+ actor_rollout_ref.actor.grad_clip=1.0 \
107
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
108
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
109
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.90 \
110
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
111
+ actor_rollout_ref.rollout.enable_chunked_prefill=${enable_chunked_prefill} \
112
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
113
+ actor_rollout_ref.rollout.temperature=${temperature} \
114
+ actor_rollout_ref.rollout.top_p=${top_p} \
115
+ actor_rollout_ref.rollout.top_k="${top_k}" \
116
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
117
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
118
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
119
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
120
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
121
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
122
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
123
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
124
+ reward_model.reward_manager=dapo \
125
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
126
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
127
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
128
+ trainer.logger="['console','wandb']" \
129
+ trainer.project_name="${project_name}" \
130
+ trainer.experiment_name="${exp_name}" \
131
+ trainer.n_gpus_per_node=16 \
132
+ trainer.nnodes="${NNODES}" \
133
+ trainer.val_before_train=True \
134
+ trainer.test_freq=5 \
135
+ trainer.save_freq=20 \
136
+ trainer.total_epochs=1 \
137
+ trainer.default_local_dir="${CKPTS_DIR}" \
138
+ trainer.resume_mode=auto \
139
+ actor_rollout_ref.actor.fsdp_config.forward_prefetch=True \
140
+ actor_rollout_ref.ref.fsdp_config.forward_prefetch=True
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_32b_rollout_corr.sh ADDED
@@ -0,0 +1,176 @@
1
+ #!/usr/bin/env bash
2
+ set -xeuo pipefail
3
+
4
+ # Rollout Correction Example
5
+ # References:
6
+ # - Rollout Correction Docs: https://github.com/volcengine/verl/blob/main/docs/algo/rollout_corr.md
7
+ # - Rollout Correction Math: https://github.com/volcengine/verl/blob/main/docs/algo/rollout_corr_math.md
8
+ # - When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch: https://richardli.xyz/rl-collapse
9
+ # - Off-policy RL: https://fengyao.notion.site/off-policy-rl
10
+
11
+ project_name='DAPO'
12
+ exp_name='DAPO-Qwen2.5-32B-RolloutCorr' # Rollout Correction
13
+
14
+ adv_estimator=grpo
15
+
16
+ use_kl_in_reward=False
17
+ kl_coef=0.0
18
+ use_kl_loss=False
19
+ kl_loss_coef=0.0
20
+
21
+ # Rollout Correction parameters (sequence-level TIS + geometric RS)
22
+ rollout_is=sequence
23
+ rollout_is_threshold=2.0
24
+ rollout_is_batch_normalize=true
25
+ rollout_rs=geometric
26
+ rollout_rs_threshold=1.01
27
+ rollout_rs_threshold_lower=0.99
28
+ rollout_token_veto_threshold=1e-4
29
+
30
+ clip_ratio_low=0.2
31
+ clip_ratio_high=0.28
32
+
33
+ max_prompt_length=$((1024 * 2))
34
+ max_response_length=$((1024 * 20))
35
+ enable_overlong_buffer=True
36
+ overlong_buffer_len=$((1024 * 4))
37
+ overlong_penalty_factor=1.0
38
+
39
+ loss_agg_mode="token-mean"
40
+
41
+ enable_filter_groups=True
42
+ filter_groups_metric=acc
43
+ max_num_gen_batches=10
44
+ train_prompt_bsz=512
45
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
46
+ n_resp_per_prompt=16
47
+ train_prompt_mini_bsz=32
48
+
49
+ # Ray
50
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
51
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
52
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
53
+ NNODES=${NNODES:-16}
54
+ # Paths
55
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
56
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-32B"}
57
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
58
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
59
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
60
+
61
+ # Algorithm
62
+ temperature=1.0
63
+ top_p=1.0
64
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
65
+ val_top_p=0.7
66
+
67
+ # Performance Related Parameter
68
+ sp_size=8
69
+ use_dynamic_bsz=True
70
+ actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
71
+ infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
72
+ offload=True
73
+ gen_tp=4
74
+
75
+
76
+ # Rollout Correction (corrects distribution mismatch between rollout and training)
77
+ #
78
+ # Configuration: DAPO with Rollout Correction:
79
+ # - Self-normalized sequence-level TIS (Truncated Importance Sampling)
80
+ # - Geometric rejection sampling for outlier filtering
81
+ # - Token veto for catastrophic distribution shifts
82
+ #
83
+ # Note that server mode (agent loop) does not yet return rollout_log_probs,
84
+ # so server mode is currently not supported with Rollout Correction.
85
+ #
86
+ # Rollout Correction parameters (configured at top of script):
87
+ # algorithm.rollout_correction.rollout_is=sequence
88
+ # algorithm.rollout_correction.rollout_is_threshold=2.0
89
+ # algorithm.rollout_correction.rollout_is_batch_normalize=true
90
+ # algorithm.rollout_correction.rollout_rs=geometric
91
+ # algorithm.rollout_correction.rollout_rs_threshold=1.01
92
+ # algorithm.rollout_correction.rollout_rs_threshold_lower=0.99
93
+ # algorithm.rollout_correction.rollout_token_veto_threshold=1e-4
94
+ # actor_rollout_ref.rollout.calculate_log_probs=True # Required!
95
+
96
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
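The three mechanisms named in the comments above can be sketched in a few lines. These toy functions illustrate the math only; verl's actual implementation lives behind `algorithm.rollout_correction.*`, and the function names here are hypothetical:

```python
import math

def sequence_is_weight(train_logps, rollout_logps, threshold=2.0):
    """Sequence-level truncated importance sampling (TIS): the ratio
    exp(sum(log pi_train - log pi_rollout)), truncated at `threshold`."""
    log_ratio = sum(t - r for t, r in zip(train_logps, rollout_logps))
    return min(math.exp(log_ratio), threshold)

def geometric_rs_keep(train_logps, rollout_logps, hi=1.01, lo=0.99):
    """Geometric rejection sampling: keep a sequence only if the geometric
    mean of its per-token ratios falls inside [lo, hi]."""
    n = len(train_logps)
    geo = math.exp(sum(t - r for t, r in zip(train_logps, rollout_logps)) / n)
    return lo <= geo <= hi

def token_veto(train_logps, rollout_logps, veto=1e-4):
    """Token veto: drop the sequence if any single token's ratio
    pi_train/pi_rollout falls below `veto` (catastrophic mismatch)."""
    return all(math.exp(t - r) >= veto
               for t, r in zip(train_logps, rollout_logps))
```

The thresholds set at the top of this script (`rollout_is_threshold=2.0`, the `[0.99, 1.01]` geometric band, `rollout_token_veto_threshold=1e-4`) plug into these arguments directly.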
97
+ --working-dir "${WORKING_DIR}" \
98
+ -- python3 -m recipe.dapo.main_dapo \
99
+ data.train_files="${TRAIN_FILE}" \
100
+ data.val_files="${TEST_FILE}" \
101
+ data.prompt_key=prompt \
102
+ data.truncation='left' \
103
+ data.max_prompt_length=${max_prompt_length} \
104
+ data.max_response_length=${max_response_length} \
105
+ data.gen_batch_size=${gen_prompt_bsz} \
106
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ algorithm.rollout_correction.rollout_is=${rollout_is} \
+ algorithm.rollout_correction.rollout_is_threshold=${rollout_is_threshold} \
+ algorithm.rollout_correction.rollout_is_batch_normalize=${rollout_is_batch_normalize} \
+ algorithm.rollout_correction.rollout_rs=${rollout_rs} \
+ algorithm.rollout_correction.rollout_rs_threshold=${rollout_rs_threshold} \
+ algorithm.rollout_correction.rollout_rs_threshold_lower=${rollout_rs_threshold_lower} \
+ algorithm.rollout_correction.rollout_token_veto_threshold=${rollout_token_veto_threshold} \
+ actor_rollout_ref.rollout.calculate_log_probs=True \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger='["console","wandb"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=True \
+ trainer.test_freq=5 \
+ trainer.save_freq=5 \
+ trainer.total_epochs=1 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto
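The scripts in this recipe derive their per-device PPO token budgets from the prompt/response lengths and the Ulysses sequence-parallel degree. A minimal shell sketch of that arithmetic (the concrete values below mirror the 2k-prompt/20k-response configs and are illustrative only):

```shell
# Illustrative values matching the 20k-response configs (assumptions, not a new recipe).
max_prompt_length=$((1024 * 2))       # 2048 tokens
max_response_length=$((1024 * 20))    # 20480 tokens
sp_size=4                             # Ulysses sequence-parallel degree

# Each sequence-parallel rank holds only 1/sp_size of a sequence's tokens,
# so the per-GPU PPO token budget is the full sequence length divided by sp_size.
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
echo "${actor_ppo_max_token_len}"     # 22528 / 4 = 5632
```

This is why scripts with `sp_size=1` instead multiply the sequence length (they pack several full sequences per micro-batch) rather than divide it.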
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen2.5_7b_npu.sh ADDED
@@ -0,0 +1,142 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ project_name='DAPO-Qwen2.5-7B-Instruct'
+ exp_name='DAPO-Qwen2.5-7B-Instruct'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 20))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+ loss_agg_mode="token-mean"
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=10
+
+ NNODES=1
+
+ train_prompt_bsz=16
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=1
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-7B-Instruct"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+
+ # Performance-related parameters
+ sp_size=4
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ offload=True
+ gen_tp=1
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ --address "${RAY_ADDRESS}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.actor.use_torch_compile=False \
+ actor_rollout_ref.ref.use_torch_compile=False \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ +actor_rollout_ref.model.override_config.attention_dropout=0. \
+ +actor_rollout_ref.model.override_config.embd_pdrop=0. \
+ +actor_rollout_ref.model.override_config.resid_pdrop=0. \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.50 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger="['console']" \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=16 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=True \
+ trainer.test_freq=5 \
+ trainer.save_freq=20 \
+ trainer.total_epochs=1 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ actor_rollout_ref.actor.entropy_checkpointing=True \
+ actor_rollout_ref.ref.entropy_checkpointing=True \
+ actor_rollout_ref.actor.fsdp_config.forward_prefetch=True \
+ actor_rollout_ref.ref.fsdp_config.forward_prefetch=True \
+ actor_rollout_ref.actor.entropy_from_logits_with_chunking=True \
+ actor_rollout_ref.ref.entropy_from_logits_with_chunking=True
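Because `algorithm.filter_groups` discards prompts whose response groups are all-correct or all-wrong, the script over-samples prompts (`gen_prompt_bsz` is 3x `train_prompt_bsz`) so enough groups survive filtering. A small sketch of the resulting rollout budget per generation attempt, using the 7B config's values (illustrative arithmetic only):

```shell
# Illustrative rollout budget for the 7B config (values copied from the script).
train_prompt_bsz=16
gen_prompt_bsz=$((train_prompt_bsz * 3))   # 48: over-sample prompts for group filtering
n_resp_per_prompt=16                       # GRPO group size

# Responses sampled per generation attempt; up to max_num_gen_batches attempts
# are made until train_prompt_bsz surviving prompt groups are collected.
rollouts_per_gen=$((gen_prompt_bsz * n_resp_per_prompt))
echo "${rollouts_per_gen}"                 # 48 * 16 = 768
```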
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_14b_base_npu.sh ADDED
@@ -0,0 +1,139 @@
+ #!/bin/bash
+ project_name='DAPO'
+ exp_name='DAPO-Qwen3-14B-Base'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 20))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=False
+ filter_groups_metric=acc
+ max_num_gen_batches=10
+ train_prompt_bsz=16
+ gen_prompt_bsz=$((train_prompt_bsz * 2))
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=1
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-2}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-14B-Base"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+
+ # Performance-related parameters
+ sp_size=2
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) / sp_size))
+ offload=True
+ gen_tp=2
+
+ ray job submit --runtime-env="${RUNTIME_ENV}" \
+ --address "${RAY_ADDRESS}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ +actor_rollout_ref.model.override_config.attention_dropout=0. \
+ +actor_rollout_ref.model.override_config.embd_pdrop=0. \
+ +actor_rollout_ref.model.override_config.resid_pdrop=0. \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=False \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=8 \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger=['console'] \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=16 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=False \
+ trainer.test_freq=10 \
+ trainer.save_freq=20 \
+ trainer.total_epochs=1 \
+ trainer.total_training_steps=100 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ data.shuffle=False \
+ actor_rollout_ref.actor.use_torch_compile=False \
+ actor_rollout_ref.ref.use_torch_compile=False \
+ actor_rollout_ref.actor.entropy_checkpointing=True \
+ actor_rollout_ref.ref.entropy_checkpointing=True \
+ actor_rollout_ref.actor.fsdp_config.forward_prefetch=True \
+ actor_rollout_ref.ref.fsdp_config.forward_prefetch=True
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_30b_fsdp_6k_npu.sh ADDED
@@ -0,0 +1,161 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ export VLLM_USE_V1=1
+ export HCCL_OP_EXPANSION_MODE="AIV"
+ export VLLM_ASCEND_ENABLE_FLASHCOMM=1
+ export HCCL_EXEC_TIMEOUT=3600
+ export HCCL_CONNECT_TIMEOUT=3600
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen3-30B'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 6))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=10
+ train_prompt_bsz=32
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
+ n_resp_per_prompt=16
+ max_num_seqs=1024
+ train_prompt_mini_bsz=32
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-2}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-30B"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance-related parameters
+ sp_size=1
+ use_dynamic_bsz=True
+ log_prob_micro_batch_size_per_gpu=1
+ ppo_micro_batch_size_per_gpu=1
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
+ max_num_batched_tokens=$(((max_prompt_length + max_response_length) * 4))
+ offload=True
+ gen_tp=2
+ gen_dp=1
+ enable_chunked_prefill=True
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ --address "${RAY_ADDRESS}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ actor_rollout_ref.rollout.max_num_seqs=${max_num_seqs} \
+ actor_rollout_ref.rollout.max_num_batched_tokens=${max_num_batched_tokens} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ +actor_rollout_ref.model.override_config.attention_dropout=0. \
+ +actor_rollout_ref.model.override_config.embd_pdrop=0. \
+ +actor_rollout_ref.model.override_config.resid_pdrop=0. \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.data_parallel_size=${gen_dp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=${enable_chunked_prefill} \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.rollout.enforce_eager=False \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.strategy=fsdp \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
+ actor_rollout_ref.rollout.free_cache_engine=True \
+ actor_rollout_ref.rollout.expert_parallel_size=1 \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger='["console"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=16 \
+ trainer.nnodes="${NNODES}" \
+ trainer.device='npu' \
+ trainer.val_before_train=False \
+ trainer.test_freq=200 \
+ trainer.save_freq=50 \
+ trainer.total_epochs=100 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${ppo_micro_batch_size_per_gpu} \
+ ++actor_rollout_ref.nccl_timeout=7200 \
+ actor_rollout_ref.actor.use_torch_compile=False \
+ actor_rollout_ref.ref.use_torch_compile=False "$@"
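Unlike the sequence-parallel configs, this 6k-response script sets `sp_size=1` and instead sizes its budgets as multiples of the full sequence length: roughly two packed sequences per PPO micro-batch and four sequences' worth of tokens per vLLM prefill batch. A shell sketch of that sizing (values copied from the script, arithmetic illustrative):

```shell
# Illustrative budgets for the 6k-response NPU config (sp_size=1).
max_prompt_length=$((1024 * 2))    # 2048
max_response_length=$((1024 * 6))  # 6144
seq_len=$((max_prompt_length + max_response_length))

# ~2 full sequences packed per PPO micro-batch per device,
# and up to 4 sequences' worth of tokens batched by vLLM at prefill.
actor_ppo_max_token_len=$((seq_len * 2))
max_num_batched_tokens=$((seq_len * 4))
echo "${actor_ppo_max_token_len} ${max_num_batched_tokens}"   # 16384 32768
```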
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_moe_30b_base_fsdp_npu.sh ADDED
@@ -0,0 +1,143 @@
+ #!/usr/bin/env bash
+ set -euxo pipefail
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen3-MOE-30B-FSDP-128rank-gbs512'
+
+ NNODES=8
+ NPUS_PER_NODE=16
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 20))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+ loss_agg_mode="token-mean"
+ ppo_mini_batch_size=32
+
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=10
+ train_prompt_bsz=512
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
+ n_resp_per_prompt=16
+
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-30B-A3B-Base"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance-related parameters
+ sp_size=16 # For load balancing; on smaller clusters this can be set as low as 2.
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) / 2))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) / 2))
+ offload=True
+ recompute=True
+ max_num_seqs=128
+ gen_tp=2
+
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ actor_rollout_ref.rollout.max_num_seqs=${max_num_seqs} \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ +actor_rollout_ref.model.override_config.attention_dropout=0. \
+ +actor_rollout_ref.model.override_config.embd_pdrop=0. \
+ +actor_rollout_ref.model.override_config.resid_pdrop=0. \
+ actor_rollout_ref.model.enable_gradient_checkpointing=${recompute} \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.forward_prefetch=False \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
+ actor_rollout_ref.ref.fsdp_config.forward_prefetch=False \
+ actor_rollout_ref.rollout.enforce_eager=False \
+ actor_rollout_ref.rollout.free_cache_engine=True \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger=['console','wandb'] \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node="${NPUS_PER_NODE}" \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=False \
+ trainer.test_freq=5 \
+ trainer.save_freq=-1 \
+ trainer.total_epochs=1 \
+ actor_rollout_ref.actor.use_torch_compile=False \
+ actor_rollout_ref.ref.use_torch_compile=False
+
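The "128rank-gbs512" experiment name above encodes the per-rank load implied by the batch settings. A quick sketch of that math (values copied from the script; illustrative only):

```shell
# Illustrative load math for the 128-rank MoE config.
NNODES=8
NPUS_PER_NODE=16
train_prompt_bsz=512      # global batch size in prompts (gbs512)
n_resp_per_prompt=16      # GRPO group size

world_size=$((NNODES * NPUS_PER_NODE))                       # 128 ranks
rollouts_per_step=$((train_prompt_bsz * n_resp_per_prompt))  # 8192 responses
echo "$((rollouts_per_step / world_size))"                   # 64 rollouts per rank
```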
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_moe_30b_megatron_npu.sh ADDED
@@ -0,0 +1,170 @@
+ #!/bin/bash
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen3-30B-megatron'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 20))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=10
+ train_prompt_bsz=16
+ gen_prompt_bsz=$((train_prompt_bsz * 2))
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=2
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-1}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-30B-A3B"}
+ # MCORE_MODEL_PATH points to the converted checkpoint.
+ # To avoid loading these weights, set actor_rollout_ref.actor.megatron.use_dist_checkpointing=False.
+ MCORE_MODEL_PATH=${MCORE_MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-30B-A3B-dist_ckpt"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+
+ # Performance-related parameters
+ sp_size=8
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length)))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length)))
+ offload=True
+
+ max_num_batched_tokens=$((max_prompt_length + max_response_length))
+
+ # vLLM
+ gen_tp=4
+
+ # Megatron backend
+ train_tp=4
+ train_ep=2
+ train_pp=2
+ train_cp=1
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --address "${RAY_ADDRESS}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ --config-name="dapo_megatron_trainer" \
+ data.filter_overlong_prompts=False \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.shuffle=False \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
90
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
91
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
92
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
93
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
94
+ actor_rollout_ref.actor.ppo_epochs=1 \
95
+ algorithm.filter_groups.enable=${enable_filter_groups} \
96
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
97
+ algorithm.filter_groups.metric=${filter_groups_metric} \
98
+ actor_rollout_ref.model.use_remove_padding=True \
99
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
100
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
101
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
102
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
103
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
104
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
105
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
106
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
107
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
108
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
109
+ +actor_rollout_ref.model.override_config.attention_dropout=0. \
110
+ +actor_rollout_ref.model.override_config.embd_pdrop=0. \
111
+ +actor_rollout_ref.model.override_config.resid_pdrop=0. \
112
+ actor_rollout_ref.actor.optim.lr=1e-6 \
113
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
114
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
115
+ actor_rollout_ref.actor.megatron.param_offload=${offload} \
116
+ actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
117
+ actor_rollout_ref.actor.megatron.grad_offload=${offload} \
118
+ actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
119
+ actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
120
+ actor_rollout_ref.actor.megatron.expert_model_parallel_size=${train_ep} \
121
+ actor_rollout_ref.actor.megatron.context_parallel_size=${train_cp} \
122
+ actor_rollout_ref.actor.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH} \
123
+ actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
124
+ actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} \
125
+ actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} \
126
+ actor_rollout_ref.ref.megatron.expert_model_parallel_size=${train_ep} \
127
+ actor_rollout_ref.ref.megatron.context_parallel_size=${train_cp} \
128
+ actor_rollout_ref.ref.megatron.param_offload=${offload} \
129
+ actor_rollout_ref.ref.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH} \
130
+ actor_rollout_ref.actor.entropy_coeff=0 \
131
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
132
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
133
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
134
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
135
+ actor_rollout_ref.rollout.enable_prefix_caching=False \
136
+ actor_rollout_ref.rollout.max_num_batched_tokens=${max_num_batched_tokens} \
137
+ actor_rollout_ref.rollout.max_model_len=$((max_prompt_length + max_response_length)) \
138
+ actor_rollout_ref.rollout.temperature=${temperature} \
139
+ actor_rollout_ref.rollout.top_p=${top_p} \
140
+ actor_rollout_ref.rollout.top_k=${top_k} \
141
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
142
+ actor_rollout_ref.rollout.val_kwargs.top_p=${top_p} \
143
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
144
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
145
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
146
+ actor_rollout_ref.rollout.enforce_eager=True \
147
+ actor_rollout_ref.rollout.free_cache_engine=True \
148
+ actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
149
+ reward_model.reward_manager=dapo \
150
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
151
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
152
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
153
+ trainer.logger=['console'] \
154
+ trainer.project_name="${project_name}" \
155
+ trainer.experiment_name="${exp_name}" \
156
+ trainer.n_gpus_per_node=16 \
157
+ trainer.nnodes="${NNODES}" \
158
+ trainer.val_before_train=False \
159
+ trainer.test_freq=-1 \
160
+ trainer.save_freq=-1 \
161
+ trainer.total_epochs=1 \
162
+ trainer.default_local_dir="${CKPTS_DIR}" \
163
+ actor_rollout_ref.nccl_timeout=14400 \
164
+ actor_rollout_ref.actor.use_torch_compile=False \
165
+ actor_rollout_ref.ref.use_torch_compile=False \
166
+ +actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True \
167
+ +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform \
168
+ +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full \
169
+ +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
170
+
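`algorithm.filter_groups.enable=True` with `metric=acc` is DAPO's dynamic sampling: the trainer oversamples (`gen_prompt_bsz` is 2x `train_prompt_bsz` here) and keeps only prompts whose `n_resp_per_prompt` rollouts disagree, regenerating up to `max_num_gen_batches` times until the training batch is full. A toy sketch of the keep rule (the `keep_group` helper is illustrative, not verl code):

```shell
#!/usr/bin/env bash
# DAPO dynamic-sampling keep rule: a prompt group contributes gradient only if
# its rollouts disagree, i.e. 0 < num_correct < n_resp_per_prompt.
# All-correct or all-wrong groups have zero GRPO advantage and are dropped.
n_resp_per_prompt=16

keep_group() {
  local num_correct=$1
  if (( num_correct > 0 && num_correct < n_resp_per_prompt )); then
    echo keep
  else
    echo drop
  fi
}

keep_group 0    # all wrong  -> drop
keep_group 16   # all right  -> drop
keep_group 7    # mixed      -> keep
```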
ICL/DAPO/verl-recipe/dapo/run_dapo_qwen3_moe_30b_vllm_fp8_rollout.sh ADDED
@@ -0,0 +1,171 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ project_name='DAPO-FP8-ROLLOUT'
+ exp_name='DAPO-Qwen3-MOE-30B-VLLM-FP8-ROLLOUT'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ # Rollout Correction parameters for FP8 rollout
+ rollout_is=token
+ rollout_is_threshold=2.0
+ rollout_rs=null
+ rollout_rs_threshold=null
+ rollout_rs_threshold_lower=null
+ rollout_token_veto_threshold=null
+
+ max_prompt_length=$((1024))
+ max_response_length=$((1024 * 20))
+ enable_overlong_buffer=True
+ overlong_buffer_len=512
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=10
+ train_prompt_bsz=32
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+ gen_prompt_bsz=96
+
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ echo "WORKING_DIR: ${WORKING_DIR}"
+ # For vllm 0.11.x, DEEP_GEMM is enabled by default.
+ # For vllm 0.10.x, please set VLLM_USE_DEEP_GEMM=1 in runtime_env.yaml
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ echo "RUNTIME_ENV: ${RUNTIME_ENV}"
+ NNODES=${NNODES:-2}
+ echo "NNODES: ${NNODES}"
+
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH="Qwen/Qwen3-30B-A3B-Base"
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=1.0
+
+ # Performance Related Parameter
+ sp_size=4
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
+ infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
+ offload=true
+ gen_tp=1
+ train_tp=1
+ train_pp=1
+
+ # Set Flash-RL environment variables
+ export VERL_LOGGING_LEVEL=DEBUG
+ export VLLM_LOGGING_LEVEL=DEBUG
+ export VLLM_CONFIGURE_LOGGING=1
+ export VLLM_USE_V1=1
+ export VLLM_USE_DEEP_GEMM=1
+ export TORCH_NCCL_AVOID_RECORD_STREAMS=1
+
+ RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --runtime-env=${RUNTIME_ENV} \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.return_raw_chat=True \
+ data.filter_overlong_prompts=True \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.train_batch_size=${train_prompt_bsz} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ actor_rollout_ref.nccl_timeout=1800 \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ algorithm.rollout_correction.rollout_is=${rollout_is} \
+ algorithm.rollout_correction.rollout_is_threshold=${rollout_is_threshold} \
+ algorithm.rollout_correction.rollout_rs=${rollout_rs} \
+ algorithm.rollout_correction.rollout_rs_threshold=${rollout_rs_threshold} \
+ algorithm.rollout_correction.rollout_rs_threshold_lower=${rollout_rs_threshold_lower} \
+ algorithm.rollout_correction.rollout_token_veto_threshold=${rollout_token_veto_threshold} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.optim.clip_grad=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$(( 1024 * 32 )) \
+ actor_rollout_ref.rollout.max_num_seqs=256 \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.temperature=0.6 \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ +actor_rollout_ref.rollout.quantization=fp8 \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.rollout.mode=async \
+ actor_rollout_ref.rollout.calculate_log_probs=True \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ reward_model.overlong_buffer.log=False \
+ trainer.logger='["console","wandb"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=False \
+ trainer.test_freq=5 \
+ trainer.save_freq=5 \
+ trainer.total_epochs=100 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ trainer.log_val_generations=1 \
+ trainer.total_training_steps=500 \
+ trainer.max_actor_ckpt_to_keep=5 \
+ actor_rollout_ref.rollout.enforce_eager=False
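`rollout_is=token` with `rollout_is_threshold=2.0` corrects for the mismatch between the FP8 rollout policy and the BF16 training policy using a truncated token-level importance weight. A sketch of the truncated ratio, using awk for float math (the exact correction verl applies may differ; `is_weight` is an illustrative helper):

```shell
#!/usr/bin/env bash
# Truncated importance-sampling weight: min(exp(logp_train - logp_rollout), threshold).
# Truncation bounds the variance introduced by the quantized rollout policy.
is_weight() {
  awk -v lt="$1" -v lr="$2" -v thr="$3" 'BEGIN {
    w = exp(lt - lr)
    if (w > thr) w = thr
    printf "%.3f\n", w
  }'
}

is_weight -1.0 -1.0 2.0   # identical logprobs -> 1.000
is_weight -0.5 -2.0 2.0   # exp(1.5) ~ 4.48, truncated -> 2.000
```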
ICL/DAPO/verl-recipe/dapo/run_dapo_wo_ds_qwen2.5_32b.sh ADDED
@@ -0,0 +1,126 @@
+ #!/usr/bin/env bash
+ set -euxo pipefail
+ # DAPO (w/o Dynamic Sampling)
+
+ project_name='DAPO-verl'
+ exp_name='DAPO-wo-DS-Qwen2.5-32B'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 20))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=False
+ train_prompt_bsz=512
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-16}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-32B"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance Related Parameter
+ sp_size=8
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
+ infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
+ offload=True
+ gen_tp=4
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger='["console","wandb"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=True \
+ trainer.test_freq=5 \
+ trainer.save_freq=5 \
+ trainer.total_epochs=1 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto
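`use_dynamic_bsz=True` replaces fixed micro-batch counts with token-budget packing: sequences are grouped so each micro-batch stays under `ppo_max_token_len_per_gpu` (here `max_prompt_length + max_response_length` = 22528 tokens). A toy first-fit sketch of the idea (illustrative only; verl's actual balancing across GPUs is more sophisticated):

```shell
#!/usr/bin/env bash
# Count micro-batches needed to pack sequence lengths under a per-GPU token
# budget, filling each micro-batch greedily in order (first-fit).
pack_count() {
  local budget=$1; shift
  local batches=0 used=0 len
  for len in "$@"; do
    if (( used + len > budget )); then
      batches=$((batches + 1))
      used=0
    fi
    used=$((used + len))
  done
  if (( used > 0 )); then batches=$((batches + 1)); fi
  echo "$batches"
}

pack_count 22528 20000 1500 800 9000 12000   # -> 2 micro-batches
```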
ICL/DAPO/verl-recipe/dapo/runtime_env.yaml ADDED
@@ -0,0 +1,5 @@
+ working_dir: ./
+ excludes: ["/.git/"]
+ env_vars:
+   TORCH_NCCL_AVOID_RECORD_STREAMS: "1"
+   VLLM_USE_V1: "1"
@@ -0,0 +1,131 @@
 
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen2.5-7B-Math-Test'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 2))
+ enable_overlong_buffer=True
+ overlong_buffer_len=512
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ enable_filter_groups=True
+ filter_groups_metric=acc
+ max_num_gen_batches=10
+ train_prompt_bsz=512
+ gen_prompt_bsz=$((train_prompt_bsz * 3))
+ train_prompt_mini_bsz=32
+ n_resp_per_prompt=16
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-4}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-Math-7B"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+
+ # Mathematically equivalent
+ use_dynamic_bsz=True
+ infer_micro_batch_size=null
+ train_micro_batch_size=null
+ offload=False
+
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ --working-dir "${WORKING_DIR}" \
+ -- python3 -m recipe.dapo.main_dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.gen_batch_size=${gen_prompt_bsz} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ algorithm.filter_groups.enable=${enable_filter_groups} \
+ algorithm.filter_groups.metric=${filter_groups_metric} \
+ algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.ppo_micro_batch_size=${train_micro_batch_size} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.85 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size=${infer_micro_batch_size} \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k="${top_k}" \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size=${infer_micro_batch_size} \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=1 \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
+ reward_model.reward_manager=dapo \
+ reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
+ reward_model.overlong_buffer.len=${overlong_buffer_len} \
+ reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
+ trainer.logger='["console","wandb"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=True \
+ trainer.test_freq=2 \
+ trainer.save_freq=2 \
+ trainer.total_epochs=1 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=disable
ICL/DAPO/verl-recipe/dapo/test_dapo_7b_math.sh ADDED
@@ -0,0 +1,131 @@
1
+ #!/usr/bin/env bash
2
+ set -xeuo pipefail
3
+
4
+ project_name='DAPO'
5
+ exp_name='DAPO-Qwen2.5-7b-MATH-0527a1'
6
+
7
+ adv_estimator=grpo
8
+
9
+ use_kl_in_reward=False
10
+ kl_coef=0.0
11
+ use_kl_loss=False
12
+ kl_loss_coef=0.0
13
+
14
+ clip_ratio_low=0.2
15
+ clip_ratio_high=0.28
16
+
17
+ max_prompt_length=$((1024 * 2))
18
+ max_response_length=$((1024 * 8))
19
+ enable_overlong_buffer=True
20
+ overlong_buffer_len=$((1024 * 4))
21
+ overlong_penalty_factor=1.0
22
+
23
+ loss_agg_mode="token-mean"
24
+
25
+ train_prompt_bsz=512
26
+ n_resp_per_prompt=16
27
+ train_prompt_mini_bsz=32
28
+
29
+ # Ray
30
+ # RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
31
+ # WORKING_DIR=${WORKING_DIR:-"${PWD}"}
32
+ # RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
33
+ NNODES=${NNODES:-8}
34
+ NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
35
+ # Paths
36
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
37
+ # very important! please modify the max_position_embeddings in config.json to 32768 after downloading from huggingface
38
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-Math-7B"}
39
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
40
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
41
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
42
+
43
+ # Algorithm
44
+ temperature=1.0
45
+ top_p=1.0
46
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
47
+ val_top_p=0.7
48
+
49
+ # Performance Related Parameter
50
+ sp_size=4
51
+ use_dynamic_bsz=True
52
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
53
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
54
+ offload=True
55
+ gen_tp=4
56
+ fsdp_size=32
57
+
58
+ # reference run wandb: https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/runs/ow47vvon?nw=nwusertongyuxuan361
59
+
60
+ python3 -m verl.trainer.main_ppo \
61
+ data.train_files="${TRAIN_FILE}" \
62
+ data.val_files="${TEST_FILE}" \
63
+ data.prompt_key=prompt \
64
+ data.truncation='left' \
65
+ data.max_prompt_length=${max_prompt_length} \
66
+ data.max_response_length=${max_response_length} \
67
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ actor_rollout_ref.model.use_remove_padding=True \
+ +actor_rollout_ref.model.override_config.max_position_embeddings=32768 \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
+ reward_model.reward_manager=dapo \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
+ +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
+ trainer.logger='["console","wandb"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=True \
+ trainer.test_freq=10 \
+ trainer.save_freq=10 \
+ trainer.total_epochs=10 \
+ trainer.total_training_steps=200 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ trainer.log_val_generations=10
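As a quick sanity check on the settings above, the rollout's `max_num_batched_tokens` is derived from the prompt and response budgets; with the 2k-prompt / 8k-response values these recipes use (assumed here, since the variable definitions sit earlier in this script), it evaluates to 10240:

```shell
# Assumed values matching the recipes in this directory.
max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 8))
max_num_batched_tokens=$((max_prompt_length + max_response_length))
echo "${max_num_batched_tokens}"   # prints 10240
```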
ICL/DAPO/verl-recipe/dapo/test_dapo_7b_math_lora.sh ADDED
@@ -0,0 +1,131 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen2.5-7b-MATH-0527a1'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ train_prompt_bsz=512
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ # Ray
+ # RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ # WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ # RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-8}
+ NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-Math-7B"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance-related parameters
+ sp_size=4
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
+ offload=True
+ gen_tp=4
+ fsdp_size=32
+
+ # remember to set VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 for this model
+
+ python3 -m verl.trainer.main_ppo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ actor_rollout_ref.model.use_remove_padding=True \
+ +actor_rollout_ref.model.override_config.max_position_embeddings=32768 \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.model.lora_rank=8 \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
+ reward_model.reward_manager=dapo \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
+ +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
+ trainer.logger='["console","wandb"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=True \
+ trainer.test_freq=10 \
+ trainer.save_freq=10 \
+ trainer.total_epochs=10 \
+ trainer.total_training_steps=200 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ trainer.log_val_generations=10
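A small sketch of the batch-size relationships in the script above. The divisibility guard mirrors a constraint verl commonly enforces (it is labeled hypothetical here, not verl's actual error message), and the totals are for the values set in this recipe:

```shell
train_prompt_bsz=512
train_prompt_mini_bsz=32
n_resp_per_prompt=16

# Hypothetical guard: the PPO trainer expects train_batch_size to be
# divisible by ppo_mini_batch_size.
if (( train_prompt_bsz % train_prompt_mini_bsz != 0 )); then
  echo "train_batch_size must be divisible by ppo_mini_batch_size" >&2
  exit 1
fi
mini_batches=$((train_prompt_bsz / train_prompt_mini_bsz))   # 16 mini-batches per step
rollout_seqs=$((train_prompt_bsz * n_resp_per_prompt))       # 8192 sampled responses per step
echo "${mini_batches} ${rollout_seqs}"
```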
ICL/DAPO/verl-recipe/dapo/test_dapo_7b_math_megatron.sh ADDED
@@ -0,0 +1,132 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen2.5-7b-MATH-megatron-0519a1'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ train_prompt_bsz=512
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-4}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-Math-7B"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance-related parameters
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
+ offload=True
+ gen_tp=4
+ train_tp=4
+ train_pp=2
+
+ # TODO: support dynamic_bsz for megatron
+ # actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ # actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ # actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ # actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ # actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ # actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+
+ python3 -m verl.trainer.main_ppo \
+ --config-path=config \
+ --config-name='ppo_megatron_trainer.yaml' \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.megatron.param_offload=${offload} \
+ actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.megatron.grad_offload=${offload} \
+ actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
+ actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.optim.clip_grad=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} \
+ actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} \
+ actor_rollout_ref.ref.megatron.param_offload=${offload} \
+ reward_model.reward_manager=dapo \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
+ +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
+ trainer.logger='["console","wandb"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=16 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=False \
+ trainer.test_freq=10 \
+ trainer.save_freq=10 \
+ trainer.total_epochs=10 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ trainer.log_val_generations=10
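The dynamic-batching token budgets in these scripts are simple multiples of the combined sequence budget. Under the 2k-prompt / 8k-response settings above, the actor packs up to 20480 tokens per GPU per forward-backward pass and the log-prob passes up to 30720:

```shell
max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 8))
# Actor gets 2x the sequence budget; forward-only log-prob passes get 3x.
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
echo "${actor_ppo_max_token_len} ${infer_ppo_max_token_len}"   # prints 20480 30720
```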
ICL/DAPO/verl-recipe/dapo/test_dapo_8b_megatron_fp16.sh ADDED
@@ -0,0 +1,142 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ return_raw_chat="False"
+ rollout_mode="async"
+ rollout_name="vllm" # sglang or vllm
+ if [ "$rollout_mode" = "async" ]; then
+ export VLLM_USE_V1=1
+ return_raw_chat="True"
+ fi
+ dtype="float16" # ["bfloat16", "float16"]
+
+ project_name='DAPO-fp16'
+ exp_name='fp16'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ train_prompt_bsz=32
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-1}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-8B-Base"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance-related parameters
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ offload=True
+ gen_tp=1
+ train_tp=2
+ train_pp=1
+
+ # TODO: support dynamic_bsz for megatron
+
+ python3 -m verl.trainer.main_ppo \
+ --config-path=config \
+ --config-name='ppo_megatron_trainer.yaml' \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.return_raw_chat=$return_raw_chat \
+ data.truncation='left' \
+ actor_rollout_ref.rollout.name=${rollout_name} \
+ actor_rollout_ref.rollout.mode=${rollout_mode} \
+ actor_rollout_ref.rollout.dtype=${dtype} \
+ actor_rollout_ref.actor.megatron.dtype=${dtype} \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.model.use_fused_kernels=True \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.megatron.param_offload=${offload} \
+ actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.megatron.grad_offload=${offload} \
+ actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
+ actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.optim.clip_grad=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.actor.megatron.use_mbridge=True \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.rollout.calculate_log_probs=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True \
+ reward_model.reward_manager=dapo \
+ trainer.logger=['console','wandb'] \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=False \
+ trainer.test_freq=10 \
+ trainer.save_freq=-1 \
+ trainer.total_epochs=10 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ trainer.log_val_generations=10
+
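Because these scripts run under `set -u`, any variable referenced in the launch command must be defined on every code path. A minimal sketch of the rollout-mode toggle with a safe default (the values here are illustrative):

```shell
set -u
return_raw_chat="False"   # default, so sync mode never references an unset variable
rollout_mode="sync"       # illustrative; "async" enables the vLLM V1 path
if [ "$rollout_mode" = "async" ]; then
  export VLLM_USE_V1=1
  return_raw_chat="True"
fi
echo "$return_raw_chat"   # prints False for sync mode
```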
ICL/DAPO/verl-recipe/dapo/test_dapo_8b_megatron_fp8train.sh ADDED
@@ -0,0 +1,201 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ # needs CUDA 12.9 or higher
+ # use docker://verlai/verl:dev.vllm_nightly-243ed7d32e94f00a9a32fbbc51be932f6277a55d or a self-built image
+
+
+ # this env var is required for TE fp8 training
+ # if you are running multiple nodes, you need to set this env var in RUNTIME_ENV
+ export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
+
+ ################################################### quick config ###################################################
+
+
+ rollout_mode="sync"
+ rollout_name="vllm" # sglang or vllm
+ return_raw_chat="False"
+ if [ "$rollout_mode" = "async" ]; then
+ export VLLM_USE_V1=1
+ return_raw_chat="True"
+ fi
+ dtype="bfloat16" # ["bfloat16", "float16"]
+
+ project_name='DAPO'
+ exp_name='fp8train'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ train_prompt_bsz=32
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-1}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-8B-Base"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance-related parameters
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ offload=True
+ gen_tp=1
+ train_tp=2
+ train_pp=1
+
+ ################################################### start of config ###################################################
+
+ FP8=(
+ +actor_rollout_ref.actor.megatron.override_transformer_config.fp8="e4m3" # e4m3 or hybrid
+ +actor_rollout_ref.actor.megatron.override_transformer_config.fp8_recipe="blockwise"
+ +actor_rollout_ref.actor.optim.override_optimizer_config.fp8_recipe="blockwise"
+ )
+
+ DATA=(
+ # data settings
+ data.train_files="${TRAIN_FILE}"
+ data.val_files="${TEST_FILE}"
+ data.prompt_key=prompt
+ data.return_raw_chat=$return_raw_chat
+ data.truncation='left'
+ data.max_prompt_length=${max_prompt_length}
+ data.max_response_length=${max_response_length}
+ data.train_batch_size=${train_prompt_bsz}
+ )
+
+ REWARD_MODEL=(
+ +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer}
+ +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len}
+ +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor}
+ +reward_model.reward_kwargs.overlong_buffer_cfg.log=False
+ +reward_model.reward_kwargs.max_resp_len=${max_response_length}
+ reward_model.reward_manager=dapo
+ )
+
+ PERF_OPT=(
+ +actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True
+ actor_rollout_ref.model.use_fused_kernels=False
+ )
+
+ ACTOR=(
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss}
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef}
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low}
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high}
+ actor_rollout_ref.actor.clip_ratio_c=10.0
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz}
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len}
+ actor_rollout_ref.actor.optim.lr=1e-6
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10
+ actor_rollout_ref.actor.optim.weight_decay=0.1
+ actor_rollout_ref.actor.optim.clip_grad=1.0
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz}
+ actor_rollout_ref.actor.megatron.param_offload=${offload}
+ actor_rollout_ref.actor.megatron.optimizer_offload=${offload}
+ actor_rollout_ref.actor.megatron.grad_offload=${offload}
+ actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp}
+ actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp}
+ actor_rollout_ref.actor.entropy_coeff=0
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode}
+ actor_rollout_ref.actor.megatron.use_mbridge=True
+ )
+
+ ROLLOUT=(
+ actor_rollout_ref.rollout.name=${rollout_name}
+ actor_rollout_ref.rollout.mode=${rollout_mode}
+ actor_rollout_ref.rollout.dtype=${dtype}
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.80
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp}
+ actor_rollout_ref.rollout.enable_chunked_prefill=True
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length))
+ actor_rollout_ref.rollout.temperature=${temperature}
+ actor_rollout_ref.rollout.top_p=${top_p}
+ actor_rollout_ref.rollout.top_k=${top_k}
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature}
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p}
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k}
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True
+ actor_rollout_ref.rollout.val_kwargs.n=1
+ actor_rollout_ref.rollout.calculate_log_probs=True
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt}
+ )
+
+ TRAINER=(
+ trainer.logger=['console','wandb']
+ trainer.project_name="${project_name}"
+ trainer.experiment_name="${exp_name}"
+ trainer.n_gpus_per_node=8
+ trainer.nnodes="${NNODES}"
+ trainer.val_before_train=False
+ trainer.test_freq=10
+ trainer.save_freq=-1
+ trainer.total_epochs=10
+ trainer.default_local_dir="${CKPTS_DIR}"
+ trainer.resume_mode=auto
+ trainer.log_val_generations=10
+ )
+
+ FORWARD_ONLY_SETS=(
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len}
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len}
+ )
+
+ MODEL=(
+ actor_rollout_ref.model.path="${MODEL_PATH}"
+ )
+
+ ALGORITHM=(
+ algorithm.adv_estimator=${adv_estimator}
+ algorithm.use_kl_in_reward=${use_kl_in_reward}
+ algorithm.kl_ctrl.kl_coef=${kl_coef}
+ )
+ ################################################### start script ###################################################
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+ -- python3 -m verl.trainer.main_ppo \
+ --config-path=config \
+ --config-name='ppo_megatron_trainer.yaml' \
+ "${DATA[@]}" \
+ "${ALGORITHM[@]}" \
+ "${MODEL[@]}" \
+ "${ROLLOUT[@]}" \
+ "${ACTOR[@]}" \
+ "${REWARD_MODEL[@]}" \
+ "${FP8[@]}" \
+ "${PERF_OPT[@]}" \
+ "${TRAINER[@]}" \
+ "${FORWARD_ONLY_SETS[@]}"
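The script above groups its Hydra overrides into bash arrays and expands each group with `"${GROUP[@]}"`, which keeps every override a single argv entry. A minimal sketch of the pattern (group names and values here are illustrative, not taken from the recipe):

```shell
# Collect overrides per concern, then expand each array quoted.
DATA=(
  data.prompt_key=prompt
  data.truncation='left'
)
TRAINER=(
  trainer.nnodes=1
)
# Each array element becomes exactly one argument to the launched command.
printf '%s\n' "${DATA[@]}" "${TRAINER[@]}"
```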
ICL/DAPO/verl-recipe/dapo/test_dapo_dspk_671b_megatron_96gb.sh ADDED
@@ -0,0 +1,143 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ # 0. download the config
+ # only need to download configuration_deepseek.py and config.json
+ # remove the `quantization_config` in the `config.json`
+ # set `num_nextn_predict_layers=0` to disable MTP, which is not currently supported
+ huggingface-cli download deepseek-ai/DeepSeek-V3-0324 configuration_deepseek.py config.json
+
+ project_name='DAPO'
+ exp_name='DAPO-DeepSeek-671b-megatron'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=False
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=0.1
+
+ loss_agg_mode="token-mean"
+
+ train_prompt_bsz=256 # must be > n_gpus; TODO: fix
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32 # mini_bsz * n >= micro_bsz * pp * dp
+
+ NNODES=${NNODES:-64}
+
+ # 1. download the dist_ckpt format model from https://huggingface.co/BearBiscuit05/dpsk-v3-671B-BF16-dist_ckpt/tree/main
+ # change the MODEL_PATH and MCORE_MODEL_PATH to your own path
+ # Paths
+ MODEL_PATH="<path_to_dsv3_config>"
+ MCORE_MODEL_PATH="<path_to_dpsk-v3-671B-BF16-dist_ckpt>"
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ aime24_test_path=${RAY_DATA_HOME}/data/aime-2024.parquet
+ # TEST_FILE="['$math500_test_path', '$aime24_test_path']"
+
+ TEST_FILE="['$aime24_test_path']"
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance-related parameters
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
+ offload=True
+ gen_tp=32
+ train_tp=1
+ train_ep=32
+ train_pp=16
+
+ python3 -m verl.trainer.main_ppo \
+ --config-path=config \
+ --config-name='ppo_megatron_trainer.yaml' \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.megatron.param_offload=${offload} \
+ actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.megatron.grad_offload=${offload} \
+ actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
+ actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
+ actor_rollout_ref.actor.megatron.expert_model_parallel_size=${train_ep} \
+ actor_rollout_ref.actor.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH} \
+ actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=3 \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=2 \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.optim.clip_grad=1.0 \
105
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
106
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
107
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
108
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
109
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
110
+ actor_rollout_ref.rollout.temperature=${temperature} \
111
+ actor_rollout_ref.rollout.top_p=${top_p} \
112
+ actor_rollout_ref.rollout.top_k=${top_k} \
113
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
114
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
115
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
116
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
117
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
118
+ actor_rollout_ref.rollout.name=vllm \
119
+ actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} \
120
+ actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} \
121
+ actor_rollout_ref.ref.megatron.expert_model_parallel_size=${train_ep} \
122
+ actor_rollout_ref.ref.megatron.param_offload=${offload} \
123
+ actor_rollout_ref.ref.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH} \
124
+ actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
125
+ reward_model.reward_manager=dapo \
126
+ +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
127
+ +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
128
+ +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
129
+ +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
130
+ +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
131
+ trainer.logger='["console","wandb"]' \
132
+ trainer.project_name="${project_name}" \
133
+ trainer.experiment_name="${exp_name}" \
134
+ trainer.n_gpus_per_node=8 \
135
+ trainer.nnodes="${NNODES}" \
136
+ trainer.val_before_train=False \
137
+ trainer.test_freq=5 \
138
+ trainer.save_freq=5 \
139
+ trainer.total_epochs=10 \
140
+ trainer.total_training_steps=10 \
141
+ trainer.default_local_dir="${CKPTS_DIR}" \
142
+ trainer.resume_mode=auto \
143
+ trainer.log_val_generations=10
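The sizing rule noted in the script above (`mini_bsz * n >= micro_bsz * pp * dp`) can be sanity-checked before submitting a job. A minimal sketch, using the script's own values for everything except `dp`, which is an assumed illustrative data-parallel size (the script does not set it explicitly):

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the DAPO batch-sizing constraint:
# mini_bsz * n_resp_per_prompt must cover micro_bsz * pp * dp.
train_prompt_mini_bsz=32
n_resp_per_prompt=16
micro_bsz=2     # ppo_micro_batch_size_per_gpu from the script
train_pp=16     # pipeline model parallel size from the script
dp=2            # assumed data-parallel size, for illustration only

lhs=$((train_prompt_mini_bsz * n_resp_per_prompt))
rhs=$((micro_bsz * train_pp * dp))
if [ "$lhs" -ge "$rhs" ]; then
  echo "batch sizing OK: ${lhs} >= ${rhs}"
else
  echo "batch sizing INVALID: ${lhs} < ${rhs}" >&2
  exit 1
fi
```

Running such a check at the top of the script fails fast on a misconfiguration instead of surfacing it mid-training.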
ICL/DAPO/verl-recipe/dapo/test_dapo_glm_air_megatron.sh ADDED
@@ -0,0 +1,197 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ NNODES=${NNODES:-8}
+ NGPUS_PER_NODES=${NGPUS_PER_NODES:-8}
+
+ project_name='DAPO'
+ exp_name='DAPO-GLM-AIR-MATH-megatron'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ train_prompt_bsz=512
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=128
+ train_ppo_micro_batch_size_per_gpu=2
+ infer_ppo_micro_batch_size_per_gpu=2
+ # Paths
+ MODEL_PATH=/models/zai-org/GLM-4.5-Air-Base
+ # The GLM base model can use chat_template.jinja from the instruct model
+ cp /models/zai-org/GLM-4.5-Air/chat_template.jinja ${MODEL_PATH}/chat_template.jinja
+
+ TRAIN_FILE=/data/dapo/dapo-math-17k.parquet
+ aime24_test_path=/data/dapo/aime-2024.parquet
+ # math500_test_path=/data/rlhf/math500/test.parquet
+
+ # TEST_FILE="['$math500_test_path', '$aime24_test_path']"
+
+ TEST_FILE="['$aime24_test_path']"
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance Related Parameter
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length)))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length)))
+ offload=True
+
+ COMMON_PP=${COMMON_PP:-2}
+ COMMON_VPP=${COMMON_VPP:-null}
+ COMMON_CP=${COMMON_CP:-4}
+ COMMON_TP=${COMMON_TP:-2}
+ COMMON_EP=${COMMON_EP:-8}
+ COMMON_ETP=${COMMON_ETP:-1}
+
+ TRAIN_TP=${TRAIN_TP:-$COMMON_TP}
+ INFER_TP=${INFER_TP:-8}
+
+ ACTOR_PP=${ACTOR_PP:-$COMMON_PP}
+ ACTOR_VPP=${ACTOR_VPP:-$COMMON_VPP}
+ ACTOR_CP=${ACTOR_CP:-$COMMON_CP}
+ ACTOR_TP=${ACTOR_TP:-$TRAIN_TP}
+ ACTOR_EP=${ACTOR_EP:-$COMMON_EP}
+ ACTOR_ETP=${ACTOR_ETP:-$COMMON_ETP}
+ ROLLOUT_TP=${ROLLOUT_TP:-$INFER_TP}
+ REF_PP=${REF_PP:-$COMMON_PP}
+ REF_VPP=${REF_VPP:-$COMMON_VPP}
+ REF_CP=${REF_CP:-$COMMON_CP}
+ REF_TP=${REF_TP:-$TRAIN_TP}
+ REF_EP=${REF_EP:-$COMMON_EP}
+ REF_ETP=${REF_ETP:-$COMMON_ETP}
+ CRITIC_PP=${CRITIC_PP:-$COMMON_PP}
+ CRITIC_VPP=${CRITIC_VPP:-$COMMON_VPP}
+ CRITIC_CP=${CRITIC_CP:-$COMMON_CP}
+ CRITIC_TP=${CRITIC_TP:-$TRAIN_TP}
+ CRITIC_EP=${CRITIC_EP:-$COMMON_EP}
+ CRITIC_ETP=${CRITIC_ETP:-$COMMON_ETP}
+ RM_PP=${RM_PP:-$COMMON_PP}
+ RM_VPP=${RM_VPP:-$COMMON_VPP}
+ RM_CP=${RM_CP:-$COMMON_CP}
+ RM_TP=${RM_TP:-$TRAIN_TP}
+ RM_EP=${RM_EP:-$COMMON_EP}
+ RM_ETP=${RM_ETP:-$COMMON_ETP}
+
+ USE_MBRIDGE=True
+ USE_DIST_CKPT=False
+
+ # Install the latest mbridge
+ # pip install --no-cache-dir git+https://github.com/ISEEKYAN/mbridge.git
+
+ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer' \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ +actor_rollout_ref.model.override_config.model_config.max_position_embeddings=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.model.use_fused_kernels=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${train_ppo_micro_batch_size_per_gpu} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.lr_decay_style='constant' \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.megatron.use_mbridge=$USE_MBRIDGE \
+ actor_rollout_ref.actor.megatron.use_dist_checkpointing=$USE_DIST_CKPT \
+ actor_rollout_ref.actor.megatron.param_offload=${offload} \
+ actor_rollout_ref.actor.megatron.grad_offload=${offload} \
+ actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${ACTOR_TP} \
+ actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${ACTOR_PP} \
+ actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=${ACTOR_VPP} \
+ actor_rollout_ref.actor.megatron.context_parallel_size=${ACTOR_CP} \
+ actor_rollout_ref.actor.megatron.expert_model_parallel_size=${ACTOR_EP} \
+ actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=${ACTOR_ETP} \
+ actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity="selective" \
+ actor_rollout_ref.actor.megatron.override_transformer_config.recompute_modules=["core_attn","moe_act","layernorm","mlp","moe"] \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.masked_softmax_fusion=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.bias_activation_fusion=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.bias_dropout_fusion=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.deallocate_pipeline_outputs=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.persist_layer_norm=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.moe_grouped_gemm=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=True \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.moe_shared_expert_overlap=False \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.moe_token_dispatcher_type="flex" \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32 \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=False \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.rollout.name='vllm' \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${infer_ppo_micro_batch_size_per_gpu} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${INFER_TP} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${infer_ppo_micro_batch_size_per_gpu} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
+ actor_rollout_ref.ref.megatron.param_offload=${offload} \
+ actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${REF_TP} \
+ actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${REF_PP} \
+ actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=${REF_VPP} \
+ actor_rollout_ref.ref.megatron.context_parallel_size=${REF_CP} \
+ actor_rollout_ref.ref.megatron.expert_model_parallel_size=${REF_EP} \
+ actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=${REF_ETP} \
+ reward_model.reward_manager=dapo \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
+ +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
+ trainer.logger=['console','wandb'] \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node="${NGPUS_PER_NODES}" \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=False \
+ trainer.test_freq=10 \
+ trainer.save_freq=100 \
+ trainer.total_epochs=10 \
+ trainer.resume_mode=auto \
+ trainer.log_val_generations=10
ICL/DAPO/verl-recipe/dapo/test_dapo_gptoss_20b_megatron.sh ADDED
@@ -0,0 +1,248 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ ################################################### document for gptoss ###################################################
+
+ ####################### running environment: #######################
+ # option 1: use a pre-built image, verlai/verl:vll012.exp or verlai/verl:sgl056.exp
+ #
+ # option 2: build your own with TE>=2.8, CUDNN>=9.13.1, Megatron branch `core_dev_r0.15.0`, and the latest vllm or sglang;
+ # you can modify the Dockerfile to build the image, see https://github.com/volcengine/verl/blob/main/docker/Dockerfile.stable.vllm or https://github.com/volcengine/verl/blob/main/docker/Dockerfile.stable.sglang
+
+
+ ####################### before training: #######################
+ # # install the matched mbridge version
+ # pip uninstall -y mbridge && pip install git+https://github.com/ISEEKYAN/mbridge@gpt-oss
+
+ # # convert gpt-oss to bf16
+ cat > get_model.py << EOF
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config
+
+ model_id = "openai/gpt-oss-20b"
+ output_dir = "$HOME/models/gpt-oss-20b-bf16"
+
+ quantization_config = Mxfp4Config(dequantize=True)
+ model_kwargs = dict(
+     attn_implementation="eager",
+     torch_dtype=torch.bfloat16,
+     quantization_config=quantization_config,
+     use_cache=False,
+     device_map="auto",
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
+
+ # Patch config with custom attribute before saving
+ model.config.attn_implementation = "eager"
+
+ model.save_pretrained(output_dir)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ tokenizer.save_pretrained(output_dir)
+ EOF
+
+ python get_model.py
+
+ ####################### specific training config: #######################
+
+ GPT_OSS_CONFIG=(
+     # only mbridge is supported for gpt-oss
+     actor_rollout_ref.actor.megatron.use_mbridge=True
+     # for now (latest TE=2.10), gpt-oss's optimized attention kernel does not support the thd format, so we use the bshd format here;
+     # with bshd, input_ids must be padded to the longest sequence length,
+     # so we recommend disabling dynamic batch size and setting the micro batch size to 1 to avoid padding,
+     # though micro_batch_size > 1 is also fine to try
+     actor_rollout_ref.actor.megatron.use_remove_padding=False
+ )
+ use_dynamic_bsz=False # recommended but not required
+
+ ################################################### quick config ###################################################
+
+ rollout_mode="async"
+ rollout_name="vllm" # sglang or vllm
+ export VLLM_USE_V1=1
+ return_raw_chat="True"
+ dtype="bfloat16" # ["bfloat16", "float16"]
+
+ project_name='DAPO'
+ exp_name='gptoss'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ train_prompt_bsz=32
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-1}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/gpt-oss-20b"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance Related Parameter
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ offload=True
+ gen_tp=4
+ train_tp=4
+ EP=8
+ ETP=1
+ train_pp=1
+
+ ################################################### start of config ###################################################
+
+
+ DATA=(
+     data.train_files="${TRAIN_FILE}"
+     data.val_files="${TEST_FILE}"
+     data.prompt_key=prompt
+     data.return_raw_chat=$return_raw_chat
+     data.truncation='left'
+     data.max_prompt_length=${max_prompt_length}
+     data.max_response_length=${max_response_length}
+     data.train_batch_size=${train_prompt_bsz}
+ )
+
+ REWARD_MODEL=(
+     +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer}
+     +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len}
+     +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor}
+     +reward_model.reward_kwargs.overlong_buffer_cfg.log=False
+     +reward_model.reward_kwargs.max_resp_len=${max_response_length}
+     reward_model.reward_manager=dapo
+ )
+
+ PERF_OPT=(
+     +actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True
+     actor_rollout_ref.model.use_fused_kernels=False
+     +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+     +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+     +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
+     actor_rollout_ref.actor.megatron.override_transformer_config.attention_backend=auto
+     +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=1
+     +actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True
+     +actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+     +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
+ )
+
+ ACTOR=(
+     actor_rollout_ref.actor.use_kl_loss=${use_kl_loss}
+     actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef}
+     actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low}
+     actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high}
+     actor_rollout_ref.actor.clip_ratio_c=10.0
+     actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2
+     actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz}
+     actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len}
+     actor_rollout_ref.actor.optim.lr=1e-6
+     actor_rollout_ref.actor.optim.lr_warmup_steps=10
+     actor_rollout_ref.actor.optim.weight_decay=0.1
+     actor_rollout_ref.actor.optim.clip_grad=1.0
+     actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz}
+     actor_rollout_ref.actor.megatron.param_offload=${offload}
+     actor_rollout_ref.actor.megatron.optimizer_offload=${offload}
+     actor_rollout_ref.actor.megatron.grad_offload=${offload}
+     actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp}
+     actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp}
+     actor_rollout_ref.actor.megatron.expert_model_parallel_size=${EP}
+     actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=${ETP}
+     actor_rollout_ref.actor.entropy_coeff=0
+     actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode}
+ )
+
+ ROLLOUT=(
+     actor_rollout_ref.rollout.name=${rollout_name}
+     actor_rollout_ref.rollout.mode=${rollout_mode}
+     actor_rollout_ref.rollout.dtype=${dtype}
+     actor_rollout_ref.rollout.gpu_memory_utilization=0.70
+     actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp}
+     actor_rollout_ref.rollout.enable_chunked_prefill=True
+     actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length))
+     actor_rollout_ref.rollout.temperature=${temperature}
+     actor_rollout_ref.rollout.top_p=${top_p}
+     actor_rollout_ref.rollout.top_k=${top_k}
+     actor_rollout_ref.rollout.val_kwargs.temperature=${temperature}
+     actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p}
+     actor_rollout_ref.rollout.val_kwargs.top_k=${top_k}
+     actor_rollout_ref.rollout.val_kwargs.do_sample=True
+     actor_rollout_ref.rollout.val_kwargs.n=1
+     actor_rollout_ref.rollout.calculate_log_probs=True
+     actor_rollout_ref.rollout.n=${n_resp_per_prompt}
+ )
+
+ TRAINER=(
+     trainer.logger=['console','wandb']
+     trainer.project_name="${project_name}"
+     trainer.experiment_name="${exp_name}"
+     trainer.n_gpus_per_node=8
+     trainer.nnodes="${NNODES}"
+     trainer.val_before_train=False
+     trainer.test_freq=10
+     trainer.save_freq=-1
+     trainer.total_epochs=10
+     trainer.default_local_dir="${CKPTS_DIR}"
+     trainer.resume_mode=auto
+     trainer.log_val_generations=10
+ )
+
+ FORWARD_ONLY_SETS=(
+     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4
+     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4
+     actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
+     actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
+     actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len}
+     actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len}
+ )
+
+ MODEL=(
+     actor_rollout_ref.model.path="${MODEL_PATH}"
+ )
+
+ ALGORITHM=(
+     algorithm.adv_estimator=${adv_estimator}
+     algorithm.use_kl_in_reward=${use_kl_in_reward}
+     algorithm.kl_ctrl.kl_coef=${kl_coef}
+ )
+ ################################################### start script ###################################################
+ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
+     -- python3 -m verl.trainer.main_ppo \
+     --config-path=config \
+     --config-name='ppo_megatron_trainer.yaml' \
+     "${DATA[@]}" \
+     "${ALGORITHM[@]}" \
+     "${MODEL[@]}" \
+     "${ROLLOUT[@]}" \
+     "${ACTOR[@]}" \
+     "${REWARD_MODEL[@]}" \
+     "${PERF_OPT[@]}" \
+     "${TRAINER[@]}" \
+     "${GPT_OSS_CONFIG[@]}" \
+     "${FORWARD_ONLY_SETS[@]}"
ICL/DAPO/verl-recipe/dapo/test_dapo_qwen3_30b_math.sh ADDED
@@ -0,0 +1,127 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ project_name='DAPO'
+ exp_name='DAPO-Qwen3-30B-A3B-Base-MATH-0527a1'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ train_prompt_bsz=512
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ # Ray
+ # RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ # WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ # RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-8}
+ NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-30B-A3B-Base"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance Related Parameter
+ sp_size=4
+ use_dynamic_bsz=True
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
+ offload=True
+ gen_tp=4
+ fsdp_size=32
+
+ python3 -m verl.trainer.main_ppo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.truncation='left' \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ actor_rollout_ref.model.use_remove_padding=True \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.grad_clip=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ actor_rollout_ref.rollout.name=vllm \
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
+ reward_model.reward_manager=dapo \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+ +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
+ +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
+ trainer.logger='["console","wandb"]' \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=True \
+ trainer.test_freq=10 \
+ trainer.save_freq=10 \
+ trainer.total_epochs=10 \
+ trainer.total_training_steps=300 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ trainer.log_val_generations=10
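The token-budget arithmetic in the script above is worth spelling out: with a 2k prompt and 8k response budget, the dynamic-batching limits are set to 2x one full sequence for actor training and 3x for log-prob inference. A minimal sketch reproducing that derivation:

```shell
#!/usr/bin/env bash
# Derives the per-GPU token budgets used by the script above.
max_prompt_length=$((1024 * 2))      # 2048
max_response_length=$((1024 * 8))   # 8192
# actor forward/backward packs up to 2 full sequences per GPU;
# log-prob inference (forward only) packs up to 3
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
echo "actor budget: ${actor_ppo_max_token_len} tokens"
echo "infer budget: ${infer_ppo_max_token_len} tokens"
```

The inference multiplier can be larger than the training one because the forward-only pass holds no activations for the backward step.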
ICL/DAPO/verl-recipe/dapo/test_dapo_qwen3_30b_math_single_node.sh ADDED
@@ -0,0 +1,127 @@
1
+ #!/usr/bin/env bash
2
+ set -xeuo pipefail
3
+
4
+ project_name='DAPO'
5
+ exp_name='DAPO-Qwen3-30B-A3B-Base-MATH-0719a1'
6
+
7
+ adv_estimator=grpo
8
+
9
+ use_kl_in_reward=False
10
+ kl_coef=0.0
11
+ use_kl_loss=False
12
+ kl_loss_coef=0.0
13
+
14
+ clip_ratio_low=0.2
15
+ clip_ratio_high=0.28
16
+
17
+ max_prompt_length=$((1024 * 2))
18
+ max_response_length=$((1024 * 4))
19
+ enable_overlong_buffer=False
20
+ overlong_buffer_len=$((1024 * 4))
21
+ overlong_penalty_factor=0.1
22
+
23
+ loss_agg_mode="token-mean"
24
+
25
+ train_prompt_bsz=64
26
+ n_resp_per_prompt=16
27
+ train_prompt_mini_bsz=16
28
+
29
+ # Ray
30
+ # RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
31
+ # WORKING_DIR=${WORKING_DIR:-"${PWD}"}
32
+ # RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
33
+ NNODES=${NNODES:-1}
34
+ NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
35
+ # Paths
36
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
37
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-30B-A3B-Base"}
38
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
39
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
40
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
41
+
42
+ # Algorithm
43
+ temperature=1.0
44
+ top_p=1.0
45
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
46
+ val_top_p=0.7
47
+
48
+ # Performance Related Parameter
49
+ sp_size=4
50
+ use_dynamic_bsz=True
51
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
52
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
53
+ offload=True
54
+ gen_tp=4
55
+ fsdp_size=8
56
+
57
+ python3 -m verl.trainer.main_ppo \
58
+ data.train_files="${TRAIN_FILE}" \
59
+ data.val_files="${TEST_FILE}" \
60
+ data.prompt_key=prompt \
61
+ data.truncation='left' \
62
+ data.max_prompt_length=${max_prompt_length} \
63
+ data.max_response_length=${max_response_length} \
64
+ data.train_batch_size=${train_prompt_bsz} \
65
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
66
+ algorithm.adv_estimator=${adv_estimator} \
67
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
68
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
69
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
70
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
71
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
72
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
73
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
74
+ actor_rollout_ref.model.use_remove_padding=True \
75
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
76
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
77
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
78
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
79
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
80
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
81
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
82
+ actor_rollout_ref.model.enable_gradient_checkpointing=True \
83
+ actor_rollout_ref.actor.optim.lr=1e-6 \
84
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
85
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
86
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
87
+ actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
88
+ actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
89
+ actor_rollout_ref.actor.entropy_coeff=0 \
90
+ actor_rollout_ref.actor.grad_clip=1.0 \
91
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
92
+ actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
93
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
94
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
95
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
96
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
97
+ actor_rollout_ref.rollout.temperature=${temperature} \
98
+ actor_rollout_ref.rollout.top_p=${top_p} \
99
+ actor_rollout_ref.rollout.top_k=${top_k} \
100
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
101
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
102
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
103
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
104
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
105
+ actor_rollout_ref.rollout.name=vllm \
106
+ actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
107
+ actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
108
+ actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
109
+ reward_model.reward_manager=dapo \
110
+ +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
111
+ +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
112
+ +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
113
+ +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
114
+ +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
115
+ trainer.logger='["console","wandb"]' \
116
+ trainer.project_name="${project_name}" \
117
+ trainer.experiment_name="${exp_name}" \
118
+ trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
119
+ trainer.nnodes="${NNODES}" \
120
+ trainer.val_before_train=True \
121
+ trainer.test_freq=10 \
122
+ trainer.save_freq=-1 \
123
+ trainer.total_epochs=10 \
124
+ trainer.total_training_steps=300 \
125
+ trainer.default_local_dir="${CKPTS_DIR}" \
126
+ trainer.resume_mode=auto \
127
+ trainer.log_val_generations=10
ICL/DAPO/verl-recipe/dapo/test_dapo_qwen3_moe_30b_megatron_fp16.sh ADDED
@@ -0,0 +1,148 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ rollout_mode="async"
+ rollout_name="vllm" # sglang or vllm
+ if [ "$rollout_mode" = "async" ]; then
+ export VLLM_USE_V1=1
+ return_raw_chat="True"
+ else
+ return_raw_chat="False"
+ fi
+
+ dtype="float16" # ["bfloat16", "float16"]
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ train_prompt_bsz=32
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ NNODES=4
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:6379"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/verl/trainer/runtime_env.yaml"}
+
+ project_name='DAPO-moe-fp16'
+ exp_name='qwen3moe_30b_a3b_fp16'
+
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-30B-A3B"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/checkpoints/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ # Performance Related Parameters
+ use_dynamic_bsz=False
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ offload=True
+ gen_tp=8
+ train_tp=4
+ train_pp=4
+ train_ep=8
+
+ python3 -m verl.trainer.main_ppo \
+ --config-path=config \
+ --config-name='ppo_megatron_trainer.yaml' \
+ reward_model.reward_manager=dapo \
+ data.train_files="${TRAIN_FILE}" \
+ data.val_files="${TEST_FILE}" \
+ data.prompt_key=prompt \
+ data.return_raw_chat=$return_raw_chat \
+ data.truncation='left' \
+ actor_rollout_ref.rollout.name=${rollout_name} \
+ actor_rollout_ref.rollout.mode=${rollout_mode} \
+ actor_rollout_ref.rollout.dtype=${dtype} \
+ actor_rollout_ref.rollout.calculate_log_probs=True \
+ actor_rollout_ref.actor.megatron.dtype=${dtype} \
+ data.max_prompt_length=${max_prompt_length} \
+ data.max_response_length=${max_response_length} \
+ data.train_batch_size=${train_prompt_bsz} \
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+ algorithm.adv_estimator=${adv_estimator} \
+ algorithm.use_kl_in_reward=${use_kl_in_reward} \
+ algorithm.kl_ctrl.kl_coef=${kl_coef} \
+ actor_rollout_ref.model.use_fused_kernels=True \
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+ actor_rollout_ref.actor.clip_ratio_c=10.0 \
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \
+ actor_rollout_ref.actor.optim.lr=1e-6 \
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+ actor_rollout_ref.actor.optim.weight_decay=0.1 \
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+ actor_rollout_ref.actor.megatron.param_offload=${offload} \
+ actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
+ actor_rollout_ref.actor.megatron.grad_offload=${offload} \
+ actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
+ actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
+ actor_rollout_ref.actor.megatron.expert_model_parallel_size=${train_ep} \
+ actor_rollout_ref.actor.entropy_coeff=0 \
+ actor_rollout_ref.actor.optim.clip_grad=1.0 \
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+ actor_rollout_ref.rollout.enable_chunked_prefill=True \
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+ actor_rollout_ref.rollout.temperature=${temperature} \
+ actor_rollout_ref.rollout.top_p=${top_p} \
+ actor_rollout_ref.rollout.top_k=${top_k} \
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+ actor_rollout_ref.actor.megatron.use_mbridge=True \
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+ actor_rollout_ref.rollout.val_kwargs.n=1 \
+ +actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True \
+ trainer.logger=['console','wandb'] \
+ trainer.project_name="${project_name}" \
+ trainer.experiment_name="${exp_name}" \
+ trainer.n_gpus_per_node=8 \
+ trainer.nnodes="${NNODES}" \
+ trainer.val_before_train=False \
+ trainer.test_freq=5 \
+ trainer.save_freq=-1 \
+ trainer.total_epochs=10 \
+ trainer.default_local_dir="${CKPTS_DIR}" \
+ trainer.resume_mode=auto \
+ trainer.log_val_generations=10
ICL/DAPO/verl-recipe/dapo/test_dapo_qwen3next_80b_megatron.sh ADDED
@@ -0,0 +1,232 @@
+ #!/usr/bin/env bash
+ set -xeuo pipefail
+
+ ################################################### document for qwen3next ###################################################
+
+ ####################### running environment: #######################
+
+ # option 1: use pre-built docker images verlai/verl:vll012.exp or verlai/verl:sgl056.exp
+
+ # option 2: self-build TE>=2.8, Megatron from the dev branch, and megatron-bridge from the main branch
+
+ ####################### how do we support qwen3next? #######################
+ # we support qwen3next with megatron-bridge, which is enabled by setting `vanilla_mbridge=False`
+
+ ####################### limitations: #######################
+ # 1. context parallel (CP) is not supported until this PR is merged: https://github.com/NVIDIA/Megatron-LM/pull/2614
+ # 2. sequence packing (aka thd) is not supported; we must set `actor_rollout_ref.actor.megatron.use_remove_padding=False` until this PR is merged: https://github.com/NVIDIA/Megatron-LM/pull/2644
+
+ ## if sequence packing is disabled, we recommend setting `use_dynamic_bsz=False` and the micro batch size to 1;
+ ## otherwise the data will be padded to the max length of the batch, which is inefficient. This is not mandatory, though.
+
+ ################################################### quick config ###################################################
+
+ # pip install --no-deps --no-cache-dir git+https://github.com/NVIDIA/Megatron-LM.git@dev # install megatron from dev branch
+ # pip install --no-deps git+https://github.com/NVIDIA-Nemo/Megatron-Bridge.git # install megatron-bridge from main branch
+
+ rollout_mode="async"
+ return_raw_chat="True"
+ export VLLM_USE_V1=1
+ rollout_name="vllm" # sglang or vllm
+ dtype="bfloat16"
+
+ project_name='DAPO-test'
+ exp_name='qwen3next'
+
+ adv_estimator=grpo
+
+ use_kl_in_reward=False
+ kl_coef=0.0
+ use_kl_loss=False
+ kl_loss_coef=0.0
+
+ clip_ratio_low=0.2
+ clip_ratio_high=0.28
+
+ max_prompt_length=$((1024 * 2))
+ max_response_length=$((1024 * 8))
+ enable_overlong_buffer=True
+ overlong_buffer_len=$((1024 * 4))
+ overlong_penalty_factor=1.0
+
+ loss_agg_mode="token-mean"
+
+ train_prompt_bsz=32
+ n_resp_per_prompt=16
+ train_prompt_mini_bsz=32
+
+ # Ray
+ RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+ WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+ RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/verl/trainer/runtime_env.yaml"}
+ NNODES=${NNODES:-4}
+ # Paths
+ RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+ MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen3-Next-80B-A3B-Instruct"}
+ CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+ TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+ TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+ # Algorithm
+ temperature=1.0
+ top_p=1.0
+ top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+ val_top_p=0.7
+
+ # Performance Related Parameters
+ use_dynamic_bsz=False
+ actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
+ offload=True
+ gen_tp=16
+ train_tp=2
+ EP=32
+ ETP=1
+ train_pp=1
+
+ ################################################### start of config ###################################################
+
+ FP8=(
+ # # train
+ # +actor_rollout_ref.actor.megatron.override_transformer_config.fp8="e4m3" # e4m3 or hybrid
+ # +actor_rollout_ref.actor.megatron.override_transformer_config.fp8_recipe="blockwise"
+ # +actor_rollout_ref.actor.optim.override_optimizer_config.fp8_recipe="blockwise"
+ # # rollout
+ # +actor_rollout_ref.rollout.quantization="fp8"
+ )
+
+ DATA=(
+ data.train_files="${TRAIN_FILE}"
+ data.val_files="${TEST_FILE}"
+ data.prompt_key=prompt
+ data.return_raw_chat=$return_raw_chat
+ data.truncation='left'
+ data.max_prompt_length=${max_prompt_length}
+ data.max_response_length=${max_response_length}
+ data.train_batch_size=${train_prompt_bsz}
+ )
+
+ REWARD_MODEL=(
+ +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer}
+ +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len}
+ +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor}
+ +reward_model.reward_kwargs.overlong_buffer_cfg.log=False
+ +reward_model.reward_kwargs.max_resp_len=${max_response_length}
+ reward_model.reward_manager=dapo
+ )
+
+ PERF_OPT=(
+ +actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True
+ actor_rollout_ref.actor.megatron.use_remove_padding=False
+ +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+ +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+ +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
+ actor_rollout_ref.actor.megatron.override_transformer_config.attention_backend=auto
+ +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=1
+ +actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True
+ +actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+ +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
+ )
+
+ ACTOR=(
+ actor_rollout_ref.actor.use_kl_loss=${use_kl_loss}
+ actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef}
+ actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low}
+ actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high}
+ actor_rollout_ref.actor.clip_ratio_c=10.0
+ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2
+ actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz}
+ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len}
+ actor_rollout_ref.actor.optim.lr=1e-6
+ actor_rollout_ref.actor.optim.lr_warmup_steps=10
+ actor_rollout_ref.actor.optim.weight_decay=0.1
+ actor_rollout_ref.actor.optim.clip_grad=1.0
+ actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz}
+ actor_rollout_ref.actor.megatron.param_offload=${offload}
+ actor_rollout_ref.actor.megatron.optimizer_offload=${offload}
+ actor_rollout_ref.actor.megatron.grad_offload=${offload}
+ actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp}
+ actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp}
+ actor_rollout_ref.actor.megatron.expert_model_parallel_size=${EP}
+ actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=${ETP}
+ actor_rollout_ref.actor.entropy_coeff=0
+ actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode}
+ actor_rollout_ref.actor.megatron.use_mbridge=True
+ actor_rollout_ref.actor.megatron.vanilla_mbridge=False
+ actor_rollout_ref.model.use_remove_padding=False
+ )
+
+ ROLLOUT=(
+ actor_rollout_ref.rollout.name=${rollout_name}
+ actor_rollout_ref.rollout.mode=${rollout_mode}
+ actor_rollout_ref.rollout.dtype=${dtype}
+ actor_rollout_ref.rollout.gpu_memory_utilization=0.7
+ actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp}
+ actor_rollout_ref.rollout.enable_chunked_prefill=True
+ actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length))
+ actor_rollout_ref.rollout.temperature=${temperature}
+ actor_rollout_ref.rollout.top_p=${top_p}
+ actor_rollout_ref.rollout.top_k=${top_k}
+ actor_rollout_ref.rollout.val_kwargs.temperature=${temperature}
+ actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p}
+ actor_rollout_ref.rollout.val_kwargs.top_k=${top_k}
+ actor_rollout_ref.rollout.val_kwargs.do_sample=True
+ actor_rollout_ref.rollout.val_kwargs.n=1
+ actor_rollout_ref.rollout.calculate_log_probs=True
+ actor_rollout_ref.rollout.n=${n_resp_per_prompt}
+ )
+
+ TRAINER=(
+ trainer.logger=['console','wandb']
+ trainer.project_name="${project_name}"
+ trainer.experiment_name="${exp_name}"
+ trainer.n_gpus_per_node=8
+ trainer.nnodes="${NNODES}"
+ trainer.val_before_train=False
+ trainer.test_freq=5
+ trainer.save_freq=-1
+ trainer.total_epochs=10
+ trainer.default_local_dir="${CKPTS_DIR}"
+ trainer.resume_mode=auto
+ trainer.log_val_generations=10
+ )
+
+ FORWARD_ONLY_SETS=(
+ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4
+ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4
+ actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
+ actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
+ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len}
+ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len}
+ )
+
+ MODEL=(
+ actor_rollout_ref.model.path="${MODEL_PATH}"
+ )
+
+ ALGORITHM=(
+ algorithm.adv_estimator=${adv_estimator}
+ algorithm.use_kl_in_reward=${use_kl_in_reward}
+ algorithm.kl_ctrl.kl_coef=${kl_coef}
+ )
+ ################################################### start script ###################################################
+
+ python3 -m verl.trainer.main_ppo \
+ --config-path=config \
+ --config-name='ppo_megatron_trainer.yaml' \
+ "${DATA[@]}" \
+ "${ALGORITHM[@]}" \
+ "${MODEL[@]}" \
+ "${ROLLOUT[@]}" \
+ "${ACTOR[@]}" \
+ "${REWARD_MODEL[@]}" \
+ "${FP8[@]}" \
+ "${PERF_OPT[@]}" \
+ "${TRAINER[@]}" \
+ "${FORWARD_ONLY_SETS[@]}"
@@ -0,0 +1,49 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
2
+
3
+ This directory contains the implementation for reproducing the DeepEyes paper within the verl framework, supporting multi-turn visual tool calls. This implementation is based on the original [DeepEyes paper](https://arxiv.org/abs/2505.14362) and its [official implementation](https://github.com/Visual-Agent/DeepEyes), integrated with the multi-modal and multi-turn capabilities of the verl framework.
4
+
5
+ ## Reproducing the Experiment
6
+
7
+ > **Note on the 'Chart' Dataset:**
8
+ >
9
+ > The provided preprocessing script intentionally excludes `data_v0.8_visual_toolbox_v2.parquet`, which contains the 'Chart' data. This subset consists of very high-resolution images, often resembling large figures composed of multiple sub-plots, much like those found in academic papers.
10
+ >
11
+ > Consequently, even after using the zoom-in tool, the resulting cropped images remain large. This poses a significant risk of causing Out-of-Memory (OOM) errors, which can abruptly terminate the training process.
12
+ >
13
+ > **We strongly recommend against training on the 'Chart' dataset on a single node.**
14
+
15
+ > **Note on the 'thinklite' Dataset:**
16
+ > Many images in the `thinklite` dataset have a very low resolution, with either a height or width below 28 pixels. This fails to meet the minimum input size required by the Qwen-2.5VL image processor and would cause errors during data loading.
17
+ >
18
+ > To mitigate this, we upscale these low-resolution images to satisfy the processor's requirements. However, please be aware that because the original resolution is low, subsequent `crop` operations by the zoom-in tool might frequently trigger exceptions, which could in turn affect the model's tool-use performance.
19
+
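The upscaling rule described in the note above can be sketched as follows. This is a minimal sketch, not the recipe's actual preprocessing code: the constant `MIN_SIDE = 28` is taken from the note, and the helper name `target_size` is our own; a real pipeline would also resize the pixels (e.g. with PIL) to the computed size.

```python
import math

# Assumed minimum side length accepted by the Qwen2.5-VL image processor (per the note above).
MIN_SIDE = 28

def target_size(width: int, height: int, min_side: int = MIN_SIDE) -> tuple[int, int]:
    """Return an aspect-preserving (width, height) whose shorter side is at least `min_side`."""
    if min(width, height) >= min_side:
        return width, height  # already large enough; leave untouched
    scale = min_side / min(width, height)
    # ceil so that rounding can never push a side back below the minimum
    return math.ceil(width * scale), math.ceil(height * scale)
```

Note that because the upscale only interpolates existing pixels, the image gains no real detail, which is why later `crop` operations can still misbehave.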
+ First, launch an inference service to act as a judge for reward calculation. You can use the following script as a reference:
+
+ ```bash
+ python -m sglang.launch_server --model-path /path/to/Qwen2.5-72B-Instruct \
+ --port 18901 \
+ --tp-size 8 \
+ --context-length 32768 \
+ --trust-remote-code \
+ --log-requests false
+ ```
+
+ Next, you can start the training:
+
+ ```bash
+ bash recipe/deepeyes/run_deepeyes_grpo.sh
+ ```
+
+ ## Performance
+
+ See [Comment](https://github.com/volcengine/verl/pull/2398#issuecomment-3157142856) for more details.
+
+ Note: AgentLoop does not directly record num_tool_calls, but it does record num_turns. In our scenario, the number of tool calls can be computed as `num_tool_calls = num_turns / 2 - 1`.
+
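The relation above can be expressed as a small helper. This is a sketch; it simply encodes the formula stated in the note, assuming `num_turns` is the even turn count recorded by AgentLoop:

```python
def num_tool_calls(num_turns: int) -> int:
    # num_tool_calls = num_turns / 2 - 1, per the note above
    return num_turns // 2 - 1

# e.g. a trajectory recorded with 6 turns corresponds to 2 tool calls,
# and a plain single-exchange trajectory (2 turns) to 0 tool calls
```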
+ ## References and Acknowledgements
+
+ - [DeepEyes Paper](https://arxiv.org/abs/2505.14362)
+ - [DeepEyes Official Implementation](https://github.com/Visual-Agent/DeepEyes)
+
+ ---
+ If you need further details for reproduction or encounter any issues, feel free to open an issue or contact the maintainers.