Commit 1faccd4: "initial clean commit" (0 parents; initial commit)
This view is limited to 50 files because it contains too many changes.
- .gemini/config.yaml +10 -0
- .git-blame-ignore-revs +13 -0
- .github/CODEOWNERS +27 -0
- .github/ISSUE_TEMPLATE/bug-report.yml +65 -0
- .github/ISSUE_TEMPLATE/config.yml +2 -0
- .github/ISSUE_TEMPLATE/feature-request.yml +32 -0
- .github/PULL_REQUEST_TEMPLATE.md +41 -0
- .github/dependabot.yml +9 -0
- .github/workflows/README.md +73 -0
- .github/workflows/check-pr-title.yml +58 -0
- .github/workflows/cpu_unit_tests.yml +118 -0
- .github/workflows/doc.yml +101 -0
- .github/workflows/docker-build-ascend-a2.yml +84 -0
- .github/workflows/docker-build-ascend-a3.yml +84 -0
- .github/workflows/e2e_ascend.yml +166 -0
- .github/workflows/e2e_fully_async_policy.yml +170 -0
- .github/workflows/e2e_one_step_off_policy.yml +171 -0
- .github/workflows/e2e_one_step_off_policy_ascend.yml +169 -0
- .github/workflows/e2e_ppo_grpo_trainer_trtllm.yml +285 -0
- .github/workflows/e2e_ppo_trainer.yml +78 -0
- .github/workflows/e2e_ppo_trainer_megatron_sglang.yml +201 -0
- .github/workflows/e2e_ppo_trainer_megatron_sglang_2.yml +201 -0
- .github/workflows/e2e_ppo_trainer_megatron_vllm.yml +212 -0
- .github/workflows/e2e_ppo_trainer_megatron_vllm_2.yml +318 -0
- .github/workflows/e2e_ppo_trainer_megatron_vllm_2_ascend.yml +233 -0
- .github/workflows/e2e_ppo_trainer_veomni_vllm.yml +153 -0
- .github/workflows/e2e_sft_llm.yml +153 -0
- .github/workflows/e2e_sft_llm_ascend.yml +160 -0
- .github/workflows/e2e_sft_vlm.yml +128 -0
- .github/workflows/gpu_unit_tests.yml +137 -0
- .github/workflows/model.yml +184 -0
- .github/workflows/model_ascend.yml +137 -0
- .github/workflows/nightly_ascend.yml +174 -0
- .github/workflows/npu_unit_tests.yml +126 -0
- .github/workflows/pre-commit.yml +41 -0
- .github/workflows/precommit-autofix.yml +52 -0
- .github/workflows/reward_model_sglang.yml +134 -0
- .github/workflows/reward_model_vllm.yml +134 -0
- .github/workflows/reward_model_vllm_ascend.yml +113 -0
- .github/workflows/sanity.yml +108 -0
- .github/workflows/scorecard.yml +66 -0
- .github/workflows/secrets_scan.yml +22 -0
- .github/workflows/sgl.yml +165 -0
- .github/workflows/type-coverage-check.yml +31 -0
- .github/workflows/vllm.yml +169 -0
- .gitignore +139 -0
- .gitmodules +3 -0
- .pre-commit-config.yaml +45 -0
- .readthedocs.yaml +19 -0
- CONTRIBUTING.md +90 -0
.gemini/config.yaml
ADDED
@@ -0,0 +1,10 @@
+have_fun: false
+code_review:
+  disable: false
+  comment_severity_threshold: HIGH
+  max_review_comments: -1
+  pull_request_opened:
+    help: false
+    summary: false
+    code_review: true
+ignore_patterns: []
.git-blame-ignore-revs
ADDED
@@ -0,0 +1,13 @@
+# Local usage: git config blame.ignoreRevsFile .git-blame-ignore-revs
+
+# [dev] feat: immigrate from yapf & pylint to ruff based on pre-commit
+# Changed 268 files, +10k/-9k lines. This is the biggest formatter change.
+b00f77d8559b48d57a33c0132a5ba1c81891a536
+
+# [ci] refactor: reduce ruff line-length from 300 to 120
+# Changed 238 files, +6k/-1k lines. Global formatting change.
+00a10a8ef389556f957a2f36132b2358fd6a109f
+
+# [Lint] fix: linting errors in all files
+# Changed 179 files, +1k/-3k lines. Global lint fix.
+8e5ad4688a13de81727c014a3c2e2fb26324bc20
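The hashes above are verl-specific, but the mechanism the file enables can be demonstrated in a throwaway repository (hypothetical file names; assumes `git` is installed): with `blame.ignoreRevsFile` configured, `git blame` skips a listed formatting-only commit and attributes lines to the earlier substantive commit.

```shell
# Build a tiny repo with a real change followed by a whitespace-only reformat,
# then show that blame skips the reformat commit once it is listed.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
printf 'x=1\n' > f.py
git add f.py
git -c user.name=ci -c user.email=ci@example.com commit -q -m 'feat: add f'
feat=$(git rev-parse HEAD)
printf 'x = 1\n' > f.py   # whitespace-only "reformat"
git add f.py
git -c user.name=ci -c user.email=ci@example.com commit -q -m 'style: reformat'
fmt=$(git rev-parse HEAD)
git rev-parse HEAD > .git-blame-ignore-revs
git config blame.ignoreRevsFile .git-blame-ignore-revs
blamed=$(git blame --porcelain f.py | head -n1 | cut -d' ' -f1)
echo "reformat commit: $fmt"
echo "blamed commit:   $blamed"   # the feat commit, not the reformat
```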
.github/CODEOWNERS
ADDED
@@ -0,0 +1,27 @@
+/docs @eric-haibin-lin @zhaochenyang20 @hongpeng-guo
+/docs/amd_tutorial @yushengsu-thu
+/docs/slang_multiturn @zhaochenyang20 @SwordFaith
+/docs/ascend_tutorial @FightingZhen
+
+/third_party/sglang @zhaochenyang20 @SwordFaith
+/third_party/vllm @PeterSH6 @wuxibin89
+
+/examples/grpo_trainer @vermouth1992 @PeterSH6 @tardis-key @FightingZhen @ji-huazhong
+
+/verl/single_controller @zw0610 @wuxibin89 @hongpeng-guo
+/verl/trainer @eric-haibin-lin @vermouth1992 @tongyx361 @PeterSH6
+/verl/models/mcore @ISEEKYAN @vermouth1992
+/verl/models/transformers @vermouth1992 @PeterSH6 @tardis-key @FightingZhen @ji-huazhong
+/verl/workers/engine @eric-haibin-lin @vermouth1992 @ZihengJiang
+/verl/workers/roles @eric-haibin-lin @vermouth1992 @ZihengJiang
+/verl/workers/engine/fsdp @eric-haibin-lin @vermouth1992 @ZihengJiang
+/verl/workers/rollout/vllm_rollout @wuxibin89 @PeterSH6 @chenhaiq
+/verl/workers/rollout/sglang_rollout @zhaochenyang20 @SwordFaith @chenhaiq
+/verl/workers/actor/megatron_actor.py @ISEEKYAN @vermouth1992
+/verl/workers/critic/megatron_critic.py @ISEEKYAN @vermouth1992
+/verl/workers/megatron_workers.py @ISEEKYAN @vermouth1992
+/verl/experimental @wuxibin89 @ArronHZG
+
+/tests/single_controller @zw0610 @wuxibin89
+/tests/trainer @eric-haibin-lin @vermouth1992 @tongyx361 @PeterSH6
+/tests/workers/rollout/vllm_rollout @wuxibin89 @PeterSH6 @chenhaiq
.github/ISSUE_TEMPLATE/bug-report.yml
ADDED
@@ -0,0 +1,65 @@
+# modified from https://github.com/huggingface/transformers/blob/main/.github/ISSUE_TEMPLATE/bug-report.yml?plain=1
+name: "\U0001F41B Bug Report"
+description: Submit a bug report to help us improve verl
+labels: [ "bug" ]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Thanks for taking the time to fill out this bug report! 🤗
+
+  - type: textarea
+    id: system-info
+    attributes:
+      label: System Info
+      description: Please share your system info with us. You can run the command `python scripts/diagnose.py` and copy-paste its output below.
+      placeholder: verl version, platform, python version, ...
+    validations:
+      required: true
+
+  - type: checkboxes
+    id: information-scripts-examples
+    attributes:
+      label: Information
+      description: 'The problem arises when using:'
+      options:
+        - label: "The official example scripts"
+        - label: "My own modified scripts"
+
+  - type: checkboxes
+    id: information-tasks
+    attributes:
+      label: Tasks
+      description: "The tasks I am working on are:"
+      options:
+        - label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
+        - label: "My own task or dataset (give details below)"
+
+  - type: textarea
+    id: reproduction
+    validations:
+      required: true
+    attributes:
+      label: Reproduction
+      description: |
+        Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
+        Please include relevant config information with your code.
+        If you have code snippets, error messages, stack traces please provide them here as well.
+        Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
+        Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
+
+      placeholder: |
+        Steps to reproduce the behavior:
+
+          1.
+          2.
+          3.
+
+
+  - type: textarea
+    id: expected-behavior
+    validations:
+      required: true
+    attributes:
+      label: Expected behavior
+      description: "A clear and concise description of what you would expect to happen."
.github/ISSUE_TEMPLATE/config.yml
ADDED
@@ -0,0 +1,2 @@
+blank_issues_enabled: true
+version: 0.1
.github/ISSUE_TEMPLATE/feature-request.yml
ADDED
@@ -0,0 +1,32 @@
+# modified from https://github.com/huggingface/transformers/blob/main/.github/ISSUE_TEMPLATE/feature-request.yml?plain=1
+name: "\U0001F680 Feature request"
+description: Submit a proposal/request for a new verl feature
+labels: [ "Feature request" ]
+body:
+  - type: textarea
+    id: feature-request
+    validations:
+      required: true
+    attributes:
+      label: Feature request
+      description: |
+        A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist.
+
+  - type: textarea
+    id: motivation
+    validations:
+      required: true
+    attributes:
+      label: Motivation
+      description: |
+        Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too.
+
+
+  - type: textarea
+    id: contribution
+    validations:
+      required: true
+    attributes:
+      label: Your contribution
+      description: |
+        Is there any way that you could help, e.g. by submitting a PR? Make sure to read the CONTRIBUTING.MD [readme](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md)
.github/PULL_REQUEST_TEMPLATE.md
ADDED
@@ -0,0 +1,41 @@
+### What does this PR do?
+
+> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.
+
+### Checklist Before Starting
+
+- [ ] Search for similar PRs. Paste at least one query link here: ...
+- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
+  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
+  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
+  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
+  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
+    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
+
+### Test
+
+> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
+
+### API and Usage Example
+
+> Demonstrate how the API changes if any, and provide usage example(s) if possible.
+
+```python
+# Add code snippet or script demonstrating how to use this
+```
+
+### Design & Code Changes
+
+> Demonstrate the high-level design if this PR is complex, and list the specific changes.
+
+### Checklist Before Submitting
+
+> [!IMPORTANT]
+> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
+
+- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
+- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
+- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
+- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
+- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
+- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
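The title convention above can be sketched as a single regex check. This is a hypothetical simplification, not the repository's actual `tests/special_sanity/check_pr_title.py` (which also validates the module whitelist; this sketch accepts any lowercase module names).

```shell
# Accept titles of the form: [BREAKING]?[mod, mod, ...] type: description
check_title() {
  printf '%s' "$1" | grep -Eq '^(\[BREAKING\])?\[[a-z_]+(, ?[a-z_]+)*\] (feat|fix|refactor|chore|test): .+'
}
check_title '[BREAKING][fsdp, megatron] feat: dynamic batching' && echo accepted
check_title 'fix stuff' || echo rejected
```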
.github/dependabot.yml
ADDED
@@ -0,0 +1,9 @@
+## Enabled the dependabot to check the dependencies of the project
+## Dependabot will open pull requests to update dependencies automatically
+
+version: 2
+updates:
+  - package-ecosystem: pip
+    directory: "/"
+    schedule:
+      interval: weekly
.github/workflows/README.md
ADDED
@@ -0,0 +1,73 @@
+### Adding a New Workflow
+
+When adding a new workflow for continuous integration (CI), you have two runner options: a fixed runner or an elastic runner from Vemlp.
+
+- **Fixed Runner**: To use a fixed runner, specify it in your workflow using the `runs-on` keyword, like `runs-on: [L20x8]`.
+- **Vemlp Runner**: Opting for a Vemlp machine allows you to launch tasks elastically.
+
+Here is a template to assist you. It is designed for Vemlp machines. Currently, each workflow needs a `setup` and a `cleanup` job. When using this template, the main parts you need to modify are the `IMAGE` environment variable and the specific job steps.
+
+```yaml
+name: Your Default Workflow
+
+on:
+  push:
+    branches:
+      - main
+      - v0.*
+  pull_request:
+    branches:
+      - main
+      - v0.*
+    paths:
+      - "**/*.py"
+      - ".github/workflows/template.yml"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+permissions:
+  contents: read
+
+env:
+  IMAGE: "your vemlp image" # e.g. "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
+  DYNAMIC_RUNNER_URL: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner" # public veFaas api
+
+jobs:
+  setup:
+    if: github.repository_owner == 'verl-project'
+    runs-on: ubuntu-latest
+    outputs:
+      runner-label: ${{ steps.create-runner.outputs.runner-label }}
+      task-id: ${{ steps.create-runner.outputs.task-id }}
+    steps:
+      - uses: actions/checkout@v4
+      - id: create-runner
+        uses: volcengine/vemlp-github-runner@v1
+        with:
+          mode: "create"
+          faas-url: "${{ env.DYNAMIC_RUNNER_URL }}"
+          image: "${{ env.IMAGE }}"
+
+  your_job:
+    needs: setup
+    runs-on: ["${{ needs.setup.outputs.runner-label || 'default-runner' }}"]
+    steps:
+      xxxx # your jobs
+
+  cleanup:
+    runs-on: ubuntu-latest
+    needs: [setup, your_job]
+    if: always()
+    steps:
+      - id: destroy-runner
+        uses: volcengine/vemlp-github-runner@v1
+        with:
+          mode: "destroy"
+          faas-url: "${{ env.DYNAMIC_RUNNER_URL }}"
+          task-id: "${{ needs.setup.outputs.task-id }}"
+```
+
+### Model and Dataset
+To avoid the CI depending on the network, we pre-download datasets to an NFS volume on the CI machines. The path for models is ${HOME}/models and the path for datasets is ${HOME}/models/hf_data.
.github/workflows/check-pr-title.yml
ADDED
@@ -0,0 +1,58 @@
+# # Tests layout
+
+# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+# - `tests/trainer` for testing functionality related to `verl/trainer`
+# - `tests/models` for testing functionality related to `verl/models`
+# - ...
+
+# There are a few folders with `special_` prefix, created for special purposes:
+# - `special_distributed`: unit tests that must run with multiple GPUs
+# - `special_e2e`: end-to-end tests with training/generation scripts
+# - `special_npu`: tests for NPUs
+# - `special_sanity`: a suite of quick sanity tests
+# - `special_standalone`: a set of tests designed to run in dedicated environments
+
+# Accelerators for tests
+# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
+# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+# # Workflow layout
+
+# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+# 3. End-to-end tests: `e2e_*.yml`
+# 4. Unit tests
+#   - `cpu_unit_tests.yml`, run pytest on all scripts with the file name pattern `tests/**/test_*_on_cpu.py`
+#   - `gpu_unit_tests.yml`, run pytest on all test scripts without the `on_cpu.py` suffix.
+#   - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
+#     - a new workflow yaml is added to `.github/workflows`
+#     - new tests are added to a workflow mentioned in 2.
+
+
+on:
+  pull_request:
+    types: [opened, edited, synchronize]
+
+jobs:
+  check-title:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Run PR title checker
+        run: python3 tests/special_sanity/check_pr_title.py
+        env:
+          PR_TITLE: ${{ github.event.pull_request.title }}
+
+      - name: Run PR description checker
+        run: python3 tests/special_sanity/check_pr_description.py
+        env:
+          PR_TITLE: ${{ github.event.pull_request.title }}
+          GITHUB_EVENT_PATH: ${{ github.event_path }}
.github/workflows/cpu_unit_tests.yml
ADDED
@@ -0,0 +1,118 @@
+# # Tests layout
+
+# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+# - `tests/trainer` for testing functionality related to `verl/trainer`
+# - `tests/models` for testing functionality related to `verl/models`
+# - ...
+
+# There are a few folders with `special_` prefix, created for special purposes:
+# - `special_distributed`: unit tests that must run with multiple GPUs
+# - `special_e2e`: end-to-end tests with training/generation scripts
+# - `special_npu`: tests for NPUs
+# - `special_sanity`: a suite of quick sanity tests
+# - `special_standalone`: a set of tests designed to run in dedicated environments
+
+# Accelerators for tests
+# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
+# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+# # Workflow layout
+
+# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+# 3. End-to-end tests: `e2e_*.yml`
+# 4. Unit tests
+#   - `cpu_unit_tests.yml`, run pytest on all scripts with the file name pattern `tests/**/test_*_on_cpu.py`
+#   - `gpu_unit_tests.yml`, run pytest on all test scripts without the `on_cpu.py` suffix.
+#   - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
+#     - a new workflow yaml is added to `.github/workflows`
+#     - new tests are added to a workflow mentioned in 2.
+
+name: cpu_unit_tests
+
+on:
+  # Trigger the workflow on push or pull request,
+  # but only for the main branch
+  push:
+    branches:
+      - main
+      - v0.*
+  pull_request:
+    branches:
+      - main
+      - v0.*
+    paths:
+      - "**/*.py"
+      - .github/workflows/cpu_unit_tests.yml
+
+# Cancel jobs on the same ref if a new one is triggered
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+# Declare permissions just read content.
+permissions:
+  contents: read
+
+env:
+  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
+  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
+
+jobs:
+  setup:
+    if: github.repository_owner == 'verl-project'
+    runs-on: ubuntu-latest
+    outputs:
+      runner-label: ${{ steps.create-runner.outputs.runner-label }}
+      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
+    steps:
+      - uses: actions/checkout@v4
+      - id: create-runner
+        uses: volcengine/vemlp-github-runner@v1
+        with:
+          mode: "create"
+          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+          mlp-image: "${{ env.IMAGE }}"
+
+  cpu_unit_tests:
+    needs: setup
+    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+    timeout-minutes: 20 # Increase this timeout value as needed
+    env:
+      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+      HF_ENDPOINT: "https://hf-mirror.com"
+      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+      TORCH_COMPILE_DISABLE: 1
+      TORCHINDUCTOR_DISABLE: 1
+    steps:
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+        with:
+          fetch-depth: 0
+      - name: Install the current repository
+        run: |
+          pip3 install -r requirements-test.txt
+          pip3 install --no-deps -e .
+          pip3 install --upgrade "transformers>=5.0.0"
+      - name: Download datasets
+        run: |
+          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
+          python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/hiyouga/geometry3k
+      - name: Running CPU unit tests
+        run: |
+          echo '[pytest]' > pytest.ini
+          echo 'python_files = *_on_cpu.py' >> pytest.ini
+          pytest -s -x --asyncio-mode=auto tests/
+
+  cleanup:
+    runs-on: ubuntu-latest
+    needs: [setup, cpu_unit_tests]
+    if: always()
+    steps:
+      - id: destroy-runner
+        uses: volcengine/vemlp-github-runner@v1
+        with:
+          mode: "destroy"
+          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
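The CPU job above narrows pytest collection to `*_on_cpu.py` files by generating a `pytest.ini` on the fly. The same trick works locally, sketched here (the trailing pytest invocation assumes a verl checkout with test requirements installed, so it is left as a comment):

```shell
# Generate a pytest.ini that restricts test collection to CPU-only test files.
printf '[pytest]\npython_files = *_on_cpu.py\n' > pytest.ini
cat pytest.ini
# then: pytest -s -x --asyncio-mode=auto tests/
```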
.github/workflows/doc.yml
ADDED
|
@@ -0,0 +1,101 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# # Tests layout
|
| 2 |
+
|
| 3 |
+
# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
|
| 4 |
+
# - `tests/trainer` for testing functionality related to `verl/trainer`
|
| 5 |
+
# - `tests/models` for testing functionality related to `verl/models`
|
| 6 |
+
# - ...
|
| 7 |
+
|
| 8 |
+
# There are a few folders with `special_` prefix, created for special purposes:
|
| 9 |
+
# - `special_distributed`: unit tests that must run with multiple GPUs
|
| 10 |
+
# - `special_e2e`: end-to-end tests with training/generation scripts
|
| 11 |
+
# - `special_npu`: tests for NPUs
|
| 12 |
+
# - `special_sanity`: a suite of quick sanity tests
|
| 13 |
+
# - `special_standalone`: a set of test that are designed to run in dedicated environments
|
| 14 |
+
|
| 15 |
+
# Accelerators for tests
|
| 16 |
+
# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
|
| 17 |
+
# - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
|
| 18 |
+
|
| 19 |
+
# # Workflow layout
|
| 20 |
+
|
| 21 |
+
# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
|
| 22 |
+
# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
|
| 23 |
+
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
|
| 24 |
+
# 3. End-to-end tests: `e2e_*.yml`
|
| 25 |
+
# 4. Unit tests
|
| 26 |
+
# - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
|
| 27 |
+
#   - `gpu_unit_tests.yml`, run pytest on all test scripts whose file names do not end with the `on_cpu.py` suffix.
#   - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to a workflow mentioned in 2.

name: doc_test

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - "docs/**"
      - .github/workflows/doc.yml

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions: just read content.
permissions:
  contents: read # for checkout
  pages: write # for deploy-pages
  id-token: write # for deploy-pages

jobs:
  doc_test:
    runs-on: ubuntu-latest
    timeout-minutes: 5 # Increase this timeout value as needed
    strategy:
      matrix:
        python-version: ["3.10"]
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip install -r docs/requirements-docs.txt

      - name: Run doc make html
        run: |
          cd docs
          make clean
          make html SPHINXOPTS="--keep-going -w _build/sphinx.log"
          if grep -q ": ERROR:" _build/sphinx.log; then
            echo "🚨 Sphinx doc build contained ERRORs - see _build/sphinx.log"
            exit 1
          fi
          if grep -q "WARNING: document isn't included in any toctree" _build/sphinx.log; then
            echo "🚨 Sphinx doc build contained WARNING. Please include newly added docs in index.rst. See _build/sphinx.log for details"
            exit 1
          fi
          if grep -q "WARNING: Inline emphasis" _build/sphinx.log; then
            echo "🚨 Sphinx doc build contained WARNING. Please check inline emphasis is correct. See _build/sphinx.log for details"
            exit 1
          fi
          if grep -q "WARNING: Definition list ends without a blank line" _build/sphinx.log; then
            echo "🚨 Sphinx doc build contained WARNING. Please check if the indentation is correct. See _build/sphinx.log for details"
            exit 1
          fi
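The grep-based gate above can be rehearsed locally against a synthetic log before touching CI. A minimal sketch (the log contents below are made up for illustration; the real workflow greps `_build/sphinx.log` produced by `make html`):

```shell
# Build a synthetic Sphinx log and apply the same grep checks the workflow uses.
log=$(mktemp)
printf '%s\n' \
  "reading sources... [100%] index" \
  "/docs/index.rst:10: WARNING: document isn't included in any toctree" \
  > "$log"

status=0
if grep -q ": ERROR:" "$log"; then
  echo "found ERROR"
  status=1
fi
if grep -q "WARNING: document isn't included in any toctree" "$log"; then
  echo "found orphan doc warning"
  status=1
fi
echo "status=$status"
rm -f "$log"
```

Using `grep -q` keeps the step quiet and relies only on the exit code, which is why each check wraps it in an `if` rather than piping output.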
.github/workflows/docker-build-ascend-a2.yml
ADDED
@@ -0,0 +1,84 @@
name: docker-build-ascend-a2

on:
  workflow_dispatch:
  push:
    branches: ["main"]
    paths:
      - "docker/ascend/Dockerfile.ascend_8.5.0_a2"
      - ".github/workflows/docker-build-ascend-a2.yml"
  release:
    types: [published]
  schedule:
    - cron: "0 16 * * *"

jobs:
  build-ascend-image-a2:
    if: ${{ github.event_name != 'pull_request' && github.repository_owner == 'verl-project' }}
    runs-on: ubuntu-latest
    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}-build-ascend-image-a2
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    steps:
      - name: Remove unnecessary parts in GitHub Actions runners to free up disk space
        uses: jlumbroso/free-disk-space@v1.3.1
        with:
          tool-cache: true

      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Get base image name and tag
        id: base_image
        run: |
          BASE_IMAGE_FULL=$(grep '^FROM' ./docker/ascend/Dockerfile.ascend_8.5.0_a2 | head -1 | cut -d' ' -f2)
          echo "Base image full: $BASE_IMAGE_FULL"
          BASE_IMAGE_TAG=$(echo "$BASE_IMAGE_FULL" | cut -d':' -f2)
          echo "Base image tag: $BASE_IMAGE_TAG"
          NEW_IMAGE_NAME="verl-$BASE_IMAGE_TAG"
          echo "New image name: $NEW_IMAGE_NAME"
          echo "base_image_tag=$BASE_IMAGE_TAG" >> "$GITHUB_OUTPUT"
          echo "new_image_name=$NEW_IMAGE_NAME" >> "$GITHUB_OUTPUT"

      - name: Get image tag
        id: version
        run: |
          BRANCH_NAME=$(echo "${{ github.ref }}" | sed 's/refs\/heads\///g' | sed 's/[^a-zA-Z0-9._-]/_/g')
          if [ "${{ github.event_name }}" = "release" ]; then
            echo "tag=${{ steps.base_image.outputs.new_image_name }}-${{ github.event.release.tag_name }}" >> "$GITHUB_OUTPUT"
          elif [ "$BRANCH_NAME" = "main" ]; then
            echo "tag=${{ steps.base_image.outputs.new_image_name }}-latest" >> "$GITHUB_OUTPUT"
          fi

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Quay.io
        uses: docker/login-action@v3
        with:
          registry: quay.io
          username: ${{ secrets.QUAY_USERNAME }}
          password: ${{ secrets.QUAY_PASSWORD }}

      - name: Clean Docker cache before build
        run: |
          docker system prune -a -f --volumes || true

      - name: Build and push images to Quay
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          file: ./docker/ascend/Dockerfile.ascend_8.5.0_a2
          push: true
          tags: |
            quay.io/ascend/verl:${{ steps.version.outputs.tag }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            BUILDKIT_INLINE_CACHE=1
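The "Get base image name and tag" step derives the published image name from the Dockerfile's first `FROM` line. The same pipeline can be checked against a synthetic Dockerfile (the base image name below is invented for illustration, not the real CANN base image):

```shell
# Derive the verl image name from a Dockerfile's first FROM line,
# mirroring the grep | head | cut pipeline used in the workflow.
dockerfile=$(mktemp)
printf 'FROM quay.io/ascend/cann:8.5.0-910b-ubuntu22.04-py3.11\nRUN echo hi\n' > "$dockerfile"

BASE_IMAGE_FULL=$(grep '^FROM' "$dockerfile" | head -1 | cut -d' ' -f2)   # image ref after FROM
BASE_IMAGE_TAG=$(echo "$BASE_IMAGE_FULL" | cut -d':' -f2)                 # text after the colon
NEW_IMAGE_NAME="verl-$BASE_IMAGE_TAG"
echo "$NEW_IMAGE_NAME"
rm -f "$dockerfile"
```

Note that `head -1` matters for multi-stage Dockerfiles: only the first `FROM` determines the published tag.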
.github/workflows/docker-build-ascend-a3.yml
ADDED
@@ -0,0 +1,84 @@
name: docker-build-ascend-a3

on:
  workflow_dispatch:
  push:
    branches: ["main"]
    paths:
      - "docker/ascend/Dockerfile.ascend_8.5.0_a3"
      - ".github/workflows/docker-build-ascend-a3.yml"
  release:
    types: [published]
  schedule:
    - cron: "0 19 * * *"

jobs:
  build-ascend-image-a3:
    if: ${{ github.event_name != 'pull_request' && github.repository_owner == 'verl-project' }}
    runs-on: ubuntu-latest
    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}-build-ascend-image-a3
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    steps:
      - name: Remove unnecessary parts in GitHub Actions runners to free up disk space
        uses: jlumbroso/free-disk-space@v1.3.1
        with:
          tool-cache: true

      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Get base image name and tag
        id: base_image
        run: |
          BASE_IMAGE_FULL=$(grep '^FROM' ./docker/ascend/Dockerfile.ascend_8.5.0_a3 | head -1 | cut -d' ' -f2)
          echo "Base image full: $BASE_IMAGE_FULL"
          BASE_IMAGE_TAG=$(echo "$BASE_IMAGE_FULL" | cut -d':' -f2)
          echo "Base image tag: $BASE_IMAGE_TAG"
          NEW_IMAGE_NAME="verl-$BASE_IMAGE_TAG"
          echo "New image name: $NEW_IMAGE_NAME"
          echo "base_image_tag=$BASE_IMAGE_TAG" >> "$GITHUB_OUTPUT"
          echo "new_image_name=$NEW_IMAGE_NAME" >> "$GITHUB_OUTPUT"

      - name: Get image tag
        id: version
        run: |
          BRANCH_NAME=$(echo "${{ github.ref }}" | sed 's/refs\/heads\///g' | sed 's/[^a-zA-Z0-9._-]/_/g')
          if [ "${{ github.event_name }}" = "release" ]; then
            echo "tag=${{ steps.base_image.outputs.new_image_name }}-${{ github.event.release.tag_name }}" >> "$GITHUB_OUTPUT"
          elif [ "$BRANCH_NAME" = "main" ]; then
            echo "tag=${{ steps.base_image.outputs.new_image_name }}-latest" >> "$GITHUB_OUTPUT"
          fi

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Quay.io
        uses: docker/login-action@v3
        with:
          registry: quay.io
          username: ${{ secrets.QUAY_USERNAME }}
          password: ${{ secrets.QUAY_PASSWORD }}

      - name: Clean Docker cache before build
        run: |
          docker system prune -a -f --volumes || true

      - name: Build and push images to Quay
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          file: ./docker/ascend/Dockerfile.ascend_8.5.0_a3
          push: true
          tags: |
            quay.io/ascend/verl:${{ steps.version.outputs.tag }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            BUILDKIT_INLINE_CACHE=1
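The branch-name sanitization in the "Get image tag" step strips the `refs/heads/` prefix and replaces every character that is not valid in a Docker tag with an underscore. A quick sketch of that sed pipeline (the second ref is a hypothetical feature branch, used only to exercise the substitution):

```shell
# Turn a git ref into a Docker-tag-safe name, as the "Get image tag" step does.
sanitize() {
  echo "$1" | sed 's/refs\/heads\///g' | sed 's/[^a-zA-Z0-9._-]/_/g'
}
out1=$(sanitize "refs/heads/main")            # -> main
out2=$(sanitize "refs/heads/feat/ascend+a3")  # -> feat_ascend_a3
echo "$out1 $out2"
```

The workflow then only publishes when the sanitized name is `main` (tagged `-latest`) or when the trigger is a release, so feature-branch pushes never produce a tag.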
.github/workflows/e2e_ascend.yml
ADDED
@@ -0,0 +1,166 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#   - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
#   - `gpu_unit_tests.yml`, run pytest on all test scripts whose file names do not end with the `on_cpu.py` suffix.
#   - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to a workflow mentioned in 2.

name: e2e_ascend

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
    paths:
      - ".github/workflows/e2e_ascend.yml"
      - "examples/data_preprocess/**"
      - "examples/grpo_trainer/**"
      - "examples/ppo_trainer/**"
      - "examples/sft/**"
      - "verl/experimental/one_step_off_policy/**"
      - "tests/special_npu/**"
      - "tests/special_sanity/check_device_api_usage.py"
      - "verl/**"
      - "pyproject.toml"
      - "requirements-npu.txt"
      - "setup.py"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

permissions:
  contents: read

jobs:
  llm_rl_job:
    if: github.repository_owner == 'verl-project'
    name: E2E Ascend testing for RL training scenarios of LLM models
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 120
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check npu and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout volcengine/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install -e .
      - name: Check final pip list
        run: |
          pip list
      - name: Preprocess gsm8k dataset
        run: |
          python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
      - name: Running gsm8k e2e training tests with PPO on ASCEND NPU (FSDP backend)
        run: |
          ray stop --force
          bash tests/special_npu/run_qwen3_06b_ppo.sh
          rm -rf $HOME/ckpts
      - name: Running gsm8k e2e training tests with GRPO on ASCEND NPU (FSDP backend)
        run: |
          ray stop --force
          bash tests/special_npu/run_qwen2_5_05b_grpo.sh
          rm -rf $HOME/ckpts
      - name: Running gsm8k e2e training tests with GRPO on ASCEND NPU (MindSpeed backend)
        run: |
          ray stop --force
          USE_DIST_CKPT=True bash tests/special_npu/run_qwen2_5_05b_grpo_mindspeed.sh
          rm -rf $HOME/dist_ckpt/qwen2_5_05b_grpo_mindspeed
          rm -rf $HOME/ckpts
      - name: Running gsm8k e2e training tests with GRPO on ASCEND NPU (MindSpeed backend, MoE Model)
        run: |
          ray stop --force
          USE_DIST_CKPT=True USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen3moe_minimal.json DUMMY_MODEL_PATH=$HOME/dist_ckpt/qwen3_30b_grpo_mindspeed bash tests/special_npu/run_qwen3_30b_grpo_mindspeed.sh
      - name: Running the E2E test with fully_async_policy algorithm (FSDP2)
        run: |
          ray stop --force
          bash tests/special_npu/run_fully_async_policy.sh

  vlm_rl_job:
    if: github.repository_owner == 'verl-project'
    name: E2E Ascend testing for RL training scenarios of VLM models
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 120
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check npu and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout volcengine/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install -e .
      - name: Check final pip list
        run: |
          pip list
      - name: Preprocess geo3k dataset
        run: |
          python examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/.cache/datasets/hiyouga/geometry3k
      - name: Running geo3k e2e training tests with GRPO on ASCEND NPU
        run: |
          ray stop --force
          bash tests/special_npu/run_qwen2_5_vl_3b_npu.sh
          rm -rf $HOME/ckpts
.github/workflows/e2e_fully_async_policy.yml
ADDED
@@ -0,0 +1,170 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#   - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
#   - `gpu_unit_tests.yml`, run pytest on all test scripts whose file names do not end with the `on_cpu.py` suffix.
#   - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to a workflow mentioned in 2.

name: e2e_fully_async_policy

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  # For push, for now only anti-patterns are specified so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - "!**/*.md"
      - "!**/*.sh"
      # Other entrypoints
      - "!examples/*trainer*"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      - "verl/experimental/fully_async_policy"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - "!**/*.md"
      - "!**/*.sh"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Home
      - "verl/experimental/fully_async_policy"
      # Entrypoints
      - ".github/workflows/e2e_fully_async_policy.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/special_e2e/run_fully_async_policy.sh"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions: just read content.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  # Test FSDP2 strategy
  e2e_fully_async_policy_fsdp2:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 10 # Increase timeout for async training
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      ACTOR_STRATEGY: "fsdp2"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install cupy-cuda12x==13.6.0
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running the E2E test with fully_async_policy algorithm (FSDP2)
        run: |
          ray stop --force
          bash tests/special_e2e/run_fully_async_policy.sh

  # Test Megatron strategy
  e2e_fully_async_policy_megatron:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 10 # Increase timeout for async training
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      ACTOR_STRATEGY: "megatron"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install cupy-cuda12x==13.6.0
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running the E2E test with fully_async_policy algorithm (Megatron)
        run: |
          ray stop --force
          bash tests/special_e2e/run_fully_async_policy.sh

  cleanup:
    runs-on: ubuntu-latest
    needs: [setup, e2e_fully_async_policy_fsdp2]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/e2e_one_step_off_policy.yml
ADDED
@@ -0,0 +1,171 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#   - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
#   - `gpu_unit_tests.yml`, run pytest on all test scripts whose file names do not end with the `on_cpu.py` suffix.
#   - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to a workflow mentioned in 2.

name: e2e_one_step_off_policy

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  # For push, for now only anti-patterns are specified so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - "!**/*.md"
      - "!**/*.sh"
      # Other entrypoints
      - "!examples/*trainer*"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      - "verl/experimental/one_step_off_policy"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - "!**/*.md"
      - "!**/*.sh"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Home
      - "verl/experimental/one_step_off_policy"
      # Entrypoints
      - ".github/workflows/e2e_one_step_off_policy.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/special_e2e/run_one_step_off_policy.sh"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions: just read content.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  # Test FSDP2 strategy
  e2e_one_step_off_policy_fsdp2:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 10 # Increase timeout for async training
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      ACTOR_STRATEGY: "fsdp2"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install cupy-cuda12x==13.6.0
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running the E2E test with one_step_off_policy algorithm (FSDP2)
        run: |
          ray stop --force
          bash tests/special_e2e/run_one_step_off_policy.sh

  # Test Megatron strategy
  e2e_one_step_off_policy_megatron:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 10 # Increase timeout for async training
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      ACTOR_STRATEGY: "megatron"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
|
| 148 |
+
run: |
|
| 149 |
+
pip3 install -r requirements-test.txt
|
| 150 |
+
pip3 install --no-deps -e .
|
| 151 |
+
pip3 install cupy-cuda12x==13.6.0
|
| 152 |
+
- name: Prepare GSM8K dataset
|
| 153 |
+
run: |
|
| 154 |
+
python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
|
| 155 |
+
- name: Running the E2E test with one_step_off_policy algorithm (Megatron)
|
| 156 |
+
run: |
|
| 157 |
+
ray stop --force
|
| 158 |
+
bash tests/special_e2e/run_one_step_off_policy.sh
|
| 159 |
+
|
| 160 |
+
cleanup:
|
| 161 |
+
runs-on: ubuntu-latest
|
| 162 |
+
needs:
|
| 163 |
+
[setup, e2e_one_step_off_policy_fsdp2, e2e_one_step_off_policy_megatron]
|
| 164 |
+
if: always()
|
| 165 |
+
steps:
|
| 166 |
+
- id: destroy-runner
|
| 167 |
+
uses: volcengine/vemlp-github-runner@v1
|
| 168 |
+
with:
|
| 169 |
+
mode: "destroy"
|
| 170 |
+
faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
|
| 171 |
+
mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
|
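The `paths` filters above mix include and exclude globs; GitHub Actions evaluates them in order, and the last pattern that matches a changed file decides whether that file counts toward triggering the workflow. A minimal sketch of that rule, using Python's `fnmatch` as a rough stand-in for GitHub's `**` glob syntax (the filter list is abbreviated from the workflow above):

```python
from fnmatch import fnmatch

def path_triggers(path: str, patterns: list[str]) -> bool:
    """Decide whether a changed file matches an ordered `paths` filter.

    GitHub Actions semantics: patterns are checked in order, a leading
    '!' negates, and the last matching pattern wins. fnmatch only
    approximates GitHub's `**` globs, so this is a sketch, not a spec.
    """
    matched = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        body = pattern[1:] if negated else pattern
        if fnmatch(path, body):
            matched = not negated
    return matched

# Abbreviated filter list taken from the workflow above
filters = [
    "**/*.py",
    "!tests/**",
    "verl/experimental/one_step_off_policy/**",
]
print(path_triggers("verl/trainer/ppo/core_algos.py", filters))  # True
print(path_triggers("tests/trainer/test_core.py", filters))      # False
```

This is why the re-include lines under `# Entrypoints` come last: they win over the earlier `!tests/**`-style excludes for exactly those files.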
.github/workflows/e2e_one_step_off_policy_ascend.yml
ADDED
@@ -0,0 +1,169 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by YAML files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml` runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml` runs pytest on all test scripts without the `on_cpu.py` suffix.
#    - Since the CPU/GPU unit tests run all tests under `tests` by default, make sure tests are manually excluded from them when
#      - a new workflow YAML is added to `.github/workflows`, or
#      - new tests are added to the workflows mentioned in 2.

name: e2e_one_step_off_policy_ascend

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified, so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - "!**/*.md"
      - "!**/*.sh"
      # Other entrypoints
      - "!examples/*trainer*"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      - "verl/experimental/one_step_off_policy"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - "!**/*.md"
      - "!**/*.sh"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Home
      - "verl/experimental/one_step_off_policy"
      # Entrypoints
      - ".github/workflows/e2e_one_step_off_policy_ascend.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/special_npu/run_one_step_off_policy.sh"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read contents.
permissions:
  contents: read

jobs:
  # Test FSDP2 strategy
  e2e_one_step_off_policy_fsdp2_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 60 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      ACTOR_STRATEGY: "fsdp2"
    steps:
      - name: Check NPU and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout verl-project/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Prepare GSM8K dataset
        run: |
          python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
      - name: Run the E2E test with the one_step_off_policy algorithm (FSDP2)
        run: |
          ray stop --force
          bash tests/special_npu/run_one_step_off_policy.sh

  # Test Megatron strategy
  e2e_one_step_off_policy_megatron_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 60 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      ACTOR_STRATEGY: "megatron"
    steps:
      - name: Check NPU and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout verl-project/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Prepare GSM8K dataset
        run: |
          python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
      - name: Run the E2E test with the one_step_off_policy algorithm (Megatron)
        run: |
          ray stop --force
          bash tests/special_npu/run_one_step_off_policy.sh
.github/workflows/e2e_ppo_grpo_trainer_trtllm.yml
ADDED
@@ -0,0 +1,285 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by YAML files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml` runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml` runs pytest on all test scripts without the `on_cpu.py` suffix.
#    - Since the CPU/GPU unit tests run all tests under `tests` by default, make sure tests are manually excluded from them when
#      - a new workflow YAML is added to `.github/workflows`, or
#      - new tests are added to the workflows mentioned in 2.

name: e2e_ppo_trainer_megatron_trtllm

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified, so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Recipes
      - "!recipe/**"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!docker/**"
      # Docs
      - "!**/*.md"
      - "!docs/**"
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Recipes
      - "!recipe/**"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
      # Entrypoints
      - "verl/workers/rollout/trtllm_rollout/**"
      - "tests/workers/rollout/rollout_trtllm/**"
      - ".github/workflows/e2e_ppo_grpo_trainer_trtllm.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      - "examples/data_preprocess/dapo_multiturn_w_tool.py"
      - "examples/data_preprocess/aime2024_multiturn_w_tool.py"
      - "examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh"
      - "examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh"
      - "examples/grpo_trainer/run_qwen3-30b_dapo_megatron_fp8_trtllm.sh"
      # Add back when the PPO flow is ready:
      # - "tests/special_e2e/run_ppo_trainer_megatron.sh"
      # - "verl/trainer/main_ppo.py"
      # - "verl/trainer/config/ppo_megatron_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read contents.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:trtllm1.3.0rc4"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  trtllm_unit_tests:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install pytest-asyncio
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Run TRTLLM unit tests
        run: |
          export TRTLLM_TEST_MODEL_PATH_ROOT="${HOME}/models"
          ray stop --force
          pytest -v -s \
            tests/workers/rollout/rollout_trtllm/test_adapter.py \
            tests/workers/rollout/rollout_trtllm/test_async_server.py \
            tests/workers/rollout/rollout_trtllm/test_trtllm_rollout_utils.py

  e2e_grpo_trainer_fsdp-qwen2:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_save_dir ${PWD}/data/gsm8k
      - name: Run GSM8K E2E training tests with FSDP on 8 L20 GPUs (Qwen)
        run: |
          ray stop --force
          DATADIR=${HOME}/data \
          bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 2 \
            trainer.total_training_steps=1 \
            data.train_files="['${PWD}/data/gsm8k/train.parquet']" \
            data.val_files="['${PWD}/data/gsm8k/test.parquet']" \
            trainer.logger='["console"]' \
            actor_rollout_ref.model.path="${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct"
      - name: Clean up
        run: |
          rm -rf checkpoints

  e2e_grpo_trainer_megatron-qwen2:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_save_dir ${PWD}/data/gsm8k
      - name: Run GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
        run: |
          ray stop --force
          DATADIR=${HOME}/data \
          ACTOR_TP=2 \
          bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 2 \
            trainer.total_training_steps=1 \
            data.train_files="['${PWD}/data/gsm8k/train.parquet']" \
            data.val_files="['${PWD}/data/gsm8k/test.parquet']" \
            trainer.logger='["console"]' \
            actor_rollout_ref.model.path="${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct"
      - name: Clean up
        run: |
          rm -rf checkpoints

  e2e_grpo_trainer_fsdp-vlm:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare GEO3K dataset
        run: |
          python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/geo3k --local_save_dir ${PWD}/data/geo3k
      - name: Run GEO3K E2E training tests with FSDP on 8 L20 GPUs (VLM)
        run: |
          ray stop --force
          DATADIR=${HOME}/data \
          bash examples/grpo_trainer/run_qwen2_5_vl_3b_trtllm.sh 2 \
            trainer.total_training_steps=1 \
            data.train_files="['${PWD}/data/geo3k/train.parquet']" \
            data.val_files="['${PWD}/data/geo3k/test.parquet']" \
            trainer.logger='["console"]' \
            actor_rollout_ref.model.path="${HOME}/models/Qwen/Qwen3-VL-2B-Instruct"
      - name: Clean up
        run: |
          rm -rf checkpoints
      - name: Prepare DAPO-Math-17k and AIME-2024 datasets (data_preprocess)
        run: |
          python3 examples/data_preprocess/dapo_multiturn_w_tool.py --local_save_dir ${PWD}/data/dapo-math-17k
          python3 examples/data_preprocess/aime2024_multiturn_w_tool.py --local_save_dir ${PWD}/data/aime-2024
      - name: Run DAPO E2E with FP8 TRT-LLM rollout (Qwen3-0.6B)
        run: |
          ray stop --force
          export INFER_TP=2 ACTOR_TP=2 ACTOR_PP=2 ACTOR_VPP=2 ACTOR_EP=1 ACTOR_CP=2 REF_TP=2 REF_PP=2 REF_VPP=2 REF_EP=1 REF_CP=2 GEN_MOE_TP=null GEN_MOE_EP=null
          export NNODES=1 GPUS_PER_NODE=8 TRTLLM_MOE_BACKEND=CUTLASS
          export DATA_DIR=${PWD} DAPO_MATH_TRAIN=${PWD}/data/dapo-math-17k/train.parquet AIME_VAL=${PWD}/data/aime-2024/train.parquet MODEL_PATH=${HOME}/models/Qwen/Qwen3-0.6B
          bash examples/grpo_trainer/run_qwen3-30b_dapo_megatron_fp8_trtllm.sh \
            reward_model.reward_kwargs.overlong_buffer_cfg.len=258 \
            reward_model.reward_kwargs.max_resp_len=512 \
            data.max_prompt_length=512 \
            data.max_response_length=512 \
            data.train_batch_size=32 \
            actor_rollout_ref.rollout.n=4 \
            actor_rollout_ref.rollout.max_num_seqs=16 \
            actor_rollout_ref.rollout.max_num_batched_tokens=1024 \
            actor_rollout_ref.rollout.max_model_len=1024 \
            actor_rollout_ref.actor.megatron.override_transformer_config.moe_grouped_gemm=False \
            actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=False \
            trainer.total_training_steps=1 \
            trainer.logger='["console"]'
      - name: Clean up
        run: |
          rm -rf checkpoints

  cleanup:
    runs-on: ubuntu-latest
    needs: [setup, trtllm_unit_tests, e2e_grpo_trainer_fsdp-qwen2, e2e_grpo_trainer_megatron-qwen2, e2e_grpo_trainer_fsdp-vlm]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
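The test jobs above configure each run through dotted command-line overrides such as `trainer.total_training_steps=1` and `trainer.logger='["console"]'`, which verl's entrypoints resolve with Hydra/OmegaConf. A toy sketch of how such `section.key=value` strings map onto a nested config (illustrative only; `apply_overrides` is a hypothetical helper, not verl's actual parser):

```python
import ast

def apply_overrides(cfg: dict, overrides: list[str]) -> dict:
    """Merge Hydra-style 'section.key=value' overrides into a nested dict.

    Illustrative sketch: verl's real entrypoints parse these with
    Hydra/OmegaConf, which also handle interpolation and type checking.
    """
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        try:
            value = ast.literal_eval(raw)  # numbers, lists, booleans, ...
        except (ValueError, SyntaxError):
            value = raw  # anything else (e.g. a model path) stays a string
        node = cfg
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # create nested sections on demand
        node[leaf] = value
    return cfg

cfg = apply_overrides({}, [
    "trainer.total_training_steps=1",
    'trainer.logger=["console"]',
    "actor_rollout_ref.rollout.n=4",
])
```

This is why the CI scripts can keep one shared launch script per recipe and shrink every run to a single training step purely from the command line.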
.github/workflows/e2e_ppo_trainer.yml
ADDED
@@ -0,0 +1,78 @@
name: e2e_ppo_trainer

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified, so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!verl/trainer/fsdp_sft_trainer.py"

      # Megatron
      - "!verl/workers/**/megatron_*.py"

  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!**/*.md"
      - "!docker/**"
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Docs
      - "!docs/**"

      # Megatron
      - "!verl/workers/**/megatron_*.py"
      # Entrypoints
      - ".github/workflows/e2e_ppo_trainer.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      - "tests/special_e2e/ppo_trainer"
      - "verl/trainer/main_ppo.py"
      - "verl/trainer/config/ppo_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read contents.
permissions:
  contents: read

jobs:
  pre_commit_for_ppo:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12"]
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install the current repository
        run: |
          pip install pre-commit hydra-core
          pip3 install --no-deps -e .
      - name: Set ruff --output-format=github
        run: |
          sed -i 's/--output-format=full/--output-format=github/' .pre-commit-config.yaml
          git add .pre-commit-config.yaml
      - uses: pre-commit/action@v3.0.1
        with:
          extra_args: "" # Overriding the default "--all-files"
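The `sed` step in `pre_commit_for_ppo` rewrites `.pre-commit-config.yaml` in place so that ruff emits `--output-format=github`, which makes lint findings show up as inline annotations on the PR. The same one-line substitution can be sketched in Python (the `args:` line below is a hypothetical config snippet, not verl's actual file):

```python
def set_github_output_format(config_text: str) -> str:
    """Mirror the workflow's sed step: swap ruff's output format so
    findings render as GitHub annotations. A sketch of the substitution,
    not part of verl's tooling."""
    return config_text.replace("--output-format=full", "--output-format=github")

before = "args: [--fix, --output-format=full]"
after = set_github_output_format(before)
print(after)  # args: [--fix, --output-format=github]
```

Note the workflow then runs `git add` so pre-commit sees the modified config without the change ever being committed.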
.github/workflows/e2e_ppo_trainer_megatron_sglang.yml
ADDED
@@ -0,0 +1,201 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with a `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`: runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run everything under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to a workflow mentioned in 2.

name: e2e_ppo_trainer_megatron_sglang

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!docker/**"
      # Docs
      - "!**/*.md"
      - "!docs/**"
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
      # Entrypoints
      - "verl/workers/rollout/sglang_rollout/*"
      - ".github/workflows/e2e_ppo_trainer_megatron_sglang.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      - "tests/special_e2e/run_ppo_trainer_megatron.sh"
      - "verl/trainer/main_ppo.py"
      - "verl/trainer/config/ppo_megatron_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read contents.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  e2e_ppo_trainer_megatron-deepseek:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 60 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      ENGINE: sglang
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
        run: |
          ray stop --force
          OPTIM_MEMORY_EFFICIENT=True ENGINE=sglang SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek, async rollout)
        run: |
          ray stop --force
          export VLLM_USE_V1=1
          ray start --head
          ENGINE=sglang MODE=async RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct TOTAL_TRAIN_STEPS=2 bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: Profiling GRPO GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
        run: |
          ray stop --force
          PROFILE_ENABLE=True ENGINE=sglang ADV_ESTIMATOR=grpo USE_DYNAMIC_BSZ=False MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
          if [ -z "$( ls -A '/tmp/ray/session_latest/logs/nsight/' )" ]; then
            echo "[ERROR] not found any profiling files"
            exit 1
          else
            echo "[SUCCESS] profile success"
          fi
      - name: clean up
        run: |
          rm -rf checkpoints

  # Qwen3-0.6B: dense, tie_word_embeddings=True
  e2e_ppo_trainer_megatron-qwen3:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 60 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      ENGINE: sglang
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) testing learning rate scheduler
        run: |
          ray stop --force
          ALL_OFFLOAD=True VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 LR_WARMUP_STEPS=1 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with FP8 rollout
        run: |
          ray stop --force
          export VLLM_USE_V1=1
          ROLLOUT_QUANTIZATION=fp8 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: clean up
        run: |
          rm -rf checkpoints

  cleanup:
    runs-on: ubuntu-latest
    needs:
      [setup, e2e_ppo_trainer_megatron-deepseek, e2e_ppo_trainer_megatron-qwen3]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
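The `paths:` include/exclude lists in these workflows follow GitHub Actions' filter semantics: patterns are evaluated in order, the last matching pattern decides the outcome, and a leading `!` negates. The sketch below illustrates that rule in Python; it uses `fnmatch` as a rough stand-in for GitHub's glob matching (an approximation, since `fnmatch` does not treat `/` or `**` specially), with a trimmed-down pattern list taken from the workflow above.

```python
from fnmatch import fnmatch

def workflow_triggers(path: str, patterns: list[str]) -> bool:
    """Return True if `path` would trigger the workflow.

    Mirrors GitHub's paths-filter rule: patterns are checked in order,
    and the LAST pattern that matches decides the outcome (a "!" prefix
    means "exclude"). This is a sketch, not a reimplementation.
    """
    matched = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        body = pattern[1:] if negated else pattern
        if fnmatch(path, body):
            matched = not negated
    return matched

# Trimmed-down version of the push filter above
PUSH_PATHS = [
    "**/*.py",
    "!verl/trainer/fsdp_sft_trainer.py",  # FSDP entrypoint, excluded
    "!verl/utils/fsdp_utils.py",
]

print(workflow_triggers("verl/trainer/main_ppo.py", PUSH_PATHS))   # True
print(workflow_triggers("verl/utils/fsdp_utils.py", PUSH_PATHS))   # False
```

Because the last match wins, a broad include such as `**/*.py` can be carved down by later negations, which is exactly how these workflows exclude the FSDP-only code paths.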
.github/workflows/e2e_ppo_trainer_megatron_sglang_2.yml
ADDED
@@ -0,0 +1,201 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with a `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`: runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run everything under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to a workflow mentioned in 2.

name: e2e_ppo_trainer_megatron_sglang_2

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!docker/**"
      # Docs
      - "!**/*.md"
      - "!docs/**"
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
      # Entrypoints
      - "verl/workers/rollout/sglang_rollout/*"
      - ".github/workflows/e2e_ppo_trainer_megatron_sglang.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      - "tests/special_e2e/run_ppo_trainer_megatron.sh"
      - "verl/trainer/main_ppo.py"
      - "verl/trainer/config/ppo_megatron_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read contents.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  e2e_ppo_trainer_fsdp_sglang:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 40 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare gsm8k dataset
        run: |
          ray stop --force
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm and save ckpt
        run: |
          ray stop --force
          ENGINE=sglang bash tests/special_e2e/ppo_trainer/run_function_reward.sh

  e2e_ppo_trainer_fsdp-qwen2_5vl-3b:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 60 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      # Geo3k
      - name: Prepare GEO3K dataset
        run: |
          ray stop --force
          python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/hiyouga/geometry3k/
      - name: Running GEO3K VLM E2E training tests on 8 L20 GPUs with rmpad using function rm
        run: |
          ray stop --force
          TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
          MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
          MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
          ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
          ENGINE=sglang ROLLOUT_MODE=async GPU_MEMORY_UTILIZATION=0.6 ACTOR_FSDP_PARAM_OFFLOAD=True \
          ACTOR_FSDP_OPTIMIZER_OFFLOAD=True REF_FSDP_PARAM_OFFLOAD=True \
          bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Running GEO3K VLM E2E with rmpad using torch fused kernel (Qwen2.5-VL)
        run: |
          ray stop --force
          FUSED_KERNELS=True TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
          MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
          MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
          ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
          ENGINE=sglang ROLLOUT_MODE=async GPU_MEMORY_UTILIZATION=0.6 ACTOR_FSDP_PARAM_OFFLOAD=True \
          ACTOR_FSDP_OPTIMIZER_OFFLOAD=True REF_FSDP_PARAM_OFFLOAD=True \
          bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Running GEO3K VLM E2E with rmpad using triton fused kernel (Qwen2.5-VL)
        run: |
          ray stop --force
          FUSED_KERNELS=True FUSED_KERNEL_BACKEND=triton \
          TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
          MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
          MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
          ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
          ENGINE=sglang ROLLOUT_MODE=async GPU_MEMORY_UTILIZATION=0.6 ACTOR_FSDP_PARAM_OFFLOAD=True \
          ACTOR_FSDP_OPTIMIZER_OFFLOAD=True REF_FSDP_PARAM_OFFLOAD=True \
          bash tests/special_e2e/ppo_trainer/run_function_reward.sh

  cleanup:
    runs-on: ubuntu-latest
    needs:
      [setup, e2e_ppo_trainer_fsdp-qwen2_5vl-3b, e2e_ppo_trainer_fsdp_sglang]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
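Each GPU job in these workflows selects its runner with `runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]`. In GitHub Actions expressions, `||` yields the left operand when it is truthy and the right operand otherwise, so when the dynamic-runner `setup` job produces no label, the job falls back to the static `L20x8` pool. Python's `or` has the same short-circuit semantics, which makes for a direct sketch (the label `runner-abc` below is just an illustrative value, not a real runner):

```python
def resolve_runner(runner_label: str) -> str:
    # Mirrors the workflow expression
    #   ${{ needs.setup.outputs.runner-label || 'L20x8' }}
    # "||" (like Python's "or") yields the left operand when it is
    # truthy, otherwise the fallback on the right.
    return runner_label or "L20x8"

print(resolve_runner(""))            # L20x8  (setup produced no label)
print(resolve_runner("runner-abc"))  # runner-abc
```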
.github/workflows/e2e_ppo_trainer_megatron_vllm.yml
ADDED
@@ -0,0 +1,212 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with a `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`: runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run everything under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to a workflow mentioned in 2.

name: e2e_ppo_trainer_megatron_vllm

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!docker/**"
      # Docs
      - "!**/*.md"
      - "!docs/**"
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
      # Entrypoints
      - ".github/workflows/e2e_ppo_trainer_megatron_vllm.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      - "tests/special_e2e/run_ppo_trainer_megatron.sh"
      - "verl/trainer/main_ppo.py"
      - "verl/trainer/config/ppo_megatron_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read contents.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  # deepseek-ai/deepseek-coder-1.3b-instruct: dense, tie_word_embeddings=False
  e2e_ppo_trainer_megatron-deepseek:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 60 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps --force-reinstall .
          pip3 install mbridge
          pip3 install math-verify
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      # Full training save&load
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron, use mbridge e2e to pre-load and save (DeepSeek)
        run: |
          ray stop --force
          ALL_OFFLOAD=True SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 USE_MBRIDGE=True USE_DIST_CKPT=False \
          bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron, use mbridge e2e to resume from the saved checkpoint (DeepSeek)
        run: |
          ray stop --force
          RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct TOTAL_TRAIN_STEPS=2 SAVE_FREQ=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 USE_MBRIDGE=True USE_DIST_CKPT=False \
          bash tests/special_e2e/run_ppo_trainer_megatron.sh
      # LoRA training save&load
      - name: clean up and install Megatron-Bridge
        run: |
          rm -rf checkpoints
          pip3 install git+https://github.com/NVIDIA-NeMo/Megatron-Bridge.git@83a7c11 --no-deps --no-build-isolation
          pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@5455f0a --no-deps --no-build-isolation
          pip3 install "nvidia-modelopt[torch]>=0.37.0" transformers==4.57.1
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron, use Megatron-Bridge LoRA e2e to pre-load and save (DeepSeek)
        run: |
          ray stop --force
          ALL_OFFLOAD=True SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct COMMON_PP=4 LORA_RANK=8 COMMON_VPP=null COMMON_CP=1 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False USE_DIST_CKPT=False \
          bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron, use Megatron-Bridge LoRA e2e to resume from the saved checkpoint (DeepSeek)
        run: |
          ray stop --force
          RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct TOTAL_TRAIN_STEPS=2 SAVE_FREQ=1 COMMON_PP=4 LORA_RANK=8 COMMON_VPP=null COMMON_CP=1 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False USE_DIST_CKPT=False \
          bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: clean up
        run: |
          rm -rf checkpoints

  # Qwen3-0.6B: dense, tie_word_embeddings=True
  e2e_ppo_trainer_megatron-qwen3:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 60 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install math-verify
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) testing learning rate scheduler
        run: |
          ray stop --force
          ALL_OFFLOAD=True VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 LR_WARMUP_STEPS=1 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with FP8 rollout
        run: |
          ray stop --force
          export VLLM_USE_V1=1
          ROLLOUT_QUANTIZATION=fp8 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: clean up
        run: |
          rm -rf checkpoints

  cleanup:
    runs-on: ubuntu-latest
    needs:
      [setup, e2e_ppo_trainer_megatron-deepseek, e2e_ppo_trainer_megatron-qwen3]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/e2e_ppo_trainer_megatron_vllm_2.yml
ADDED
@@ -0,0 +1,318 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#   - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#   - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
#   - Since CPU/GPU unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when:
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to a workflow mentioned in 2.

name: e2e_ppo_trainer_megatron_vllm_2

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified, so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!verl/trainer/fsdp_sft_trainer.py"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!docker/**"
      # Docs
      - "!**/*.md"
      - "!docs/**"
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
      # Entrypoints
      - ".github/workflows/e2e_ppo_trainer_megatron_vllm_2.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      - "tests/special_e2e/run_ppo_trainer_megatron.sh"
      - "verl/trainer/main_ppo.py"
      - "verl/trainer/config/ppo_megatron_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read content.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  e2e_ppo_trainer_megatron-moe-expert-parallel:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 60 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps --force-reinstall .
          pip3 install git+https://github.com/NVIDIA-NeMo/Megatron-Bridge.git@83a7c11 --no-deps --no-build-isolation
          pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@5455f0a --no-deps --no-build-isolation
          pip3 install "nvidia-modelopt[torch]>=0.37.0" transformers==4.57.1
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron-Bridge (Qwen3-30B-A3B-Instruct-2507)
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen2moe_minimal.json \
            PPO_MAX_TOKEN_LEN=1024 FWD_MAX_TOKEN_LEN=1024 \
            MAX_PROMPT_LENGTH=512 MAX_RESPONSE_LENGTH=512 \
            MODEL_ID=Qwen/Qwen3-30B-A3B-Instruct-2507 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False \
            COMMON_PP=2 COMMON_VPP=null COMMON_CP=1 COMMON_TP=4 COMMON_EP=4 COMMON_ETP=1 INFER_TP=8 \
            USE_DIST_CKPT=True ALL_OFFLOAD=True SKIP_SAVE_HF_MODEL=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism with FP8 rollout on 8 L20 GPUs with Megatron-Bridge (Qwen3-30B-A3B-Instruct-2507)
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen2moe_minimal.json \
            PPO_MAX_TOKEN_LEN=1024 FWD_MAX_TOKEN_LEN=1024 \
            MAX_PROMPT_LENGTH=512 MAX_RESPONSE_LENGTH=512 \
            MODEL_ID=Qwen/Qwen3-30B-A3B-Instruct-2507 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False \
            COMMON_PP=2 COMMON_VPP=null COMMON_CP=1 COMMON_TP=4 COMMON_EP=4 COMMON_ETP=1 INFER_TP=2 \
            USE_DIST_CKPT=True ALL_OFFLOAD=True SKIP_SAVE_HF_MODEL=1 ROLLOUT_QUANTIZATION=fp8 bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: clean up
        run: |
          rm -rf checkpoints
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron-Bridge LoRA (Qwen3-30B-A3B-Instruct-2507)
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen2moe_minimal.json \
            PPO_MAX_TOKEN_LEN=1024 FWD_MAX_TOKEN_LEN=1024 \
            MAX_PROMPT_LENGTH=512 MAX_RESPONSE_LENGTH=512 LORA_RANK=8 CRITIC_LORA_RANK=8 \
            MODEL_ID=Qwen/Qwen3-30B-A3B-Instruct-2507 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False \
            COMMON_PP=2 COMMON_VPP=null COMMON_CP=1 COMMON_TP=4 COMMON_EP=2 COMMON_ETP=1 INFER_TP=8 \
            USE_DIST_CKPT=False LORA_MERGE=True ALL_OFFLOAD=True SKIP_SAVE_HF_MODEL=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh
      - name: clean up
        run: |
          rm -rf checkpoints

  e2e_ppo_trainer_fsdp_vllm:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 60 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare GSM8K dataset
        run: |
          ray stop --force
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      # Function RM
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm with validation and saving (FSDP_SIZE=8)
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-fsdp-size8" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm after resuming
        run: |
          ray stop --force
          RESUME_MODE=auto VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-fsdp-size8" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Test merging FSDP checkpoints (Qwen Actor)
        run: |
          exp_name="qwen2.5-0.5b-function-reward-minimal-fsdp-size8"
          python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm with validation and saving (DDP_SIZE=2, FSDP_SIZE=4)
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True FSDP_SIZE=4 USE_KL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-ddp-size2-fsdp-size4" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Test merging DDP+FSDP checkpoints (Qwen Actor)
        run: |
          exp_name="qwen2.5-0.5b-function-reward-minimal-ddp-size2-fsdp-size4"
          python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm with validation and saving (FSDP2)
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-fsdp2-size8" STRATEGY=fsdp2 bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Test merging FSDP2 checkpoints (Qwen Actor)
        run: |
          exp_name="qwen2.5-0.5b-function-reward-minimal-fsdp2-size8"
          python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
      - name: Running GSM8K E2E without rmpad using function rm
        run: |
          ray stop --force
          RM_PAD=False bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm (GRPO)
        run: |
          ray stop --force
          CUSTOM_REWARD_FN=True ADV_ESTIMATOR=grpo USE_KL=True bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      # - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm (ReMax)
      #   run: |
      #     ray stop --force
      #     ADV_ESTIMATOR=remax USE_KL=True bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      # LoRA tests
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with grpo lora using function rm with use_shm
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with grpo lora using function rm with use_shm and layered_summon
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors LAYERED_SUMMON=True TOTAL_TRAIN_STEPS=1 SAVE_FREQ=1 FSDP_SIZE=4 VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Test GRPO LoRA checkpoints merging function
        run: |
          export EXP_NAME="qwen2.5-0.5b-function-reward-minimal"
          ls checkpoints/verl-test/${EXP_NAME}/global_step_1/actor
          cat checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/huggingface/config.json
          python3 -m verl.model_merger merge --backend fsdp --local_dir checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/ --target_dir checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/huggingface
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with grpo lora using function rm with use_shm and layered_summon with fsdp2
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors LAYERED_SUMMON=True STRATEGY=fsdp2 bash tests/special_e2e/ppo_trainer/run_function_reward.sh

  e2e_ppo_trainer_fsdp-qwen2_5vl-3b:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 40 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      # Geo3k
      - name: Prepare GEO3K dataset
        run: |
          python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/hiyouga/geometry3k/
      - name: Running GEO3K VLM GRPO E2E training tests on 8 L20 GPUs with rmpad using function rm
        run: |
          ray stop --force
          TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
            MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
            MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
            ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
            SP_SIZE=2 \
            bash tests/special_e2e/ppo_trainer/run_function_reward.sh

      - name: Running GEO3K VLM PPO E2E training tests on 8 L20 GPUs with rmpad using function rm
        run: |
          ray stop --force
          TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
            MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
            MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
            ADV_ESTIMATOR=gae RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
            SP_SIZE=2 \
            bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Running GEO3K VLM GRPO E2E lora training tests on 8 L20 GPUs with rmpad using function rm
        run: |
          ray stop --force
          TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
            MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
            MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
            ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
            SP_SIZE=2 \
            LORA_RANK=32 LORA_EXCLUDE=".*visual.*" \
            bash tests/special_e2e/ppo_trainer/run_function_reward.sh

  cleanup:
    runs-on: ubuntu-latest
    needs:
      [
        setup,
        e2e_ppo_trainer_megatron-moe-expert-parallel,
        e2e_ppo_trainer_fsdp-qwen2_5vl-3b,
        e2e_ppo_trainer_fsdp_vllm,
      ]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/e2e_ppo_trainer_megatron_vllm_2_ascend.yml
ADDED
@@ -0,0 +1,233 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#   - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#   - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
#   - Since CPU/GPU unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when:
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to a workflow mentioned in 2.

name: e2e_ppo_trainer_megatron_vllm_2_ascend

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified, so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!verl/trainer/fsdp_sft_trainer.py"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!docker/**"
      # Docs
      - "!**/*.md"
      - "!docs/**"
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
      - "!verl/utils/fsdp_utils.py"
      - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
      - "!verl/model_merger/fsdp_model_merger.py"
      # Entrypoints
      - ".github/workflows/e2e_ppo_trainer_megatron_vllm_2_ascend.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      - "tests/special_e2e/run_ppo_trainer_megatron.sh"
      - "verl/trainer/main_ppo.py"
      - "verl/trainer/config/ppo_megatron_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read content.
permissions:
  contents: read

jobs:
  e2e_ppo_trainer_fsdp_vllm_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 90 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check npu and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout verl-project/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Prepare GSM8K dataset
        run: |
          python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
      # Function RM
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm with validation and saving (DDP_SIZE=2, FSDP_SIZE=4)
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True FSDP_SIZE=4 USE_KL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-ddp-size2-fsdp-size4" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Test merging DDP+FSDP checkpoints (Qwen Actor)
        run: |
          exp_name="qwen2.5-0.5b-function-reward-minimal-ddp-size2-fsdp-size4"
          python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm with validation and saving (FSDP2)
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-fsdp2-size8" STRATEGY=fsdp2 bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Test merging FSDP2 checkpoints (Qwen Actor)
        run: |
          exp_name="qwen2.5-0.5b-function-reward-minimal-fsdp2-size8"
          python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
      - name: Running GSM8K E2E without rmpad using function rm
        run: |
          ray stop --force
          RM_PAD=False bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm (GRPO)
        run: |
          ray stop --force
          CUSTOM_REWARD_FN=True ADV_ESTIMATOR=grpo USE_KL=True bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with grpo lora using function rm with use_shm and layered_summon
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors LAYERED_SUMMON=True TOTAL_TRAIN_STEPS=1 SAVE_FREQ=1 FSDP_SIZE=4 VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Test GRPO LoRA checkpoints merging function
        run: |
          export EXP_NAME="qwen2.5-0.5b-function-reward-minimal"
          ls checkpoints/verl-test/${EXP_NAME}/global_step_1/actor
          cat checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/huggingface/config.json
          python3 -m verl.model_merger merge --backend fsdp --local_dir checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/ --target_dir checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/huggingface
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with grpo lora using function rm with use_shm and layered_summon with fsdp2
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors LAYERED_SUMMON=True STRATEGY=fsdp2 bash tests/special_e2e/ppo_trainer/run_function_reward.sh

  e2e_ppo_trainer_fsdp-qwen2_5vl-3b_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 60 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check npu and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout verl-project/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .
          pip install trl==0.26.0
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      # Geo3k
      - name: Prepare GEO3K dataset
        run: |
          python examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/.cache/datasets/hiyouga/geometry3k
      - name: Running GEO3K VLM GRPO E2E training tests on 8 L20 GPUs with rmpad using function rm
        run: |
          ray stop --force
          TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
            MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
            MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
            ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
            SP_SIZE=2 \
            bash tests/special_e2e/ppo_trainer/run_function_reward.sh
      - name: Running GEO3K VLM PPO E2E training tests on 8 L20 GPUs with rmpad using function rm
        run: |
          ray stop --force
          TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
            MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
|
| 220 |
+
MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
|
| 221 |
+
ADV_ESTIMATOR=gae RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
|
| 222 |
+
SP_SIZE=2 \
|
| 223 |
+
bash tests/special_e2e/ppo_trainer/run_function_reward.sh
|
| 224 |
+
- name: Running GEO3K VLM GRPO E2E lora training tests on 8 L20 GPUs with rmpad using function rm
|
| 225 |
+
run: |
|
| 226 |
+
ray stop --force
|
| 227 |
+
TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
|
| 228 |
+
MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
|
| 229 |
+
MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
|
| 230 |
+
ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
|
| 231 |
+
SP_SIZE=2 \
|
| 232 |
+
LORA_RANK=32 LORA_EXCLUDE=".*visual.*" \
|
| 233 |
+
bash tests/special_e2e/ppo_trainer/run_function_reward.sh
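Every step above drives the same script, `tests/special_e2e/ppo_trainer/run_function_reward.sh`, purely through environment variables. A minimal sketch (not the actual script, whose internals are not shown in this diff) of the override-with-default pattern such scripts typically use:

```shell
# Hypothetical sketch of env-var-driven test configuration: each knob falls
# back to a default unless the caller exports it, e.g.
#   ADV_ESTIMATOR=grpo LORA_RANK=32 bash run_function_reward.sh
ADV_ESTIMATOR="${ADV_ESTIMATOR:-gae}"   # advantage estimator, default gae
LORA_RANK="${LORA_RANK:-0}"             # 0 means LoRA disabled
USE_KL="${USE_KL:-False}"
echo "adv_estimator=${ADV_ESTIMATOR} lora_rank=${LORA_RANK} use_kl=${USE_KL}"
```

With no overrides this prints the defaults; prefixing variables on the command line, as the CI steps do, switches the run configuration without editing the script.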
.github/workflows/e2e_ppo_trainer_veomni_vllm.yml
ADDED
@@ -0,0 +1,153 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to the workflows mentioned in 2.

name: e2e_ppo_trainer_veomni_vllm

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Megatron
      - "!verl/workers/**/megatron_*.py"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!docker/**"
      # Docs
      - "!**/*.md"
      - "!docs/**"
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Megatron
      - "!verl/workers/**/megatron_*.py"
      # Entrypoints
      - ".github/workflows/e2e_ppo_trainer_veomni_vllm.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      - "tests/special_e2e/run_ppo_trainer_veomni.sh"
      - "verl/trainer/main_ppo.py"
      - "verl/trainer/config/ppo_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions as read-only.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  e2e_ppo_trainer_veomni_vllm:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 60 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.4
      - name: Prepare GSM8K dataset
        run: |
          ray stop --force
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Prepare GEO3K dataset
        run: |
          ray stop --force
          python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/hiyouga/geometry3k/
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with veomni engine (FSDP_SIZE=4, USP=2)
        run: |
          ray stop --force
          FSDP_SIZE=4 SP_SIZE=2 bash tests/special_e2e/run_ppo_trainer_veomni.sh
      - name: Running GEO3K E2E training tests on 8 L20 GPUs with veomni engine (FSDP_SIZE=8, USP=1)
        run: |
          ray stop --force
          MODEL_ID=Qwen/Qwen3-VL-2B-Instruct TRAIN_FILES=${HOME}/data/geo3k/train.parquet VAL_FILES=${HOME}/data/gsm8k/test.parquet FSDP_SIZE=8 SP_SIZE=1 bash tests/special_e2e/run_ppo_trainer_veomni.sh

  cleanup:
    runs-on: ubuntu-latest
    needs:
      [
        setup,
        e2e_ppo_trainer_veomni_vllm,
      ]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
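The layout comments in these workflow files say `cpu_unit_tests.yml` collects every file matching `tests/**/test_*_on_cpu.py`, while the GPU suite takes the rest. A self-contained sketch of that selection rule (the directory and file names below are hypothetical):

```shell
# Build a throwaway tree and apply the on_cpu naming convention to it.
tmp="$(mktemp -d)"
mkdir -p "$tmp/tests/trainer"
touch "$tmp/tests/trainer/test_config_on_cpu.py"   # matches: picked up by the CPU suite
touch "$tmp/tests/trainer/test_ppo.py"             # no suffix: left to the GPU suite
cpu_tests="$(find "$tmp/tests" -type f -name 'test_*_on_cpu.py')"
echo "$cpu_tests"
rm -rf "$tmp"
```

Only the `_on_cpu.py` file is selected; this is why a test added without that suffix lands in the GPU unit-test run by default and must be excluded there manually if it is covered elsewhere.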
.github/workflows/e2e_sft_llm.yml
ADDED
@@ -0,0 +1,153 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to the workflows mentioned in 2.

name: e2e_sft_llm

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"

      # Megatron
      - "!verl/workers/**/megatron_*.py"
      # Entrypoints
      - ".github/workflows/e2e_sft_llm.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/special_e2e/sft"
      - "verl/trainer/fsdp_sft_trainer.py"
      - "verl/trainer/config/sft_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions as read-only.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"
  e2e_sft_llm:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install peft
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.4
      - name: Prepare gsm8k dataset
        run: |
          ray stop --force
          python3 examples/data_preprocess/gsm8k_multiturn_sft.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm
        run: |
          ray stop --force
          bash tests/special_e2e/sft/run_sft.sh
      - name: Running GSM8K E2E training tests on 8 L20 GPUs w/o rmpad using function rm
        run: |
          ray stop --force
          RM_PAD=False bash tests/special_e2e/sft/run_sft.sh
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with sequence parallelism
        run: |
          ray stop --force
          SP_SIZE=2 bash tests/special_e2e/sft/run_sft.sh
      - name: Running GSM8K E2E training tests on 8 L20 GPUs with sequence parallelism and liger
        run: |
          ray stop --force
          SP_SIZE=2 LIGER=True bash tests/special_e2e/sft/run_sft.sh
      - name: Running GSM8K E2E training tests with LoRA
        run: |
          ray stop --force
          LORA_RANK=32 bash tests/special_e2e/sft/run_sft.sh
      - name: Run GSM8K E2E training and resume tests resuming from the checkpoint manager
        run: |
          ray stop --force
          LORA_RANK=32 RESUME_MODE=auto TOTAL_TRAIN_STEP=2 bash tests/special_e2e/sft/run_sft.sh
      # TODO: multiturn
      - name: Running GSM8K E2E training tests with multiturn and various configs and comparing results
        run: |
          bash tests/special_e2e/sft/test_sft_engine_all.sh

  cleanup:
    runs-on: ubuntu-latest
    needs: [setup, e2e_sft_llm]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/e2e_sft_llm_ascend.yml
ADDED
@@ -0,0 +1,160 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to the workflows mentioned in 2.

name: e2e_sft_llm_ascend

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"

      # Megatron
      - "!verl/workers/**/megatron_*.py"
      # Entrypoints
      - ".github/workflows/e2e_sft_llm_ascend.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/special_e2e/sft"
      - "verl/trainer/fsdp_sft_trainer.py"
      - "verl/trainer/config/sft_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions as read-only.
permissions:
  contents: read

jobs:
  e2e_sft_llm_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 90 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check NPU and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout verl-project/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install -e .
          pip install git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.4
          pip install pandas==2.3.3
          pip uninstall -y mbridge
          pip install git+https://github.com/ISEEKYAN/mbridge.git@89eb10
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Prepare gsm8k dataset
        run: |
          python3 examples/data_preprocess/gsm8k_multiturn_sft.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
      - name: Running GSM8K E2E training tests on 8 NPUs with rmpad using function rm
        run: |
          ray stop --force
          bash tests/special_e2e/sft/run_sft.sh
      - name: Running GSM8K E2E training tests on 8 NPUs w/o rmpad using function rm
        run: |
          ray stop --force
          RM_PAD=False bash tests/special_e2e/sft/run_sft.sh
      - name: Running GSM8K E2E training tests on 8 NPUs with sequence parallelism
        run: |
          ray stop --force
          SP_SIZE=2 bash tests/special_e2e/sft/run_sft.sh
      - name: Running GSM8K E2E training tests with LoRA
        run: |
          ray stop --force
          LORA_RANK=32 bash tests/special_e2e/sft/run_sft.sh
      - name: Run GSM8K E2E training and resume tests resuming from the checkpoint manager
        run: |
          ray stop --force
          LORA_RANK=32 RESUME_MODE=auto TOTAL_TRAIN_STEP=2 bash tests/special_e2e/sft/run_sft.sh
      - name: Running GSM8K E2E training tests with multiturn and various configs and comparing results
        run: |
          ray stop --force
          rm -rf ~/verl/test/log
          mkdir -p ~/verl/test/log
          export VERL_FILE_LOGGER_ROOT=~/verl/test/log
          # test with a single device as the golden reference
          echo "run with a single device as golden"
          BACKEND=fsdp SP_SIZE=1 FSDP_SIZE=1 NUM_GPUS=1 FSDP_STRATEGY=fsdp VERL_FILE_LOGGER_PATH=~/verl/test/log/golden.jsonl bash tests/special_e2e/sft/run_sft_engine.sh
          # test with fsdp 1
          echo "run with sp2 fsdp_size2 num_gpus8 fsdp_strategy fsdp pad_mode no_padding"
          BACKEND=fsdp SP_SIZE=2 FSDP_SIZE=2 NUM_GPUS=8 FSDP_STRATEGY=fsdp PAD_MODE=no_padding bash tests/special_e2e/sft/run_sft_engine.sh
          # test with fsdp 1, pad_mode no_padding, use_remove_padding disabled
          echo "run with sp1 fsdp_size-1 num_gpus8 fsdp_strategy fsdp pad_mode no_padding use_remove_padding False"
          BACKEND=fsdp SP_SIZE=1 FSDP_SIZE=-1 NUM_GPUS=8 FSDP_STRATEGY=fsdp PAD_MODE=no_padding USE_REMOVE_PADDING=False bash tests/special_e2e/sft/run_sft_engine.sh
          # test with fsdp 2
          echo "run with sp2 fsdp_size2 num_gpus8 fsdp_strategy fsdp2"
          BACKEND=fsdp SP_SIZE=2 FSDP_SIZE=2 NUM_GPUS=8 FSDP_STRATEGY=fsdp2 bash tests/special_e2e/sft/run_sft_engine.sh
          # test with veomni
          echo "run with sp2 fsdp_size4 num_gpus8 fsdp_strategy fsdp2"
          BACKEND=veomni SP_SIZE=2 FSDP_SIZE=4 NUM_GPUS=8 FSDP_STRATEGY=fsdp2 bash tests/special_e2e/sft/run_sft_engine.sh
          # test with megatron
          echo "run with tp2 pp2 cp2 num_gpus8"
          BACKEND=megatron TP_SIZE=2 PP_SIZE=2 VPP_SIZE=NULL CP_SIZE=2 NUM_GPUS=8 bash tests/special_e2e/sft/run_sft_engine.sh
          # test with cp in ray
          echo "run with tp2 pp2 cp2 num_gpus8 mode=ray"
          BACKEND=megatron TP_SIZE=2 PP_SIZE=2 VPP_SIZE=NULL CP_SIZE=2 NUM_GPUS=8 mode=ray bash tests/special_e2e/sft/run_sft_engine.sh
          rm -rf ~/verl/test/log
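The comparison step above logs each configuration's metrics as JSONL via `VERL_FILE_LOGGER_PATH` and checks them against the single-device golden run. A minimal sketch of that kind of golden-file check (the log contents below are fabricated for illustration, not taken from an actual run):

```shell
tmp="$(mktemp -d)"
# Fabricated metric logs: one JSON object per training step.
printf '{"step": 1, "loss": 0.52}\n' > "$tmp/golden.jsonl"
printf '{"step": 1, "loss": 0.52}\n' > "$tmp/run.jsonl"
# The two logs agree line for line, so the diff is silent and we report a match.
if diff -q "$tmp/golden.jsonl" "$tmp/run.jsonl" >/dev/null; then
  result="match"
else
  result="mismatch"
fi
echo "$result"
rm -rf "$tmp"
```

A real comparison would likely tolerate small numeric differences between parallelism strategies rather than require byte-identical logs; this sketch only shows the golden-reference shape of the test.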
.github/workflows/e2e_sft_vlm.yml
ADDED
@@ -0,0 +1,128 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to the workflows mentioned in 2.

name: e2e_sft_vlm

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"

      # Megatron
      - "!verl/workers/**/megatron_*.py"
      # Entrypoints
      - ".github/workflows/e2e_sft_vlm.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/special_e2e/sft"
      - "verl/trainer/fsdp_sft_trainer.py"
      - "verl/trainer/config/sft_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions as read-only.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"
  e2e_sft_vlm:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install peft
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.4
      - name: Prepare pokemon-gpt4o-captions dataset
|
| 111 |
+
run: |
|
| 112 |
+
ray stop --force
|
| 113 |
+
python3 examples/data_preprocess/pokemon.py --local_dataset_path ${HOME}/models/hf_data/pokemon-gpt4o-captions
|
| 114 |
+
- name: Running Pokemon E2E training tests with multiturn and various configs and compare results
|
| 115 |
+
run: |
|
| 116 |
+
MODEL_ID=Qwen/Qwen3-VL-2B-Instruct DATASET_DIR=~/data/pokemon-gpt4o-captions VPP_SIZE=null bash tests/special_e2e/sft/test_sft_engine_all.sh
|
| 117 |
+
|
| 118 |
+
cleanup:
|
| 119 |
+
runs-on: ubuntu-latest
|
| 120 |
+
needs: [setup, e2e_sft_vlm]
|
| 121 |
+
if: always()
|
| 122 |
+
steps:
|
| 123 |
+
- id: destroy-runner
|
| 124 |
+
uses: volcengine/vemlp-github-runner@v1
|
| 125 |
+
with:
|
| 126 |
+
mode: "destroy"
|
| 127 |
+
faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
|
| 128 |
+
mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
|
.github/workflows/gpu_unit_tests.yml
ADDED
@@ -0,0 +1,137 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix.
#    - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to a workflow mentioned in 2.

name: GPU unit tests

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.4.x
    paths:
      - "**/*.py"
      - .github/workflows/gpu_unit_tests.yml
  pull_request:
    branches:
      - main
      - v0.4.x
    paths:
      # The order that you define paths patterns matters:
      # A matching negative pattern (prefixed with !) after a positive match will exclude the path.
      # A matching positive pattern after a negative match will include the path again.
      - "**/*.py"
      # Other entrypoints
      - "!examples/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Entrypoints
      - .github/workflows/gpu_unit_tests.yml
      - "tests/**test_*.py"
      # Ignore CPU tests
      - "!tests/*_on_cpu.py"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions: read content only.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  gpu_unit_tests:
    if: github.repository_owner == 'verl-project'
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 60 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1"
      HF_HUB_ENABLE_HF_TRANSFER: 1
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install hf_transfer
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install cupy-cuda12x==13.6.0 pytest-asyncio
          pip3 install --ignore-installed blinker
          pip3 install --ignore-installed mlflow "numpy<2.0"
      - name: Run all GPU unit tests
        run: |
          pytest -s -x --ignore-glob="*on_npu.py" --ignore-glob="*test_special_*.py" --ignore-glob='*on_cpu.py' --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob='tests/special*' --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" --ignore-glob="*test_shared_memory*" --ignore-glob="tests/workers/rollout/rollout_trtllm" --ignore-glob="*test_bucketed_weight_transfer*" tests/
      - name: Test LinearCrossEntropyTP correctness, computation time, and memory consumption
        run: |
          LOW_MEMORY=True torchrun --standalone --nnodes=1 --nproc-per-node=8 tests/utils/test_special_linear_cross_entropy_tp.py
      - name: Test FSDP2 actor functionality
        run: |
          torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/actor/test_special_dp_actor.py
      - name: Test FSDP2 critic functionality
        run: |
          torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/critic/test_special_dp_critic.py

  cleanup:
    runs-on: ubuntu-latest
    needs: [setup, gpu_unit_tests]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
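The CPU/GPU unit-test split described in the header comments above hinges purely on the `on_cpu.py` file-name suffix. A minimal shell sketch of that selection rule (the test file names below are made up for illustration):

```shell
# Build a throwaway tests/ tree to demonstrate the naming convention.
tmp=$(mktemp -d)
mkdir -p "$tmp/tests/trainer"
touch "$tmp/tests/trainer/test_config_on_cpu.py"   # matched by cpu_unit_tests.yml
touch "$tmp/tests/trainer/test_engine.py"          # matched by gpu_unit_tests.yml

echo "CPU suite:"
find "$tmp/tests" -name 'test_*_on_cpu.py'

echo "GPU suite:"
find "$tmp/tests" -name 'test_*.py' ! -name '*_on_cpu.py'

rm -rf "$tmp"
```

A new test file that lacks the suffix lands in the GPU suite by default, which is why the header comments ask for manual exclusions whenever a new workflow or dedicated test is added.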
.github/workflows/model.yml
ADDED
@@ -0,0 +1,184 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix.
#    - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to a workflow mentioned in 2.

name: model

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "verl/**/*.py"
      # Entrypoints
      - ".github/workflows/model.yml"
      - "tests/special_distributed/test_fsdp_ckpt.py"
      - "tests/special_distributed/test_tensor_dict.py"
      - "tests/models/**"
      - "tests/special_distributed/run_all.sh"

# Declare permissions: read content only.
permissions:
  contents: read

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  model_rmpad:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 20 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository and upgrade to the latest transformers (4.54.0)/flash_attn; transformers 4.55.0 has strange behavior with model backward
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install --upgrade "transformers<5.0.0"
      - name: Running rmpad model tests on 8 L20 GPUs + flash_attn 2.5.8
        run: |
          pytest -s tests/models/test_transformer.py
      - name: Running rmpad model tests on 8 L20 GPUs + latest flash_attn
        run: |
          pytest -s tests/models/test_transformer.py
      - name: Running FSDP rmpad model tests on 8 L20 GPUs + latest flash_attn
        run: |
          STRATEGY=fsdp torchrun --nproc_per_node=8 tests/special_distributed/test_fsdp_ckpt.py
      - name: Running transformers ulysses tests on 8 L20 GPUs + latest transformers
        run: |
          torchrun --nproc_per_node=8 -m pytest tests/models/test_transformers_ulysses.py
      - name: Running transformers ulysses tests on 8 L20 GPUs + transformers 4.54.1
        run: |
          pip3 install transformers==4.54.1
          torchrun --nproc_per_node=8 -m pytest tests/models/test_transformers_ulysses.py
      - name: Run distributed test
        run: |
          bash tests/special_distributed/run_all.sh

  # TODO: Move this back to model_rmpad once FSDP2 is stable.
  # NOTE: List as an independent job to make rerun easier.
  model_rmpad_fsdp2_unstable:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 20 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository and upgrade to latest transformers/flash_attn
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Running FSDP2 rmpad model tests on 8 L20 GPUs + latest flash_attn
        run: |
          STRATEGY=fsdp2 torchrun --nproc_per_node=8 tests/special_distributed/test_fsdp_ckpt.py

  model_engine:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 20 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Download model config files
        run: |
          hf download Qwen/Qwen2.5-0.5B-Instruct --local-dir $HOME/models/Qwen/Qwen2.5-0.5B-Instruct

      - name: Running mcore engine tests on 8 L20 GPUs
        run: |
          ray stop --force
          pytest -s -x tests/models/test_engine.py

  cleanup:
    runs-on: ubuntu-latest
    needs: [setup, model_rmpad, model_rmpad_fsdp2_unstable, model_engine]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/model_ascend.yml
ADDED
@@ -0,0 +1,137 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix.
#    - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to a workflow mentioned in 2.

name: model_ascend

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "verl/**/*.py"
      # Entrypoints
      - ".github/workflows/model_ascend.yml"
      - "tests/special_distributed/test_fsdp_ckpt.py"
      - "tests/special_distributed/test_tensor_dict.py"
      - "tests/models/**"
      - "tests/special_distributed/run_all.sh"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

permissions:
  contents: read

jobs:
  model_rmpad_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 60 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check npu and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout verl-project/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .[test]
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Running rmpad model tests on 8 NPUs
        run: |
          pytest -s tests/models/test_transformer.py
      - name: Running FSDP rmpad model tests on 8 NPUs
        run: |
          STRATEGY=fsdp torchrun --nproc_per_node=8 tests/special_distributed/test_fsdp_ckpt.py
      - name: Running transformers ulysses tests on 8 NPUs
        run: |
          torchrun --nproc_per_node=8 -m pytest tests/models/test_transformers_ulysses.py
      - name: Run distributed test
        run: |
          bash tests/special_distributed/run_all.sh

  # TODO: Move this back to model_rmpad once FSDP2 is stable.
  # NOTE: List as an independent job to make rerun easier.
  model_rmpad_fsdp2_unstable_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 60
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .[test]
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Running FSDP2 rmpad model tests on 8 NPUs
        run: |
          STRATEGY=fsdp2 torchrun --nproc_per_node=8 tests/special_distributed/test_fsdp_ckpt.py
.github/workflows/nightly_ascend.yml
ADDED
@@ -0,0 +1,174 @@
| 1 |
+
# # Tests layout
|
| 2 |
+
|
| 3 |
+
# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
|
| 4 |
+
# - `tests/trainer` for testing functionality related to `verl/trainer`
|
| 5 |
+
# - `tests/models` for testing functionality related to `verl/models`
|
| 6 |
+
# - ...
|
| 7 |
+
|
| 8 |
+
# There are a few folders with `special_` prefix, created for special purposes:
|
| 9 |
+
# - `special_distributed`: unit tests that must run with multiple GPUs
|
| 10 |
+
# - `special_e2e`: end-to-end tests with training/generation scripts
|
| 11 |
+
# - `special_npu`: tests for NPUs
|
| 12 |
+
# - `special_sanity`: a suite of quick sanity tests
|
| 13 |
+
# - `special_standalone`: a set of test that are designed to run in dedicated environments
|
| 14 |
+
|
| 15 |
+
# Accelerators for tests
|
| 16 |
+
# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
|
| 17 |
+
# - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
|
| 18 |
+
|
| 19 |
+
# # Workflow layout
|
| 20 |
+
|
| 21 |
+
# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
|
| 22 |
+
# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
|
| 23 |
+
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
|
| 24 |
+
# 3. End-to-end tests: `e2e_*.yml`
|
| 25 |
+
# 4. Unit tests
|
| 26 |
+
# - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
|
| 27 |
+
# - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
|
| 28 |
+
# - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
|
| 29 |
+
# - new workflow yaml is added to `.github/workflows`
|
| 30 |
+
# - new tests are added to workflow mentioned in 2.
|
| 31 |
+
|
| 32 |
+
name: nightly_ci_ascend
|
| 33 |
+
|
| 34 |
+
on:
|
| 35 |
+
# Trigger the workflow on push or pull request,
|
| 36 |
+
# but only for the main branch
|
| 37 |
+
# For push, for now only anti-patterns are specified so it is more conservative
|
| 38 |
+
# and achieves higher coverage.
|
| 39 |
+
schedule:
|
| 40 |
+
- cron: "0 17 * * *"
|
| 41 |
+
|
| 42 |
+
# Declare read-only permissions for workflow contents.
permissions:
  contents: read

jobs:
  # Test ppo qwen3-8b fsdp+vllm
  nightlyCI_ppo-qwen3-8b-fsdp-vllm_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 180 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check npu and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout verl-project/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Prepare GSM8K dataset
        run: |
          python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
      - name: Running nightlyCI_ppo-qwen3-8b-fsdp-vllm_ascend
        run: |
          ray stop --force
          bash tests/special_npu/nightly_ci_ascend/run_ppo_qwen3-8b_fsdp_npu.sh

  # Test grpo qwen25-7b-Instruct fsdp+vllm
  nightlyCI_grpo-qwen25-7b-Instruct-fsdp-vllm_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 180 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check npu and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout verl-project/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Prepare GSM8K dataset
        run: |
          python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
      - name: Running nightlyCI_grpo-qwen25-7b-Instruct-fsdp-vllm_ascend
        run: |
          ray stop --force
          bash tests/special_npu/nightly_ci_ascend/run_grpo_qwen25-7b-instruct_fsdp_npu.sh

  # Test grpo qwen25-vl-3b-Instruct fsdp+vllm
  nightlyCI_grpo-qwen25-vl-3b-Instruct-fsdp-vllm_ascend:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 180 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check npu and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout verl-project/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Preprocess geo3k dataset
        run: |
          python examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/.cache/datasets/hiyouga/geometry3k
      - name: Running nightlyCI_grpo-qwen25-vl-3b-Instruct-fsdp-vllm_ascend
        run: |
          ray stop --force
          bash tests/special_npu/nightly_ci_ascend/run_grpo_qwen25-vl-3b-instruct_fsdp_npu.sh
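The "Check npu and CANN info" step builds an architecture-specific path by interpolating `uname -i` (aarch64 on these runners) into the CANN toolkit directory name. A sketch of the path construction, using `uname -m` here as a portable stand-in:

```shell
# Build the arch-specific path to the CANN install-info file.
arch=$(uname -m)   # aarch64 on the A2/A3 Ascend runners
info="/usr/local/Ascend/ascend-toolkit/latest/${arch}-linux/ascend_toolkit_install.info"
echo "$info"
```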
.github/workflows/npu_unit_tests.yml
ADDED
@@ -0,0 +1,126 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests that are designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with a GPU available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are tested on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#   - `cpu_unit_tests.yml`, runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
#   - `gpu_unit_tests.yml`, runs pytest on all test scripts without the `on_cpu.py` suffix.
#   - `npu_unit_tests.yml`, runs pytest on all test scripts without the `on_cpu.py` suffix on Ascend devices.
#   - Since cpu/gpu/npu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to the workflows mentioned in 2.

name: NPU unit tests

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - .github/workflows/npu_unit_tests.yml
  pull_request:
    branches:
      - main
    paths:
      # The order that you define paths patterns matters:
      # A matching negative pattern (prefixed with !) after a positive match will exclude the path.
      # A matching positive pattern after a negative match will include the path again.
      - "**/*.py"
      # Other entrypoints
      - "!examples/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      - "!recipe/**"
      # Entrypoints
      - .github/workflows/npu_unit_tests.yml
      - "tests/**test_*.py"
      # Ignore CPU tests
      - "!tests/*_on_cpu.py"

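The last-match-wins semantics of the `paths` filter described in the comments above can be roughly emulated in shell: evaluate each pattern in order and let the final matching one decide. The `decide` helper and the subset of patterns are illustrative, not the full filter list:

```shell
# Rough emulation of GitHub's ordered paths filter (three patterns only).
decide() {
  verdict=skip
  case "$1" in *.py) verdict=run ;; esac              # "**/*.py" (positive)
  case "$1" in examples/*) verdict=skip ;; esac       # "!examples/**" (negative)
  case "$1" in tests/*test_*.py) verdict=run ;; esac  # "tests/**test_*.py" (positive)
  echo "$verdict"
}

decide examples/ppo/run.py     # excluded by the negative pattern
decide tests/unit/test_core.py # re-included by the later positive pattern
```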
# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare read-only permissions for workflow contents.
permissions:
  contents: read

jobs:
  npu_unit_tests:
    if: github.repository_owner == 'verl-project'
    runs-on: linux-aarch64-a2b3-8
    timeout-minutes: 60 # Increase this timeout value as needed
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
      options: >-
        --shm-size 16g
    env:
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - name: Check npu and CANN info
        run: |
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
          npu-smi info
      - name: Check initial pip list from image
        run: |
          pip list
      - name: Checkout volcengine/verl repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          clean: true
      - name: Install the current repository
        run: |
          pip install -r requirements-npu.txt
          pip install --no-deps -e .[test]
          pip install mlflow pytest-asyncio
      - name: Check final pip list
        run: |
          pip list
      - name: Prepare weights
        run: |
          ln -s /root/.cache/models ~/models
      - name: Run all NPU unit tests
        run: |
          pytest -s -x --ignore-glob="*test_special_*.py" --ignore-glob="*on_cpu.py" --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob="tests/special*" --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" --ignore-glob="*test_rvdz*" --ignore-glob="*test_ray_collectives*" --ignore-glob="*test_nvtx_profile*" --ignore-glob="tests/checkpoint_engine" --ignore-glob="*test_shared_memory*" --ignore-glob="tests/workers/rollout/rollout_trtllm" --ignore-glob="*test_fsdp_lora_merge*" --ignore-glob="*test_activation_offload*" --ignore-glob="*test_normalize_peft_param_name.py*" tests/
      - name: Testing activation offload
        run: |
          pytest -s -x tests/utils/test_activation_offload.py
      - name: Testing normalize peft param name
        run: |
          pytest -s -x tests/utils/test_normalize_peft_param_name.py
      - name: Testing FSDP2 actor functionality
        run: |
          torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/actor/test_special_dp_actor.py
      - name: Testing FSDP2 critic functionality
        run: |
          torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/critic/test_special_dp_critic.py
      - name: Running NPU profiling unit tests
        run: |
          pytest -s -x tests/utils/test_special_mstx_profile.py
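The long `--ignore-glob` chain in the "Run all NPU unit tests" step drops any collected path that matches at least one pattern. Conceptually that is a first-match exclusion check, which can be sketched in shell (the `excluded` helper is illustrative, and the pattern list is abridged):

```shell
# Return success if the path matches any of the given glob patterns.
excluded() {
  f=$1; shift
  for pat in "$@"; do
    case "$f" in $pat) return 0 ;; esac  # unquoted $pat is used as a glob
  done
  return 1
}

if excluded tests/rollout/test_vllm_spmd.py "*test_vllm*" "*_sglang*" "*on_cpu.py"; then
  echo dropped
else
  echo collected
fi
```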
.github/workflows/pre-commit.yml
ADDED
@@ -0,0 +1,41 @@
# c.f. https://github.com/pre-commit/action?tab=readme-ov-file#using-this-action
name: pre-commit

# No need to avoid / cancel lightweight pre-commit jobs
on:
  schedule:
    - cron: "0 0 * * 0"
  pull_request:
  push:
    branches:
      - main
      - v0.*
  # Allow manual triggering
  workflow_dispatch:

# Declare read-only permissions for workflow contents.
permissions:
  contents: read

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12"]
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install the current repository
        run: |
          pip install pre-commit hydra-core
          pip install --no-deps -e .
      - name: Set ruff --output-format=github
        run: |
          sed -i 's/--output-format=full/--output-format=github/' .pre-commit-config.yaml
          git add .pre-commit-config.yaml
      # Check "--all-files" by default
      - uses: pre-commit/action@v3.0.1
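The "Set ruff --output-format=github" step rewrites the hook config in place so that ruff emits GitHub-annotation-style output. The substitution itself is a plain `sed` replace; the sample config line below is made up for illustration:

```shell
# A stand-in for the ruff args line in .pre-commit-config.yaml.
line='args: ["--fix", "--output-format=full"]'
echo "$line" | sed 's/--output-format=full/--output-format=github/'
```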
.github/workflows/precommit-autofix.yml
ADDED
@@ -0,0 +1,52 @@
name: scheduled pre-commit autofix

on:
  schedule:
    # Every hour
    - cron: "0 * * * *"
  workflow_dispatch:

permissions:
  contents: write
  pull-requests: write

jobs:
  precommit:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install pre-commit
        run: |
          python -m pip install --upgrade pip
          pip install pre-commit hydra-core

      - name: Run pre-commit
        run: |
          pre-commit run --all-files || true

      - name: Create or update PR
        uses: peter-evans/create-pull-request@v6
        with:
          branch: bot/precommit-autofix
          delete-branch: true
          title: "[ci] chore: scheduled pre-commit autofix"
          commit-message: "chore: auto-fix pre-commit issues"
          body: |
            This PR was created automatically by a scheduled GitHub Action.

            - Runs `pre-commit run --all-files`
            - Triggered hourly
          labels: |
            automated
            pre-commit
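The `|| true` in the "Run pre-commit" step matters because pre-commit exits non-zero whenever any hook modifies a file; swallowing that status keeps the step green so the follow-up PR step still runs and commits the fixes. The exit-status behavior in miniature:

```shell
# A failing command (stand-in for `pre-commit run --all-files` after fixes)
# followed by `|| true` yields overall exit status 0.
sh -c 'exit 1' || true
echo "step status: $?"
```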
.github/workflows/reward_model_sglang.yml
ADDED
@@ -0,0 +1,134 @@
# # Tests layout
|
| 2 |
+
|
| 3 |
+
# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
|
| 4 |
+
# - `tests/trainer` for testing functionality related to `verl/trainer`
|
| 5 |
+
# - `tests/models` for testing functionality related to `verl/models`
|
| 6 |
+
# - ...
|
| 7 |
+
|
| 8 |
+
# There are a few folders with `special_` prefix, created for special purposes:
|
| 9 |
+
# - `special_distributed`: unit tests that must run with multiple GPUs
|
| 10 |
+
# - `special_e2e`: end-to-end tests with training/generation scripts
|
| 11 |
+
# - `special_npu`: tests for NPUs
|
| 12 |
+
# - `special_sanity`: a suite of quick sanity tests
|
| 13 |
+
# - `special_standalone`: a set of test that are designed to run in dedicated environments
|
| 14 |
+
|
| 15 |
+
# Accelerators for tests
|
| 16 |
+
# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
|
| 17 |
+
# - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
|
| 18 |
+
|
| 19 |
+
# # Workflow layout
|
| 20 |
+
|
| 21 |
+
# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
|
| 22 |
+
# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
|
| 23 |
+
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
|
| 24 |
+
# 3. End-to-end tests: `e2e_*.yml`
|
| 25 |
+
# 4. Unit tests
|
| 26 |
+
# - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
|
| 27 |
+
# - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
|
| 28 |
+
# - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
|
| 29 |
+
# - new workflow yaml is added to `.github/workflows`
|
| 30 |
+
# - new tests are added to workflow mentioned in 2.
|
| 31 |
+
# name: Check PR Title
|
| 32 |
+
|
| 33 |
+
name: reward_model_sglang
|
| 34 |
+
|
| 35 |
+
on:
|
| 36 |
+
# Trigger the workflow on push or pull request,
|
| 37 |
+
# but only for the main branch
|
| 38 |
+
push:
|
| 39 |
+
branches:
|
| 40 |
+
- main
|
| 41 |
+
- v0.*
|
| 42 |
+
pull_request:
|
| 43 |
+
branches:
|
| 44 |
+
- main
|
| 45 |
+
- v0.*
|
| 46 |
+
paths:
|
| 47 |
+
- "verl/**/*.py"
|
| 48 |
+
# Entrypoints
|
| 49 |
+
- ".github/workflows/reward_model_sglang.yml"
|
| 50 |
+
- "tests/experimental/reward_loop/**"
|
| 51 |
+
|
| 52 |
+
# Cancel jobs on the same ref if a new one is triggered
|
| 53 |
+
concurrency:
|
| 54 |
+
group: ${{ github.workflow }}-${{ github.ref }}
|
| 55 |
+
cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
|
| 56 |
+
|
| 57 |
+
# Declare permissions just read content.
|
| 58 |
+
permissions:
|
| 59 |
+
contents: read
|
| 60 |
+
|
| 61 |
+
env:
|
| 62 |
+
IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
|
| 63 |
+
DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
|
| 64 |
+
|
| 65 |
+
jobs:
|
| 66 |
+
setup:
|
| 67 |
+
if: github.repository_owner == 'verl-project'
|
| 68 |
+
runs-on: ubuntu-latest
|
| 69 |
+
outputs:
|
| 70 |
+
runner-label: ${{ steps.create-runner.outputs.runner-label }}
|
| 71 |
+
mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
|
| 72 |
+
steps:
|
| 73 |
+
- uses: actions/checkout@v4
|
| 74 |
+
- id: create-runner
|
| 75 |
+
uses: volcengine/vemlp-github-runner@v1
|
| 76 |
+
with:
|
| 77 |
+
mode: "create"
|
| 78 |
+
faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
|
| 79 |
+
mlp-image: "${{ env.IMAGE }}"
|
| 80 |
+
|
| 81 |
+
reward_model_sglang:
|
| 82 |
+
needs: setup
|
| 83 |
+
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
|
| 84 |
+
timeout-minutes: 30 # Increase this timeout value as needed
|
| 85 |
+
env:
|
| 86 |
+
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
|
| 87 |
+
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
|
| 88 |
+
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
|
| 89 |
+
HF_ENDPOINT: "https://hf-mirror.com"
|
| 90 |
+
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
|
| 91 |
+
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
|
| 92 |
+
NCCL_SHM_DISABLE: "1"
|
| 93 |
+
NCCL_P2P_DISABLE: "1"
|
| 94 |
+
steps:
|
| 95 |
+
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
|
| 96 |
+
with:
|
| 97 |
+
fetch-depth: 0
|
| 98 |
+
- name: Install the current repository
|
| 99 |
+
run: |
|
| 100 |
+
pip3 install -r requirements-test.txt
|
| 101 |
+
pip3 install --no-deps -e .
|
| 102 |
+
pip3 install sglang-router==0.2.2
|
| 103 |
+
- name: Prepare gsm8k dataset
|
| 104 |
+
run: |
|
| 105 |
+
ray stop --force
|
| 106 |
+
python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_dir ${HOME}/data/gsm8k
|
| 107 |
+
- name: Running sglang generative reward model tests on 8 L20 GPUs
|
| 108 |
+
run: |
|
| 109 |
+
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
|
| 110 |
+
ROLLOUT_NAME=sglang pytest -s -x tests/experimental/reward_loop/test_reward_model_genrm.py
|
| 111 |
+
- name: Running sglang discriminative reward model tests on 8 L20 GPUs
|
| 112 |
+
run: |
|
| 113 |
+
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
|
| 114 |
+
ROLLOUT_NAME=sglang pytest -s -x tests/experimental/reward_loop/test_reward_model_disrm.py
|
| 115 |
+
- name: Running sglang agent loop with reward manager tests on 8 L20 GPUs
|
| 116 |
+
run: |
|
| 117 |
+
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
|
| 118 |
+
ROLLOUT_NAME=sglang pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_standalone.py
|
| 119 |
+
- name: Running sglang agent loop with reward model colocate tests on 8 L20 GPUs
|
| 120 |
+
run: |
|
| 121 |
+
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
|
| 122 |
+
ROLLOUT_NAME=sglang pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_colocate.py
|
| 123 |
+
|
| 124 |
+
cleanup:
|
| 125 |
+
runs-on: ubuntu-latest
|
| 126 |
+
needs: [setup, reward_model_sglang]
|
| 127 |
+
if: always()
|
| 128 |
+
steps:
|
| 129 |
+
- id: destroy-runner
|
| 130 |
+
uses: volcengine/vemlp-github-runner@v1
|
| 131 |
+
with:
|
| 132 |
+
mode: "destroy"
|
| 133 |
+
faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
|
| 134 |
+
mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
|
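The `concurrency` block above cancels superseded runs on every ref except `main`, so pushes to `main` always finish while stale PR runs are cancelled. The `cancel-in-progress` expression `github.ref != 'refs/heads/main'` reduces to a simple string comparison; a shell sketch with a hypothetical `cancel_in_progress` helper:

```shell
# Emulate the `cancel-in-progress` expression for a given ref.
cancel_in_progress() {
  if [ "$1" != "refs/heads/main" ]; then echo true; else echo false; fi
}

cancel_in_progress refs/pull/42/merge  # PR runs may be cancelled
cancel_in_progress refs/heads/main     # main runs always complete
```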
.github/workflows/reward_model_vllm.yml
ADDED
@@ -0,0 +1,134 @@
# # Tests layout
|
| 2 |
+
|
| 3 |
+
# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
|
| 4 |
+
# - `tests/trainer` for testing functionality related to `verl/trainer`
|
| 5 |
+
# - `tests/models` for testing functionality related to `verl/models`
|
| 6 |
+
# - ...
|
| 7 |
+
|
| 8 |
+
# There are a few folders with `special_` prefix, created for special purposes:
|
| 9 |
+
# - `special_distributed`: unit tests that must run with multiple GPUs
|
| 10 |
+
# - `special_e2e`: end-to-end tests with training/generation scripts
|
| 11 |
+
# - `special_npu`: tests for NPUs
|
| 12 |
+
# - `special_sanity`: a suite of quick sanity tests
|
| 13 |
+
# - `special_standalone`: a set of test that are designed to run in dedicated environments
|
| 14 |
+
|
| 15 |
+
# Accelerators for tests
|
| 16 |
+
# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
|
| 17 |
+
# - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
|
| 18 |
+
|
| 19 |
+
# # Workflow layout
|
| 20 |
+
|
| 21 |
+
# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
|
| 22 |
+
# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
|
| 23 |
+
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
|
| 24 |
+
# 3. End-to-end tests: `e2e_*.yml`
|
| 25 |
+
# 4. Unit tests
|
| 26 |
+
# - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
|
| 27 |
+
# - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
|
| 28 |
+
# - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
|
| 29 |
+
# - new workflow yaml is added to `.github/workflows`
|
| 30 |
+
# - new tests are added to workflow mentioned in 2.
|
| 31 |
+
# name: Check PR Title
|
| 32 |
+
|
| 33 |
+
name: reward_model_vllm
|
| 34 |
+
|
| 35 |
+
on:
|
| 36 |
+
# Trigger the workflow on push or pull request,
|
| 37 |
+
# but only for the main branch
|
| 38 |
+
push:
|
| 39 |
+
branches:
|
| 40 |
+
- main
|
| 41 |
+
- v0.*
|
| 42 |
+
pull_request:
|
| 43 |
+
branches:
|
| 44 |
+
- main
|
| 45 |
+
- v0.*
|
| 46 |
+
paths:
|
| 47 |
+
- "verl/**/*.py"
|
| 48 |
+
# Entrypoints
|
| 49 |
+
- ".github/workflows/reward_model_vllm.yml"
|
| 50 |
+
- "tests/experimental/reward_loop/**"
|
| 51 |
+
|
| 52 |
+
# Cancel jobs on the same ref if a new one is triggered
|
| 53 |
+
concurrency:
|
| 54 |
+
group: ${{ github.workflow }}-${{ github.ref }}
|
| 55 |
+
cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
|
| 56 |
+
|
| 57 |
+
# Declare permissions just read content.
|
| 58 |
+
permissions:
|
| 59 |
+
contents: read
|
| 60 |
+
|
| 61 |
+
env:
|
| 62 |
+
IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
|
| 63 |
+
DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
|
| 64 |
+
|
| 65 |
+
jobs:
|
| 66 |
+
setup:
|
| 67 |
+
if: github.repository_owner == 'verl-project'
|
| 68 |
+
runs-on: ubuntu-latest
|
| 69 |
+
outputs:
|
| 70 |
+
runner-label: ${{ steps.create-runner.outputs.runner-label }}
|
| 71 |
+
mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
|
| 72 |
+
steps:
|
| 73 |
+
- uses: actions/checkout@v4
|
| 74 |
+
- id: create-runner
|
| 75 |
+
uses: volcengine/vemlp-github-runner@v1
|
| 76 |
+
with:
|
| 77 |
+
mode: "create"
|
| 78 |
+
faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
|
| 79 |
+
mlp-image: "${{ env.IMAGE }}"
|
| 80 |
+
|
| 81 |
+
reward_model_vllm:
|
| 82 |
+
needs: setup
|
| 83 |
+
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
|
| 84 |
+
timeout-minutes: 30 # Increase this timeout value as needed
|
| 85 |
+
env:
|
| 86 |
+
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
|
| 87 |
+
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
|
| 88 |
+
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
|
| 89 |
+
HF_ENDPOINT: "https://hf-mirror.com"
|
| 90 |
+
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
|
| 91 |
+
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
|
| 92 |
+
NCCL_SHM_DISABLE: "1"
|
| 93 |
+
NCCL_P2P_DISABLE: "1"
|
| 94 |
+
steps:
|
| 95 |
+
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
|
| 96 |
+
with:
|
| 97 |
+
fetch-depth: 0
|
| 98 |
+
- name: Install the current repository
|
| 99 |
+
run: |
|
| 100 |
+
pip3 install -r requirements-test.txt
|
| 101 |
+
pip3 install --no-deps -e .
|
| 102 |
+
- name: Prepare gsm8k dataset
|
| 103 |
+
run: |
|
| 104 |
+
ray stop --force
|
| 105 |
+
python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_dir ${HOME}/data/gsm8k
|
| 106 |
+
- name: Running vllm generative reward model tests on 8 L20 GPUs
|
| 107 |
+
run: |
|
| 108 |
+
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
|
| 109 |
+
ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_reward_model_genrm.py
|
| 110 |
+
- name: Running vllm discriminative reward model tests on 8 L20 GPUs
|
| 111 |
+
run: |
|
| 112 |
+
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
|
| 113 |
+
ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_reward_model_disrm.py
|
| 114 |
+
|
| 115 |
+
- name: Running vllm agent loop with reward manager tests on 8 L20 GPUs
|
| 116 |
+
run: |
|
| 117 |
+
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
|
| 118 |
+
ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_standalone.py
|
| 119 |
+
- name: Running vllm agent loop with reward model colocate tests on 8 L20 GPUs
|
| 120 |
+
run: |
|
| 121 |
+
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
|
| 122 |
+
ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_colocate.py
|
| 123 |
+
|
| 124 |
+
cleanup:
|
| 125 |
+
runs-on: ubuntu-latest
|
| 126 |
+
needs: [setup, reward_model_vllm]
|
| 127 |
+
if: always()
|
| 128 |
+
steps:
|
| 129 |
+
- id: destroy-runner
|
| 130 |
+
uses: volcengine/vemlp-github-runner@v1
|
| 131 |
+
with:
|
| 132 |
+
mode: "destroy"
|
| 133 |
+
faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
|
| 134 |
+
mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
|
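These jobs export HTTP(S) proxies for package and model downloads but `unset` them before launching pytest/Ray, since proxying localhost traffic can break worker RPC between colocated processes. The unset affects only the current shell of that `run` step; the proxy value below is hypothetical:

```shell
# Set a (hypothetical) proxy, then clear all proxy variables as the
# workflow's test steps do before running pytest.
export HTTP_PROXY="http://proxy.internal:8080"
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
echo "HTTP_PROXY=${HTTP_PROXY:-unset}"
```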
.github/workflows/reward_model_vllm_ascend.yml
ADDED
@@ -0,0 +1,113 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests that are designed to run in dedicated environments

# Accelerators for tests
# - By default tests are run with a GPU available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are tested on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#   - `cpu_unit_tests.yml`, runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
#   - `gpu_unit_tests.yml`, runs pytest on all test scripts without the `on_cpu.py` suffix.
#   - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to the workflows mentioned in 2.

name: reward_model_vllm_ascend

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "verl/**/*.py"
      # Entrypoints
      - ".github/workflows/reward_model_vllm_ascend.yml"
      - "tests/experimental/reward_loop/**"

# Cancel jobs on the same ref if a new one is triggered
|
| 53 |
+
concurrency:
|
| 54 |
+
group: ${{ github.workflow }}-${{ github.ref }}
|
| 55 |
+
cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
|
| 56 |
+
|
| 57 |
+
# Declare permissions just read content.
|
| 58 |
+
permissions:
|
| 59 |
+
contents: read
|
| 60 |
+
|
| 61 |
+
jobs:
|
| 62 |
+
reward_model_vllm_ascend:
|
| 63 |
+
if: github.repository_owner == 'verl-project'
|
| 64 |
+
runs-on: linux-aarch64-a2b3-8
|
| 65 |
+
timeout-minutes: 60 # Increase this timeout value as needed
|
| 66 |
+
container:
|
| 67 |
+
image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
|
| 68 |
+
options: >-
|
| 69 |
+
--shm-size 16g
|
| 70 |
+
env:
|
| 71 |
+
HF_ENDPOINT: "https://hf-mirror.com"
|
| 72 |
+
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
|
| 73 |
+
steps:
|
| 74 |
+
- name: Check npu and CANN info
|
| 75 |
+
run: |
|
| 76 |
+
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
|
| 77 |
+
npu-smi info
|
| 78 |
+
- name: Check initial pip list from image
|
| 79 |
+
run: |
|
| 80 |
+
pip list
|
| 81 |
+
- name: Checkout verl-project/verl repo
|
| 82 |
+
uses: actions/checkout@v4
|
| 83 |
+
with:
|
| 84 |
+
fetch-depth: 0
|
| 85 |
+
clean: true
|
| 86 |
+
- name: Install the current repository
|
| 87 |
+
run: |
|
| 88 |
+
pip install -r requirements-npu.txt
|
| 89 |
+
pip install --no-deps -e .[test]
|
| 90 |
+
- name: Check final pip list
|
| 91 |
+
run: |
|
| 92 |
+
pip list
|
| 93 |
+
- name: Prepare weights
|
| 94 |
+
run: |
|
| 95 |
+
ln -s /root/.cache/models ~/models
|
| 96 |
+
- name: Prepare gsm8k dataset
|
| 97 |
+
run: |
|
| 98 |
+
ray stop --force
|
| 99 |
+
python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k --local_dir ${HOME}/data/gsm8k
|
| 100 |
+
- name: Running vllm generative reward model tests on 8 NPUs
|
| 101 |
+
run: |
|
| 102 |
+
ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_reward_model_genrm.py
|
| 103 |
+
- name: Running vllm discriminative reward model tests on 8 NPUs
|
| 104 |
+
run: |
|
| 105 |
+
ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_reward_model_disrm.py
|
| 106 |
+
- name: Running vllm agent loop with reward manager tests on 8 NPUs
|
| 107 |
+
run: |
|
| 108 |
+
ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_standalone.py
|
| 109 |
+
- name: Running vllm agent loop with reward model colocate tests on 8 NPUs
|
| 110 |
+
run: |
|
| 111 |
+
export HCCL_HOST_SOCKET_PORT_RANGE=auto
|
| 112 |
+
export HCCL_NPU_SOCKET_PORT_RANGE=auto
|
| 113 |
+
ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_colocate.py
|
.github/workflows/sanity.yml
ADDED
|
@@ -0,0 +1,108 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with a `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`: runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run everything under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to the workflows mentioned in 2.

name: sanity

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - .github/workflows/sanity.yml
      - "tests/special_sanity/**"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read content.
permissions:
  contents: read

jobs:
  sanity:
    runs-on: ubuntu-latest
    timeout-minutes: 5 # Increase this timeout value as needed
    strategy:
      matrix:
        python-version: ["3.10"]
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install the current repository
        run: |
          pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cpu
          pip3 install -r requirements.txt
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Run sanity test
        run: |
          pytest -s -x tests/special_sanity
      - name: Run license test
        run: |
          python3 tests/special_sanity/check_license.py --directories .
      - name: Assert naming convention
        run: |
          if grep -rIn --exclude-dir=.git --exclude-dir=.github --exclude-dir=venv --exclude-dir=__pycache__ 'veRL' .; then
            echo "Please use verl instead of veRL in the codebase"
            exit 1
          fi
      - name: Assert SGLang naming convention
        run: |
          if grep -rIn --exclude-dir=.git --exclude-dir=.github --exclude-dir=venv --exclude-dir=__pycache__ --exclude=ascend_sglang_best_practices.rst -E 'Sglang|sgLang|sglAng|sglaNg|sglanG' .; then
            echo "Please use SGLang or sglang as the formal name of SGLang rollout engine"
            exit 1
          fi
      - name: Validate test folder structure
        run: python3 tests/special_sanity/validate_structure.py
      - name: Assert documentation requirement for functions
        run: python3 tests/special_sanity/validate_imported_docs.py
      - name: Assert device api usage in verl/verl
        run: python3 tests/special_sanity/check_device_api_usage.py --directory ./verl
      - name: Assert documentation time info
        run: python3 tests/special_sanity/check_docs_time_info.py
      - name: Check docstrings for specified files
        run: python3 tests/special_sanity/check_docstrings.py
      - name: Check DataProto for specified folders
        run: python3 tests/special_sanity/check_dataproto_usage.py -d ./verl/workers/engine
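The two naming-convention steps above are plain grep gates, so they can be run locally before pushing. A minimal sketch (the `check_naming` function name is ours, not part of the repo; the CI step inlines the same `grep` directly):

```shell
# Local approximation of the "Assert naming convention" CI step above.
# Returns non-zero (as the workflow step exits 1) when any file under
# the given directory spells the project name "veRL" instead of "verl".
check_naming() {
  local dir="$1"
  if grep -rIn --exclude-dir=.git --exclude-dir=__pycache__ 'veRL' "$dir"; then
    echo "Please use verl instead of veRL in the codebase"
    return 1
  fi
}
```

Run as `check_naming .` from the repo root; the SGLang casing check is the same pattern with a case-variant `-E` regex and one `.rst` file excluded.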
.github/workflows/scorecard.yml
ADDED
|
@@ -0,0 +1,66 @@
# This workflow uses actions that are not certified by GitHub. They are provided
# by a third-party and are governed by separate terms of service, privacy
# policy, and support documentation.

name: Scorecard supply-chain security
on:
  # For Branch-Protection check. Only the default branch is supported. See
  # https://github.com/ossf/scorecard/blob/main/docs/checks.md#branch-protection
  branch_protection_rule:
  # To guarantee Maintained check is occasionally updated. See
  # https://github.com/ossf/scorecard/blob/main/docs/checks.md#maintained
  schedule:
    - cron: "27 7 * * 1"
  push:
    branches:
      - main
      - v0.*

# Declare default permissions as read only.
permissions: read-all

jobs:
  analysis:
    name: Scorecard analysis
    runs-on: ubuntu-latest
    permissions:
      # Needed to upload the results to code-scanning dashboard.
      security-events: write
      # Needed to publish results and get a badge (see publish_results below).
      id-token: write
      # Uncomment the permissions below if installing in a private repository.
      # contents: read
      # actions: read

    steps:
      - name: "Checkout code"
        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
        with:
          persist-credentials: false

      - name: "Run analysis"
        uses: ossf/scorecard-action@0864cf19026789058feabb7e87baa5f140aac736 # v2.3.1
        with:
          results_file: results.sarif
          results_format: sarif
          # (Optional) "write" PAT token. Uncomment the `repo_token` line below if:
          # - you want to enable the Branch-Protection check on a *public* repository, or
          # - you are installing Scorecard on a *private* repository
          # To create the PAT, follow the steps in https://github.com/ossf/scorecard-action?tab=readme-ov-file#authentication-with-fine-grained-pat-optional.
          # repo_token: ${{ secrets.SCORECARD_TOKEN }}

          # Public repositories:
          #   - Publish results to OpenSSF REST API for easy access by consumers
          #   - Allows the repository to include the Scorecard badge.
          #   - See https://github.com/ossf/scorecard-action#publishing-results.
          # For private repositories:
          #   - `publish_results` will always be set to `false`, regardless
          #     of the value entered here.
          publish_results: true

      # Upload the results to GitHub's code scanning dashboard (optional).
      # Commenting out will disable upload of results to your repo's Code Scanning dashboard
      - name: "Upload to code-scanning"
        uses: github/codeql-action/upload-sarif@9e8d0789d4a0fa9ceb6b1738f7e269594bdd67f0 # v3.28.9
        with:
          sarif_file: results.sarif
.github/workflows/secrets_scan.yml
ADDED
|
@@ -0,0 +1,22 @@
on:
  push:
    branches:
      - main
      - v0.*
  pull_request:

permissions:
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
        with:
          fetch-depth: 0
      - name: Secret Scanning
        uses: trufflesecurity/trufflehog@7dc056a193116ba8d82154bf0549381c8fb8545c # v3.88.14
        with:
          extra_args: --results=verified,unknown
.github/workflows/sgl.yml
ADDED
|
@@ -0,0 +1,165 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with a `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`: runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run everything under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to the workflows mentioned in 2.

name: sgl

on:
  # workflow_dispatch: # Manual
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      - .github/workflows/sgl.yml
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
      - "!verl/workers/**/*dp_*.py"
      # Megatron
      - "!verl/workers/**/megatron_*.py"
      # vLLM
      - "!**/*vllm*"

      # Entrypoints
      - ".github/workflows/sgl.yml"
      - "tests/rollout/*sglang*"
      - "tests/rollout/async_rollout_utils.py"
      - "tests/workers/rollout/*interaction*"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read content.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  sgl:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 35 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: 1
      SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
      NCCL_SHM_DISABLE: "1"
      NCCL_P2P_DISABLE: "1"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install cupy-cuda12x==13.6.0 pytest-asyncio
          pip3 install hf_transfer fastmcp pytest-asyncio
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare gsm8k dataset
        run: |
          ray stop --force
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Test the latest SGLang Rollout async with agent loop
        run: |
          ROLLOUT_NAME=sglang pytest -svvv tests/experimental/agent_loop

  sgl_checkpoint_engine:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 35 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: 1
      SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
      NCCL_SHM_DISABLE: "1"
      NCCL_P2P_DISABLE: "1"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install cupy-cuda12x==13.6.0 pytest-asyncio
          pip3 install hf_transfer fastmcp pytest-asyncio
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Test SGLang ServerAdapter with Checkpoint Engine (NCCL)
        run: |
          ROLLOUT_NAME=sglang pytest -svvv tests/checkpoint_engine/test_special_server_adapter.py

  cleanup:
    runs-on: ubuntu-latest
    needs: [setup, sgl, sgl_checkpoint_engine]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/type-coverage-check.yml
ADDED
|
@@ -0,0 +1,31 @@
name: Type Annotation and Docstring Coverage

on:
  pull_request:
    paths:
      - '**/*.py'
      - '.github/workflows/type-coverage-check.yml'

jobs:
  type-coverage-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # 🚨 Important: fetch full history so `origin/main` is available
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cpu
          pip3 install -r requirements.txt
          pip3 install --no-deps -e .
      - name: Run type annotation coverage check
        run: |
          python3 tests/special_sanity/type_coverage_check.py
      - name: Run docstring coverage check
        run: |
          python3 tests/special_sanity/check_api_docs.py verl
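The `fetch-depth: 0` in the checkout above matters because a coverage check that diffs against `origin/main` needs that ref locally; the checkout action's default shallow clone (depth 1) would not have it. A hedged sketch of the kind of lookup such a script performs (the helper name and exact `git diff` invocation are our illustration, not the script's actual code):

```shell
# List Python files added or modified on the current branch relative
# to a base ref. In a depth-1 shallow clone the base ref is missing
# and this diff would fail, hence fetch-depth: 0 in the workflow.
changed_py_files() {
  git diff --name-only --diff-filter=AM "$1"...HEAD -- '*.py'
}
```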
.github/workflows/vllm.yml
ADDED
|
@@ -0,0 +1,169 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with a `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#    - `cpu_unit_tests.yml`: runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
#    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
#    - Since the cpu/gpu unit tests by default run everything under `tests`, please make sure tests are manually excluded from them when
#      - a new workflow yaml is added to `.github/workflows`
#      - new tests are added to the workflows mentioned in 2.

name: vllm

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
      # Megatron
      - "!verl/workers/**/megatron_*.py"
      # SGLang
      - "!**/*sglang*"
      # Entrypoints
      - ".github/workflows/vllm.yml"
      - "tests/special_e2e/generation"
      - "tests/workers/rollout"
      - "verl/trainer/main_generation.py"
      - "verl/trainer/config/generation.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions to just read content.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'verl-project'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  vllm:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 35 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install --upgrade "transformers<5.0"
      # - name: Download Model to Use
      #   run: |
      #     hf download Qwen/Qwen2.5-0.5B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct
      #     hf download Qwen/Qwen2.5-1.5B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-1.5B-Instruct
      #     hf download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-VL-3B-Instruct
      #     hf download OldKingMeister/Qwen2.5-1.5B-Instruct-YaRN --local-dir ${HOME}/models/OldKingMeister/Qwen2.5-1.5B-Instruct-YaRN
      #     export HF_HUB_OFFLINE=1
      - name: Prepare gsm8k dataset
        run: |
          ray stop --force
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
      - name: Test the latest vLLM Rollout async with agent loop
        run: |
          ROLLOUT_NAME=vllm pytest -svvv tests/experimental/agent_loop
      - name: Test vllm server abort functionality
        run: |
          pytest tests/workers/rollout/rollout_vllm/test_vllm_abort.py -v -s

  vllm_checkpoint_engine:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 35 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install pytest-asyncio
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
          pip3 install --upgrade "transformers<5.0"
          pip3 install cupy-cuda12x==13.6.0
      - name: Test vLLM ServerAdapter with Checkpoint Engine (NCCL)
        run: |
          ROLLOUT_NAME=vllm pytest -svvv tests/checkpoint_engine/test_special_server_adapter.py
      - name: Test bucketed weight transfer
        run: |
          pytest -svvv tests/utils/test_bucketed_weight_transfer.py

  cleanup:
    runs-on: ubuntu-latest
    needs: [setup, vllm, vllm_checkpoint_engine]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.gitignore
ADDED
@@ -0,0 +1,139 @@
**/*.pt
**/checkpoints
**/wget-log
**/_build/
**/*.ckpt
**/outputs
**/*.tar.gz
**/playground
**/wandb

/pyrightconfig.json

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
dataset/*
tensorflow/my_graph/*
.idea/
# C extensions
*.so

# Distribution / packaging
.Python
# env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
tmp/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
pytest.ini
output.txt

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
.venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject

# vscode
.vscode

# Mac
.DS_Store

# vim
*.swp

# emacs
*~

# ckpt
*.lock

# data
*.parquet
/eval/data/


# local logs
logs
log
outputs
.history
/checkpoints/
/outputs/

eval/data/

eval/data/
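The ignore list above mixes recursive globs (`**/*.pt`), anchored directories (`/checkpoints/`), and basename patterns (`*.parquet`). A quick way to sanity-check such patterns is `git check-ignore`; a minimal sketch in a scratch repository (the file names here are illustrative, not from verl, and `git` is assumed to be installed):

```shell
# Sanity-check a few gitignore patterns in a throwaway repo.
demo=$(mktemp -d)
cd "$demo"
git init -q .
printf '**/*.pt\n**/wandb\n*.parquet\n' > .gitignore
# check-ignore matches pathnames against the rules; the files need not exist.
git check-ignore -q models/weights.pt && echo "models/weights.pt is ignored"
git check-ignore -q data/train.parquet && echo "data/train.parquet is ignored"
```

Note that a leading `**/` also matches at the repository root, so `**/*.pt` covers both `weights.pt` and `models/weights.pt`.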
.gitmodules
ADDED
@@ -0,0 +1,3 @@
[submodule "recipe"]
    path = recipe
    url = https://github.com/verl-project/verl-recipe.git
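Because `recipe` lives in a separate repository, a plain clone leaves that directory empty. A sketch of the usual commands, plus a demonstration that `.gitmodules` is ordinary gitconfig syntax and can be queried directly (the `/tmp` path is illustrative):

```shell
# Fetch the submodule after cloning (network access assumed):
#   git clone --recurse-submodules https://github.com/volcengine/verl.git
# or, inside an existing clone:
#   git submodule update --init recipe

# .gitmodules uses gitconfig syntax, so git config can read it:
cat > /tmp/gitmodules_demo <<'EOF'
[submodule "recipe"]
    path = recipe
    url = https://github.com/verl-project/verl-recipe.git
EOF
git config -f /tmp/gitmodules_demo submodule.recipe.url
```

The last command prints the submodule URL, which is handy in scripts that need to mirror or vendor the submodule.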
.pre-commit-config.yaml
ADDED
@@ -0,0 +1,45 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: "v0.12.2"
    hooks:
      - id: ruff
        args: ["--fix", "--show-fixes", "--output-format=full"]
        exclude: ^.*\.(ipynb)$
      - id: ruff-format

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: "v1.17.0"
    hooks:
      - id: mypy

  - repo: local
    hooks:
      - id: autogen-trainer-cfg
        name: Generate and verify verl/trainer/config/_generated_*.yaml
        entry: scripts/generate_trainer_config.sh
        language: script
        pass_filenames: false

  - repo: local
    hooks:
      - id: check-docstrings
        name: Check doc string coverage
        entry: python3 tests/special_sanity/check_docstrings.py
        language: python
        pass_filenames: false

  - repo: local
    hooks:
      - id: check-license
        name: Check license
        entry: python3 tests/special_sanity/check_license.py --directories examples scripts tests verl setup.py
        language: python
        pass_filenames: false

  - repo: local
    hooks:
      - id: compileall
        name: Compile all python files
        entry: sh -c 'PYTHONWARNINGS=error python3 -m compileall -q . -x "(^|[\\/])(\.venv|venv|\.git)([\\/]|$)"'
        language: python
        pass_filenames: false
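The `compileall` hook above is essentially a syntax gate: it byte-compiles every Python file, with `PYTHONWARNINGS=error` promoting warnings to failures. A minimal local reproduction of the core command on a scratch directory (the paths are illustrative):

```shell
# Reproduce the compileall hook's core check on a throwaway directory.
demo=$(mktemp -d)
printf 'x = 1\n' > "$demo/ok.py"
# -q suppresses per-file listing; a syntax error would make the command exit nonzero.
PYTHONWARNINGS=error python3 -m compileall -q "$demo" && echo "compileall passed"
```

In the real hook, the `-x` regex excludes `.venv`, `venv`, and `.git` so only project sources are compiled.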
.readthedocs.yaml
ADDED
@@ -0,0 +1,19 @@
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.11"
    rust: "1.70"

sphinx:
  configuration: docs/conf.py

python:
  install:
    - requirements: docs/requirements-docs.txt
    - method: pip
      path: .
CONTRIBUTING.md
ADDED
@@ -0,0 +1,90 @@
# Contributing to verl

Thank you for considering a contribution to verl! We welcome contributions of any kind - bug fixes, enhancements, documentation improvements, or even just feedback. Whether you're an experienced developer or this is your first open-source project, your help is invaluable.

Your support can take many forms:
- Report issues or unexpected behaviors.
- Suggest or implement new features.
- Improve or expand documentation.
- Review pull requests and assist other contributors.
- Spread the word: share verl in blog posts, social media, or give the repo a ⭐.

## Finding Issues to Contribute

Looking for ways to dive in? Check out these issues:
- [Good first issues](https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22)
- [Call for contribution](https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22call%20for%20contribution%22)

You can also follow the development plan through the [RFC](https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20label%3ARFC) and [Roadmap](https://github.com/volcengine/verl/issues?q=state%3Aopen%20label%3A%22roadmap%22) issue labels.

## Developing

- **Python-only**: install verl via `pip install -e .[test,vllm]` or `pip install -e .[test,sglang]` and iterate quickly. For the full dependency setup, see the verl [installation doc](https://verl.readthedocs.io/en/latest/start/install.html).

## Code Linting and Formatting

We rely on pre-commit to keep our code consistent. To set it up:

```bash
pip install pre-commit
pre-commit install
# for staged changes
pre-commit run
# for all files in the repo
pre-commit run --all-files
# run a specific hook, e.g.:
# pre-commit run --all-files --show-diff-on-failure --color=always <hook-id>
pre-commit run --all-files --show-diff-on-failure --color=always ruff
pre-commit run --all-files --show-diff-on-failure --color=always autogen-trainer-cfg
```

## Testing

Our test suites run on GitHub Actions. Check these workflows for details:
- [GPU unit tests](https://github.com/volcengine/verl/blob/main/.github/workflows/gpu_unit_tests.yml)
- [CPU unit tests](https://github.com/volcengine/verl/blob/main/.github/workflows/cpu_unit_tests.yml)
- [vLLM tests](https://github.com/volcengine/verl/blob/main/.github/workflows/vllm.yml)
- [SGLang tests](https://github.com/volcengine/verl/blob/main/.github/workflows/sgl.yml)

### Adding CI tests

If possible, please add CI test(s) for your new feature:

1. Find the most relevant workflow yml file, which usually corresponds to a `hydra` default config (e.g. `ppo_trainer`, `ppo_megatron_trainer`, `sft_trainer`, etc.).
2. Add related path patterns to the `paths` section if not already included.
3. Minimize the workload of the test script(s) (see existing scripts for examples).

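For step 2, the `paths` filter in the workflow trigger decides which changed files start the job. A hypothetical fragment (the file patterns here are illustrative, not verl's actual list):

```yaml
# Hypothetical trigger fragment: run the workflow only when relevant files change.
on:
  pull_request:
    paths:
      - "verl/workers/rollout/**"
      - "tests/workers/rollout/**"
      - ".github/workflows/my_new_test.yml"
```

Including the workflow file itself in `paths` ensures edits to the CI config are also exercised by CI.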
## Building the Docs

```bash
# Ensure verl is on your PYTHONPATH, e.g.:
pip install -e .[test]

# Install documentation dependencies
cd docs
pip install -r requirements-docs.txt

# Generate HTML docs
make clean
make html

# Preview locally
python -m http.server -d _build/html/
```

Open your browser at http://localhost:8000 to explore the docs.

## Pull Requests & Code Reviews

Thanks for submitting a PR! To streamline reviews:
- Follow our Pull Request Template for title format and checklist.
- Adhere to our pre-commit lint rules and ensure all checks pass.
- Update docs for any user-facing changes.
- Add or update tests in the CI workflows, or explain why tests aren't applicable.

## License

See the [LICENSE](https://github.com/volcengine/verl/blob/main/LICENSE) file for full details.

## Thank You

We appreciate your contributions to verl. Your efforts help make the project stronger and more user-friendly. Happy coding!