LeTue09 committed on
Commit 1faccd4 · 0 parent(s)

initial clean commit

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.

Files changed (50)
  1. .gemini/config.yaml +10 -0
  2. .git-blame-ignore-revs +13 -0
  3. .github/CODEOWNERS +27 -0
  4. .github/ISSUE_TEMPLATE/bug-report.yml +65 -0
  5. .github/ISSUE_TEMPLATE/config.yml +2 -0
  6. .github/ISSUE_TEMPLATE/feature-request.yml +32 -0
  7. .github/PULL_REQUEST_TEMPLATE.md +41 -0
  8. .github/dependabot.yml +9 -0
  9. .github/workflows/README.md +73 -0
  10. .github/workflows/check-pr-title.yml +58 -0
  11. .github/workflows/cpu_unit_tests.yml +118 -0
  12. .github/workflows/doc.yml +101 -0
  13. .github/workflows/docker-build-ascend-a2.yml +84 -0
  14. .github/workflows/docker-build-ascend-a3.yml +84 -0
  15. .github/workflows/e2e_ascend.yml +166 -0
  16. .github/workflows/e2e_fully_async_policy.yml +170 -0
  17. .github/workflows/e2e_one_step_off_policy.yml +171 -0
  18. .github/workflows/e2e_one_step_off_policy_ascend.yml +169 -0
  19. .github/workflows/e2e_ppo_grpo_trainer_trtllm.yml +285 -0
  20. .github/workflows/e2e_ppo_trainer.yml +78 -0
  21. .github/workflows/e2e_ppo_trainer_megatron_sglang.yml +201 -0
  22. .github/workflows/e2e_ppo_trainer_megatron_sglang_2.yml +201 -0
  23. .github/workflows/e2e_ppo_trainer_megatron_vllm.yml +212 -0
  24. .github/workflows/e2e_ppo_trainer_megatron_vllm_2.yml +318 -0
  25. .github/workflows/e2e_ppo_trainer_megatron_vllm_2_ascend.yml +233 -0
  26. .github/workflows/e2e_ppo_trainer_veomni_vllm.yml +153 -0
  27. .github/workflows/e2e_sft_llm.yml +153 -0
  28. .github/workflows/e2e_sft_llm_ascend.yml +160 -0
  29. .github/workflows/e2e_sft_vlm.yml +128 -0
  30. .github/workflows/gpu_unit_tests.yml +137 -0
  31. .github/workflows/model.yml +184 -0
  32. .github/workflows/model_ascend.yml +137 -0
  33. .github/workflows/nightly_ascend.yml +174 -0
  34. .github/workflows/npu_unit_tests.yml +126 -0
  35. .github/workflows/pre-commit.yml +41 -0
  36. .github/workflows/precommit-autofix.yml +52 -0
  37. .github/workflows/reward_model_sglang.yml +134 -0
  38. .github/workflows/reward_model_vllm.yml +134 -0
  39. .github/workflows/reward_model_vllm_ascend.yml +113 -0
  40. .github/workflows/sanity.yml +108 -0
  41. .github/workflows/scorecard.yml +66 -0
  42. .github/workflows/secrets_scan.yml +22 -0
  43. .github/workflows/sgl.yml +165 -0
  44. .github/workflows/type-coverage-check.yml +31 -0
  45. .github/workflows/vllm.yml +169 -0
  46. .gitignore +139 -0
  47. .gitmodules +3 -0
  48. .pre-commit-config.yaml +45 -0
  49. .readthedocs.yaml +19 -0
  50. CONTRIBUTING.md +90 -0
.gemini/config.yaml ADDED
@@ -0,0 +1,10 @@
+ have_fun: false
+ code_review:
+   disable: false
+   comment_severity_threshold: HIGH
+   max_review_comments: -1
+   pull_request_opened:
+     help: false
+     summary: false
+     code_review: true
+ ignore_patterns: []
.git-blame-ignore-revs ADDED
@@ -0,0 +1,13 @@
+ # Local usage: git config blame.ignoreRevsFile .git-blame-ignore-revs
+
+ # [dev] feat: immigrate from yapf & pylint to ruff based on pre-commit
+ # Changed 268 files, +10k/-9k lines. This is the biggest formatter change.
+ b00f77d8559b48d57a33c0132a5ba1c81891a536
+
+ # [ci] refactor: reduce ruff line-length from 300 to 120
+ # Changed 238 files, +6k/-1k lines. Global formatting change.
+ 00a10a8ef389556f957a2f36132b2358fd6a109f
+
+ # [Lint] fix: linting errors in all files
+ # Changed 179 files, +1k/-3k lines. Global lint fix.
+ 8e5ad4688a13de81727c014a3c2e2fb26324bc20
.github/CODEOWNERS ADDED
@@ -0,0 +1,27 @@
+ /docs @eric-haibin-lin @zhaochenyang20 @hongpeng-guo
+ /docs/amd_tutorial @yushengsu-thu
+ /docs/slang_multiturn @zhaochenyang20 @SwordFaith
+ /docs/ascend_tutorial @FightingZhen
+
+ /third_party/sglang @zhaochenyang20 @SwordFaith
+ /third_party/vllm @PeterSH6 @wuxibin89
+
+ /examples/grpo_trainer @vermouth1992 @PeterSH6 @tardis-key @FightingZhen @ji-huazhong
+
+ /verl/single_controller @zw0610 @wuxibin89 @hongpeng-guo
+ /verl/trainer @eric-haibin-lin @vermouth1992 @tongyx361 @PeterSH6
+ /verl/models/mcore @ISEEKYAN @vermouth1992
+ /verl/models/transformers @vermouth1992 @PeterSH6 @tardis-key @FightingZhen @ji-huazhong
+ /verl/workers/engine @eric-haibin-lin @vermouth1992 @ZihengJiang
+ /verl/workers/roles @eric-haibin-lin @vermouth1992 @ZihengJiang
+ /verl/workers/engine/fsdp @eric-haibin-lin @vermouth1992 @ZihengJiang
+ /verl/workers/rollout/vllm_rollout @wuxibin89 @PeterSH6 @chenhaiq
+ /verl/workers/rollout/sglang_rollout @zhaochenyang20 @SwordFaith @chenhaiq
+ /verl/workers/actor/megatron_actor.py @ISEEKYAN @vermouth1992
+ /verl/workers/critic/megatron_critic.py @ISEEKYAN @vermouth1992
+ /verl/workers/megatron_workers.py @ISEEKYAN @vermouth1992
+ /verl/experimental @wuxibin89 @ArronHZG
+
+ /tests/single_controller @zw0610 @wuxibin89
+ /tests/trainer @eric-haibin-lin @vermouth1992 @tongyx361 @PeterSH6
+ /tests/workers/rollout/vllm_rollout @wuxibin89 @PeterSH6 @chenhaiq
.github/ISSUE_TEMPLATE/bug-report.yml ADDED
@@ -0,0 +1,65 @@
+ # modified from https://github.com/huggingface/transformers/blob/main/.github/ISSUE_TEMPLATE/bug-report.yml?plain=1
+ name: "\U0001F41B Bug Report"
+ description: Submit a bug report to help us improve verl
+ labels: [ "bug" ]
+ body:
+   - type: markdown
+     attributes:
+       value: |
+         Thanks for taking the time to fill out this bug report! 🤗
+
+   - type: textarea
+     id: system-info
+     attributes:
+       label: System Info
+       description: Please share your system info with us. You can run the command `python scripts/diagnose.py` and copy-paste its output below.
+       placeholder: verl version, platform, python version, ...
+     validations:
+       required: true
+
+   - type: checkboxes
+     id: information-scripts-examples
+     attributes:
+       label: Information
+       description: 'The problem arises when using:'
+       options:
+         - label: "The official example scripts"
+         - label: "My own modified scripts"
+
+   - type: checkboxes
+     id: information-tasks
+     attributes:
+       label: Tasks
+       description: "The tasks I am working on are:"
+       options:
+         - label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
+         - label: "My own task or dataset (give details below)"
+
+   - type: textarea
+     id: reproduction
+     validations:
+       required: true
+     attributes:
+       label: Reproduction
+       description: |
+         Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
+         Please include relevant config information with your code.
+         If you have code snippets, error messages, stack traces please provide them here as well.
+         Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
+         Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
+       placeholder: |
+         Steps to reproduce the behavior:
+
+         1.
+         2.
+         3.
+
+   - type: textarea
+     id: expected-behavior
+     validations:
+       required: true
+     attributes:
+       label: Expected behavior
+       description: "A clear and concise description of what you would expect to happen."
.github/ISSUE_TEMPLATE/config.yml ADDED
@@ -0,0 +1,2 @@
+ blank_issues_enabled: true
+ version: 0.1
.github/ISSUE_TEMPLATE/feature-request.yml ADDED
@@ -0,0 +1,32 @@
+ # modified from https://github.com/huggingface/transformers/blob/main/.github/ISSUE_TEMPLATE/feature-request.yml?plain=1
+ name: "\U0001F680 Feature request"
+ description: Submit a proposal/request for a new verl feature
+ labels: [ "Feature request" ]
+ body:
+   - type: textarea
+     id: feature-request
+     validations:
+       required: true
+     attributes:
+       label: Feature request
+       description: |
+         A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist.
+
+   - type: textarea
+     id: motivation
+     validations:
+       required: true
+     attributes:
+       label: Motivation
+       description: |
+         Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too.
+
+   - type: textarea
+     id: contribution
+     validations:
+       required: true
+     attributes:
+       label: Your contribution
+       description: |
+         Is there any way that you could help, e.g. by submitting a PR? Make sure to read the CONTRIBUTING.md [readme](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md)
.github/PULL_REQUEST_TEMPLATE.md ADDED
@@ -0,0 +1,41 @@
+ ### What does this PR do?
+
+ > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.
+
+ ### Checklist Before Starting
+
+ - [ ] Search for similar PRs. Paste at least one query link here: ...
+ - [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
+   - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
+   - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
+   - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
+   - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
+   - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
+
+ ### Test
+
+ > For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
+
+ ### API and Usage Example
+
+ > Demonstrate how the API changes if any, and provide usage example(s) if possible.
+
+ ```python
+ # Add code snippet or script demonstrating how to use this
+ ```
+
+ ### Design & Code Changes
+
+ > Demonstrate the high-level design if this PR is complex, and list the specific changes.
+
+ ### Checklist Before Submitting
+
+ > [!IMPORTANT]
+ > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
+
+ - [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
+ - [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
+ - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
+ - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
+ - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
+ - [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
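The title convention above can be approximated with a short regular expression. This is an illustrative sketch, not the repository's actual `tests/special_sanity/check_pr_title.py`, and the module list here is abbreviated:

```python
import re

# Hypothetical approximation of the `[{modules}] {type}: {description}` rule.
MODULES = {"fsdp", "megatron", "sglang", "vllm", "rollout", "trainer", "ci", "doc"}  # abbreviated
TYPES = {"feat", "fix", "refactor", "chore", "test"}

# Optional [BREAKING] prefix, then [module, module, ...], then `type: description`.
TITLE_RE = re.compile(r"^(?:\[BREAKING\])?\[([a-z_]+(?:,\s*[a-z_]+)*)\]\s+([a-z]+):\s+\S.*$")

def title_ok(title: str) -> bool:
    """Check `[{modules}] {type}: {description}`, optionally prefixed with `[BREAKING]`."""
    m = TITLE_RE.match(title)
    if m is None:
        return False
    modules = {mod.strip() for mod in m.group(1).split(",")}
    return modules <= MODULES and m.group(2) in TYPES
```

For example, `title_ok("[BREAKING][fsdp, megatron] feat: dynamic batching")` is accepted, while a bare `"fix typo"` is rejected.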
.github/dependabot.yml ADDED
@@ -0,0 +1,9 @@
+ ## Enable dependabot to check the project's dependencies.
+ ## Dependabot will open pull requests to update dependencies automatically.
+
+ version: 2
+ updates:
+   - package-ecosystem: pip
+     directory: "/"
+     schedule:
+       interval: weekly
.github/workflows/README.md ADDED
@@ -0,0 +1,73 @@
+ ### Adding a New Workflow
+
+ When adding a new workflow for continuous integration (CI), you have two runner options: a fixed runner or an elastic machine from veMLP.
+
+ - **Fixed Runner**: To use a fixed runner, specify it in your workflow using the `runs-on` keyword, like `runs-on: [L20x8]`.
+ - **veMLP Runner**: Opting for a veMLP machine allows you to launch tasks elastically.
+
+ Here is a template to assist you, designed for using veMLP machines. Currently, each workflow needs a `setup` and a `cleanup` job. When using this template, the main parts you need to modify are the `IMAGE` environment variable and the specific job steps.
+
+ ```yaml
+ name: Your Default Workflow
+
+ on:
+   push:
+     branches:
+       - main
+       - v0.*
+   pull_request:
+     branches:
+       - main
+       - v0.*
+     paths:
+       - "**/*.py"
+       - ".github/workflows/template.yml"
+
+ concurrency:
+   group: ${{ github.workflow }}-${{ github.ref }}
+   cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ permissions:
+   contents: read
+
+ env:
+   IMAGE: "your vemlp image"  # e.g. "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
+   DYNAMIC_RUNNER_URL: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"  # public veFaaS API
+
+ jobs:
+   setup:
+     if: github.repository_owner == 'verl-project'
+     runs-on: ubuntu-latest
+     outputs:
+       runner-label: ${{ steps.create-runner.outputs.runner-label }}
+       task-id: ${{ steps.create-runner.outputs.task-id }}
+     steps:
+       - uses: actions/checkout@v4
+       - id: create-runner
+         uses: volcengine/vemlp-github-runner@v1
+         with:
+           mode: "create"
+           faas-url: "${{ env.DYNAMIC_RUNNER_URL }}"
+           image: "${{ env.IMAGE }}"
+
+   your_job:
+     needs: setup
+     runs-on: ["${{ needs.setup.outputs.runner-label || 'default-runner' }}"]
+     steps:
+       xxxx # your jobs
+
+   cleanup:
+     runs-on: ubuntu-latest
+     needs: [setup, your_job]
+     if: always()
+     steps:
+       - id: destroy-runner
+         uses: volcengine/vemlp-github-runner@v1
+         with:
+           mode: "destroy"
+           faas-url: "${{ env.DYNAMIC_RUNNER_URL }}"
+           task-id: "${{ needs.setup.outputs.task-id }}"
+ ```
+
+ ### Model and Dataset
+
+ To avoid CI depending on the network, we pre-download datasets to an NFS volume on the CI machine. The path for models is `${HOME}/models` and the path for datasets is `${HOME}/models/hf_data`.
.github/workflows/check-pr-title.yml ADDED
@@ -0,0 +1,58 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with the `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ #    - `cpu_unit_tests.yml`: runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
+ #    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
+ #    - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
+ #      - a new workflow yaml is added to `.github/workflows`
+ #      - new tests are added to a workflow mentioned in 2.
+
+ on:
+   pull_request:
+     types: [opened, edited, synchronize]
+
+ jobs:
+   check-title:
+     runs-on: ubuntu-latest
+     steps:
+       - name: Checkout code
+         uses: actions/checkout@v4
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: '3.11'
+
+       - name: Run PR title checker
+         run: python3 tests/special_sanity/check_pr_title.py
+         env:
+           PR_TITLE: ${{ github.event.pull_request.title }}
+
+       - name: Run PR description checker
+         run: python3 tests/special_sanity/check_pr_description.py
+         env:
+           PR_TITLE: ${{ github.event.pull_request.title }}
+           GITHUB_EVENT_PATH: ${{ github.event_path }}
.github/workflows/cpu_unit_tests.yml ADDED
@@ -0,0 +1,118 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with the `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ #    - `cpu_unit_tests.yml`: runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
+ #    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
+ #    - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
+ #      - a new workflow yaml is added to `.github/workflows`
+ #      - new tests are added to a workflow mentioned in 2.
+
+ name: cpu_unit_tests
+
+ on:
+   # Trigger the workflow on push or pull request,
+   # but only for the main branch
+   push:
+     branches:
+       - main
+       - v0.*
+   pull_request:
+     branches:
+       - main
+       - v0.*
+     paths:
+       - "**/*.py"
+       - .github/workflows/cpu_unit_tests.yml
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+   group: ${{ github.workflow }}-${{ github.ref }}
+   cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare read-only permissions on contents.
+ permissions:
+   contents: read
+
+ env:
+   IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
+   DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
+
+ jobs:
+   setup:
+     if: github.repository_owner == 'verl-project'
+     runs-on: ubuntu-latest
+     outputs:
+       runner-label: ${{ steps.create-runner.outputs.runner-label }}
+       mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
+     steps:
+       - uses: actions/checkout@v4
+       - id: create-runner
+         uses: volcengine/vemlp-github-runner@v1
+         with:
+           mode: "create"
+           faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+           mlp-image: "${{ env.IMAGE }}"
+
+   cpu_unit_tests:
+     needs: setup
+     runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+     timeout-minutes: 20 # Increase this timeout value as needed
+     env:
+       HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+       HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+       NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+       HF_ENDPOINT: "https://hf-mirror.com"
+       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+       TORCH_COMPILE_DISABLE: 1
+       TORCHINDUCTOR_DISABLE: 1
+     steps:
+       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+         with:
+           fetch-depth: 0
+       - name: Install the current repository
+         run: |
+           pip3 install -r requirements-test.txt
+           pip3 install --no-deps -e .
+           pip3 install --upgrade "transformers>=5.0.0"
+       - name: Download datasets
+         run: |
+           python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
+           python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/hiyouga/geometry3k
+       - name: Running CPU unit tests
+         run: |
+           echo '[pytest]' > pytest.ini
+           echo 'python_files = *_on_cpu.py' >> pytest.ini
+           pytest -s -x --asyncio-mode=auto tests/
+
+   cleanup:
+     runs-on: ubuntu-latest
+     needs: [setup, cpu_unit_tests]
+     if: always()
+     steps:
+       - id: destroy-runner
+         uses: volcengine/vemlp-github-runner@v1
+         with:
+           mode: "destroy"
+           faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+           mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
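The `pytest.ini` written in the `Running CPU unit tests` step restricts collection to files matching `*_on_cpu.py`. A minimal sketch of that selection rule using plain `fnmatch` (pytest matches its `python_files` patterns against the file's base name; the paths below are hypothetical examples):

```python
from fnmatch import fnmatch
from pathlib import Path

def is_cpu_test(path: str, pattern: str = "*_on_cpu.py") -> bool:
    # Mirror pytest's python_files matching on the base file name.
    return fnmatch(Path(path).name, pattern)

files = [
    "tests/trainer/test_engine_on_cpu.py",     # hypothetical paths for illustration
    "tests/trainer/test_engine.py",
    "tests/models/test_tokenizer_on_cpu.py",
]
cpu_tests = [f for f in files if is_cpu_test(f)]
```

Only the two `*_on_cpu.py` files are selected; `gpu_unit_tests.yml` covers the complement.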
.github/workflows/doc.yml ADDED
@@ -0,0 +1,101 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with the `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ #    - `cpu_unit_tests.yml`: runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
+ #    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
+ #    - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
+ #      - a new workflow yaml is added to `.github/workflows`
+ #      - new tests are added to a workflow mentioned in 2.
+
+ name: doc_test
+
+ on:
+   # Trigger the workflow on push or pull request,
+   # but only for the main branch
+   push:
+     branches:
+       - main
+       - v0.*
+   pull_request:
+     branches:
+       - main
+       - v0.*
+     paths:
+       - "**/*.py"
+       - "docs/**"
+       - .github/workflows/doc.yml
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+   group: ${{ github.workflow }}-${{ github.ref }}
+   cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ permissions:
+   contents: read # for checkout
+   pages: write # for deploy-pages
+   id-token: write # for deploy-pages
+
+ jobs:
+   doc_test:
+     runs-on: ubuntu-latest
+     timeout-minutes: 5 # Increase this timeout value as needed
+     strategy:
+       matrix:
+         python-version: ["3.10"]
+     steps:
+       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+       - name: Set up Python ${{ matrix.python-version }}
+         uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
+         with:
+           python-version: ${{ matrix.python-version }}
+       - name: Install the current repository
+         run: |
+           pip3 install -r requirements-test.txt
+           pip3 install --no-deps -e .
+           pip install -r docs/requirements-docs.txt
+
+       - name: Run doc make html
+         run: |
+           cd docs
+           make clean
+           make html SPHINXOPTS="--keep-going -w _build/sphinx.log"
+           if grep -q ": ERROR:" _build/sphinx.log; then
+             echo "🚨 Sphinx doc build contained ERRORs - see _build/sphinx.log"
+             exit 1
+           fi
+           if grep -q "WARNING: document isn't included in any toctree" _build/sphinx.log; then
+             echo "🚨 Sphinx doc build contained WARNING. Please include newly added docs in index.rst. See _build/sphinx.log for details"
+             exit 1
+           fi
+           if grep -q "WARNING: Inline emphasis" _build/sphinx.log; then
+             echo "🚨 Sphinx doc build contained WARNING. Please check inline emphasis is correct. See _build/sphinx.log for details"
+             exit 1
+           fi
+           if grep -q "WARNING: Definition list ends without a blank line" _build/sphinx.log; then
+             echo "🚨 Sphinx doc build contained WARNING. Please check if the indentation is correct. See _build/sphinx.log for details"
+             exit 1
+           fi
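The four `grep` gates in the doc build step can be summarized as a single scan. This is a sketch assuming that matching the same substrings anywhere in the log is sufficient; `sphinx_log_failures` is a hypothetical name, not part of the workflow:

```python
# Substrings the doc workflow treats as fatal in _build/sphinx.log.
FATAL_PATTERNS = [
    ": ERROR:",
    "WARNING: document isn't included in any toctree",
    "WARNING: Inline emphasis",
    "WARNING: Definition list ends without a blank line",
]

def sphinx_log_failures(log_text: str) -> list[str]:
    # Return every fatal pattern that appears anywhere in the build log.
    return [p for p in FATAL_PATTERNS if p in log_text]
```

A non-empty return corresponds to the workflow's `exit 1`.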
.github/workflows/docker-build-ascend-a2.yml ADDED
@@ -0,0 +1,84 @@
+ name: docker-build-ascend-a2
+
+ on:
+   workflow_dispatch:
+   push:
+     branches: ["main"]
+     paths:
+       - "docker/ascend/Dockerfile.ascend_8.5.0_a2"
+       - ".github/workflows/docker-build-ascend-a2.yml"
+   release:
+     types: [published]
+   schedule:
+     - cron: "0 16 * * *"
+
+ jobs:
+   build-ascend-image-a2:
+     if: ${{ github.event_name != 'pull_request' && github.repository_owner == 'verl-project' }}
+     runs-on: ubuntu-latest
+     concurrency:
+       group: ${{ github.workflow }}-${{ github.ref }}-build-ascend-image-a2
+       cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+     steps:
+       - name: Remove unnecessary parts in github actions runners to free up disk space
+         uses: jlumbroso/free-disk-space@v1.3.1
+         with:
+           tool-cache: true
+
+       - name: Checkout code
+         uses: actions/checkout@v4
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: "3.11"
+
+       - name: Get base image name and tag
+         id: base_image
+         run: |
+           BASE_IMAGE_FULL=$(grep '^FROM' ./docker/ascend/Dockerfile.ascend_8.5.0_a2 | head -1 | cut -d' ' -f2)
+           echo "Base image full: $BASE_IMAGE_FULL"
+           BASE_IMAGE_TAG=$(echo "$BASE_IMAGE_FULL" | cut -d':' -f2)
+           echo "Base image tag: $BASE_IMAGE_TAG"
+           NEW_IMAGE_NAME="verl-$BASE_IMAGE_TAG"
+           echo "New image name: $NEW_IMAGE_NAME"
+           echo "base_image_tag=$BASE_IMAGE_TAG" >> "$GITHUB_OUTPUT"
+           echo "new_image_name=$NEW_IMAGE_NAME" >> "$GITHUB_OUTPUT"
+
+       - name: Get image tag
+         id: version
+         run: |
+           BRANCH_NAME=$(echo "${{ github.ref }}" | sed 's/refs\/heads\///g' | sed 's/[^a-zA-Z0-9._-]/_/g')
+           if [ "${{ github.event_name }}" = "release" ]; then
+             echo "tag=${{ steps.base_image.outputs.new_image_name }}-${{ github.event.release.tag_name }}" >> "$GITHUB_OUTPUT"
+           elif [ "$BRANCH_NAME" = "main" ]; then
+             echo "tag=${{ steps.base_image.outputs.new_image_name }}-latest" >> "$GITHUB_OUTPUT"
+           fi
+
+       - name: Set up Docker Buildx
+         uses: docker/setup-buildx-action@v3
+
+       - name: Login to Quay.io
+         uses: docker/login-action@v3
+         with:
+           registry: quay.io
+           username: ${{ secrets.QUAY_USERNAME }}
+           password: ${{ secrets.QUAY_PASSWORD }}
+
+       - name: Clean Docker cache before build
+         run: |
+           docker system prune -a -f --volumes || true
+
+       - name: Build and push images to Quay
+         uses: docker/build-push-action@v6
+         with:
+           context: .
+           platforms: linux/amd64,linux/arm64
+           file: ./docker/ascend/Dockerfile.ascend_8.5.0_a2
+           push: true
+           tags: |
+             quay.io/ascend/verl:${{ steps.version.outputs.tag }}
+           cache-from: type=gha
+           cache-to: type=gha,mode=max
+           build-args: |
+             BUILDKIT_INLINE_CACHE=1
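The `Get base image name and tag` step derives the new image name from the Dockerfile's first `FROM` line via `grep`/`head`/`cut`. An equivalent sketch in Python (a hypothetical helper with the same first-match and `cut -d':' -f2` semantics; the base image in the example is made up):

```python
def derive_image_name(dockerfile_text: str) -> tuple[str, str]:
    """Return (base_image_tag, new_image_name) from the first FROM line,
    mirroring the workflow's grep/head/cut pipeline."""
    for line in dockerfile_text.splitlines():
        if line.startswith("FROM"):
            base = line.split()[1]       # cut -d' ' -f2, e.g. registry/image:tag
            tag = base.split(":")[1]     # cut -d':' -f2
            return tag, f"verl-{tag}"
    raise ValueError("no FROM line found in Dockerfile")
```

Like the shell pipeline, this assumes the first `FROM` reference contains exactly one `:` separating image and tag.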
.github/workflows/docker-build-ascend-a3.yml ADDED
@@ -0,0 +1,84 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ name: docker-build-ascend-a3
+
+ on:
+ workflow_dispatch:
+ push:
+ branches: ["main"]
+ paths:
+ - "docker/ascend/Dockerfile.ascend_8.5.0_a3"
+ - ".github/workflows/docker-build-ascend-a3.yml"
+ release:
+ types: [published]
+ schedule:
+ - cron: "0 19 * * *"
+
+ jobs:
+ build-ascend-image-a3:
+ if: ${{ github.event_name != 'pull_request' && github.repository_owner == 'verl-project' }}
+ runs-on: ubuntu-latest
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}-build-ascend-image-a3
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+ steps:
+ - name: Remove unnecessary parts in GitHub Actions runners to free up disk space
+ uses: jlumbroso/free-disk-space@v1.3.1
+ with:
+ tool-cache: true
+
+ - name: Checkout code
+ uses: actions/checkout@v4
+
+ - name: Set up Python
+ uses: actions/setup-python@v5
+ with:
+ python-version: "3.11"
+
+ - name: Get base image name and tag
+ id: base_image
+ run: |
+ BASE_IMAGE_FULL=$(grep '^FROM' ./docker/ascend/Dockerfile.ascend_8.5.0_a3 | head -1 | cut -d' ' -f2)
+ echo "Base image full: $BASE_IMAGE_FULL"
+ BASE_IMAGE_TAG=$(echo "$BASE_IMAGE_FULL" | cut -d':' -f2)
+ echo "Base image tag: $BASE_IMAGE_TAG"
+ NEW_IMAGE_NAME="verl-$BASE_IMAGE_TAG"
+ echo "New image name: $NEW_IMAGE_NAME"
+ echo "base_image_tag=$BASE_IMAGE_TAG" >> "$GITHUB_OUTPUT"
+ echo "new_image_name=$NEW_IMAGE_NAME" >> "$GITHUB_OUTPUT"
+
+ - name: Get image tag
+ id: version
+ run: |
+ BRANCH_NAME=$(echo "${{ github.ref }}" | sed 's/refs\/heads\///g' | sed 's/[^a-zA-Z0-9._-]/_/g')
+ if [ "${{ github.event_name }}" = "release" ]; then
+ echo "tag=${{ steps.base_image.outputs.new_image_name }}-${{ github.event.release.tag_name }}" >> "$GITHUB_OUTPUT"
+ elif [ "$BRANCH_NAME" = "main" ]; then
+ echo "tag=${{ steps.base_image.outputs.new_image_name }}-latest" >> "$GITHUB_OUTPUT"
+ fi
+
+ - name: Set up Docker Buildx
+ uses: docker/setup-buildx-action@v3
+
+ - name: Login to Quay.io
+ uses: docker/login-action@v3
+ with:
+ registry: quay.io
+ username: ${{ secrets.QUAY_USERNAME }}
+ password: ${{ secrets.QUAY_PASSWORD }}
+
+ - name: Clean Docker cache before build
+ run: |
+ docker system prune -a -f --volumes || true
+
+ - name: Build and push images to Quay
+ uses: docker/build-push-action@v6
+ with:
+ context: .
+ platforms: linux/amd64,linux/arm64
+ file: ./docker/ascend/Dockerfile.ascend_8.5.0_a3
+ push: true
+ tags: |
+ quay.io/ascend/verl:${{ steps.version.outputs.tag }}
+ cache-from: type=gha
+ cache-to: type=gha,mode=max
+ build-args: |
+ BUILDKIT_INLINE_CACHE=1
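The tag-derivation logic shared by the two Docker workflows above can be exercised as a standalone shell sketch: take the first `FROM` line of the Dockerfile, keep the part after the colon, and prefix it with `verl-`. The sample `FROM` line below is a hypothetical stand-in for the real base image.

```shell
# Mirror the "Get base image name and tag" step against a hypothetical
# Dockerfile written to a temp file.
dockerfile=$(mktemp)
printf 'FROM quay.io/ascend/cann:8.5.0-910b-ubuntu22.04-py3.11\n' > "$dockerfile"

# First FROM line, second space-separated field = full base image ref.
BASE_IMAGE_FULL=$(grep '^FROM' "$dockerfile" | head -1 | cut -d' ' -f2)
# Everything after the colon = the base image tag.
BASE_IMAGE_TAG=$(echo "$BASE_IMAGE_FULL" | cut -d':' -f2)
NEW_IMAGE_NAME="verl-$BASE_IMAGE_TAG"

echo "$NEW_IMAGE_NAME"   # prints: verl-8.5.0-910b-ubuntu22.04-py3.11
rm -f "$dockerfile"
```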
.github/workflows/e2e_ascend.yml ADDED
@@ -0,0 +1,166 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with a `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ # - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
+ # - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
+ # - Since the cpu/gpu unit test workflows run all tests under `tests` by default, please make sure tests are manually excluded from them when
+ # - a new workflow yaml is added to `.github/workflows`
+ # - new tests are added to a workflow mentioned in 2.
+
+ name: e2e_ascend
+
+ on:
+ # Trigger the workflow on push or pull request,
+ # but only for the main and v0.* branches
+ push:
+ branches:
+ - main
+ - v0.*
+ pull_request:
+ branches:
+ - main
+ paths:
+ - ".github/workflows/e2e_ascend.yml"
+ - "examples/data_preprocess/**"
+ - "examples/grpo_trainer/**"
+ - "examples/ppo_trainer/**"
+ - "examples/sft/**"
+ - "verl/experimental/one_step_off_policy/**"
+ - "tests/special_npu/**"
+ - "tests/special_sanity/check_device_api_usage.py"
+ - "verl/**"
+ - "pyproject.toml"
+ - "requirements-npu.txt"
+ - "setup.py"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ permissions:
+ contents: read
+
+ jobs:
+ llm_rl_job:
+ if: github.repository_owner == 'verl-project'
+ name: E2E Ascend testing for RL training scenarios of LLM models
+ runs-on: linux-aarch64-a2b3-8
+ timeout-minutes: 120
+ container:
+ image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+ options: >-
+ --shm-size 16g
+ env:
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ steps:
+ - name: Check npu and CANN info
+ run: |
+ cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+ npu-smi info
+ - name: Check initial pip list from image
+ run: |
+ pip list
+ - name: Checkout volcengine/verl repo
+ uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+ clean: true
+ - name: Install the current repository
+ run: |
+ pip install -r requirements-npu.txt
+ pip install -e .
+ - name: Check final pip list
+ run: |
+ pip list
+ - name: Preprocess gsm8k dataset
+ run: |
+ python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
+ - name: Running gsm8k e2e training tests with PPO on ASCEND NPU (FSDP backend)
+ run: |
+ ray stop --force
+ bash tests/special_npu/run_qwen3_06b_ppo.sh
+ rm -rf $HOME/ckpts
+ - name: Running gsm8k e2e training tests with GRPO on ASCEND NPU (FSDP backend)
+ run: |
+ ray stop --force
+ bash tests/special_npu/run_qwen2_5_05b_grpo.sh
+ rm -rf $HOME/ckpts
+ - name: Running gsm8k e2e training tests with GRPO on ASCEND NPU (MindSpeed backend)
+ run: |
+ ray stop --force
+ USE_DIST_CKPT=True bash tests/special_npu/run_qwen2_5_05b_grpo_mindspeed.sh
+ rm -rf $HOME/dist_ckpt/qwen2_5_05b_grpo_mindspeed
+ rm -rf $HOME/ckpts
+ - name: Running gsm8k e2e training tests with GRPO on ASCEND NPU (MindSpeed backend, MoE Model)
+ run: |
+ ray stop --force
+ USE_DIST_CKPT=True USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen3moe_minimal.json DUMMY_MODEL_PATH=$HOME/dist_ckpt/qwen3_30b_grpo_mindspeed bash tests/special_npu/run_qwen3_30b_grpo_mindspeed.sh
+ - name: Running the E2E test with fully_async_policy algorithm (FSDP2)
+ run: |
+ ray stop --force
+ bash tests/special_npu/run_fully_async_policy.sh
+
+ vlm_rl_job:
+ if: github.repository_owner == 'verl-project'
+ name: E2E Ascend testing for RL training scenarios of VLM models
+ runs-on: linux-aarch64-a2b3-8
+ timeout-minutes: 120
+ container:
+ image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+ options: >-
+ --shm-size 16g
+ env:
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ steps:
+ - name: Check npu and CANN info
+ run: |
+ cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+ npu-smi info
+ - name: Check initial pip list from image
+ run: |
+ pip list
+ - name: Checkout volcengine/verl repo
+ uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+ clean: true
+ - name: Install the current repository
+ run: |
+ pip install -r requirements-npu.txt
+ pip install -e .
+ - name: Check final pip list
+ run: |
+ pip list
+ - name: Preprocess geo3k dataset
+ run: |
+ python examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/.cache/datasets/hiyouga/geometry3k
+ - name: Running geo3k e2e training tests with GRPO on ASCEND NPU
+ run: |
+ ray stop --force
+ bash tests/special_npu/run_qwen2_5_vl_3b_npu.sh
+ rm -rf $HOME/ckpts
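The CPU/GPU split described in the workflow header comments (CPU unit tests collect `tests/**/test_*_on_cpu.py`; GPU unit tests collect the remaining test scripts) can be sketched as a small shell loop; the file names below are hypothetical examples:

```shell
# Partition test files the way the cpu/gpu unit-test workflows do:
# names ending in "_on_cpu.py" belong to the CPU suite, every other
# test file belongs to the GPU suite.
CPU_SUITE=""
GPU_SUITE=""
for f in tests/trainer/test_config_on_cpu.py tests/models/test_transformer.py; do
  case "$f" in
    *_on_cpu.py) CPU_SUITE="$CPU_SUITE $f" ;;
    *)           GPU_SUITE="$GPU_SUITE $f" ;;
  esac
done

echo "cpu:$CPU_SUITE"   # prints: cpu: tests/trainer/test_config_on_cpu.py
echo "gpu:$GPU_SUITE"   # prints: gpu: tests/models/test_transformer.py
```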
.github/workflows/e2e_fully_async_policy.yml ADDED
@@ -0,0 +1,170 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with a `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ # - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
+ # - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
+ # - Since the cpu/gpu unit test workflows run all tests under `tests` by default, please make sure tests are manually excluded from them when
+ # - a new workflow yaml is added to `.github/workflows`
+ # - new tests are added to a workflow mentioned in 2.
+
+ name: e2e_fully_async_policy
+
+ on:
+ # Trigger the workflow on push or pull request,
+ # but only for the main and v0.* branches
+ # For push, only anti-patterns are specified for now, so the filter is
+ # more conservative and achieves higher coverage.
+ push:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "**/*.py"
+ - "!**/*.md"
+ - "!**/*.sh"
+ # Other entrypoints
+ - "!examples/*trainer*"
+ - "!tests/**"
+ - "!verl/trainer/main_*.py"
+ - "!verl/trainer/fsdp_sft_trainer.py"
+ - "verl/experimental/fully_async_policy"
+ pull_request:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "**/*.py"
+ - "!**/*.md"
+ - "!**/*.sh"
+ # Other entrypoints
+ - "!examples/**"
+ - "!tests/**"
+ - "!verl/trainer/main_*.py"
+ - "!verl/trainer/fsdp_sft_trainer.py"
+ # Home
+ - "verl/experimental/fully_async_policy"
+ # Entrypoints
+ - ".github/workflows/e2e_fully_async_policy.yml"
+ - "examples/data_preprocess/gsm8k.py"
+ - "tests/special_e2e/run_fully_async_policy.sh"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare read-only permissions for repository contents.
+ permissions:
+ contents: read
+
+ env:
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
+
+ jobs:
+ setup:
+ if: github.repository_owner == 'verl-project'
+ runs-on: ubuntu-latest
+ outputs:
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
+ steps:
+ - uses: actions/checkout@v4
+ - id: create-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "create"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-image: "${{ env.IMAGE }}"
+
+ # Test FSDP2 strategy
+ e2e_fully_async_policy_fsdp2:
+ needs: setup
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+ timeout-minutes: 10 # Increase timeout for async training
+ env:
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ ACTOR_STRATEGY: "fsdp2"
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ with:
+ fetch-depth: 0
+ - name: Install the current repository
+ run: |
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps -e .
+ pip3 install cupy-cuda12x==13.6.0
+ - name: Prepare GSM8K dataset
+ run: |
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
+ - name: Running the E2E test with fully_async_policy algorithm (FSDP2)
+ run: |
+ ray stop --force
+ bash tests/special_e2e/run_fully_async_policy.sh
+
+ # Test Megatron strategy
+ e2e_fully_async_policy_megatron:
+ needs: setup
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+ timeout-minutes: 10 # Increase timeout for async training
+ env:
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ ACTOR_STRATEGY: "megatron"
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ with:
+ fetch-depth: 0
+ - name: Install the current repository
+ run: |
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps -e .
+ pip3 install cupy-cuda12x==13.6.0
+ - name: Prepare GSM8K dataset
+ run: |
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
+ - name: Running the E2E test with fully_async_policy algorithm (Megatron)
+ run: |
+ ray stop --force
+ bash tests/special_e2e/run_fully_async_policy.sh
+
+ cleanup:
+ runs-on: ubuntu-latest
+ needs: [setup, e2e_fully_async_policy_fsdp2, e2e_fully_async_policy_megatron]
+ if: always()
+ steps:
+ - id: destroy-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "destroy"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
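The setup → test → cleanup job chain above, with `if: always()` on the cleanup job, behaves like a try/finally: the dynamic runner is destroyed whether or not the tests succeed. A minimal shell sketch of that control flow, with hypothetical functions and a hypothetical task id standing in for the jobs:

```shell
# Record which task id was "destroyed" so the behavior is observable.
destroy_runner() { DESTROYED="$1"; }
failing_tests()  { return 1; }          # stand-in for a failed test job

run_pipeline() {
  MLP_TASK_ID="mlp-task-42"             # hypothetical id from the setup job
  # Run the test step but remember its status instead of aborting,
  # so cleanup always happens (the `if: always()` analogue).
  if "$1"; then STATUS=0; else STATUS=$?; fi
  destroy_runner "$MLP_TASK_ID"
  return $STATUS
}

run_pipeline failing_tests || echo "tests failed"   # prints: tests failed
echo "$DESTROYED"                                   # prints: mlp-task-42
```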
.github/workflows/e2e_one_step_off_policy.yml ADDED
@@ -0,0 +1,171 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with a `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ # - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
+ # - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
+ # - Since the cpu/gpu unit test workflows run all tests under `tests` by default, please make sure tests are manually excluded from them when
+ # - a new workflow yaml is added to `.github/workflows`
+ # - new tests are added to a workflow mentioned in 2.
+
+ name: e2e_one_step_off_policy
+
+ on:
+ # Trigger the workflow on push or pull request,
+ # but only for the main and v0.* branches
+ # For push, only anti-patterns are specified for now, so the filter is
+ # more conservative and achieves higher coverage.
+ push:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "**/*.py"
+ - "!**/*.md"
+ - "!**/*.sh"
+ # Other entrypoints
+ - "!examples/*trainer*"
+ - "!tests/**"
+ - "!verl/trainer/main_*.py"
+ - "!verl/trainer/fsdp_sft_trainer.py"
+ - "verl/experimental/one_step_off_policy"
+ pull_request:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "**/*.py"
+ - "!**/*.md"
+ - "!**/*.sh"
+ # Other entrypoints
+ - "!examples/**"
+ - "!tests/**"
+ - "!verl/trainer/main_*.py"
+ - "!verl/trainer/fsdp_sft_trainer.py"
+ # Home
+ - "verl/experimental/one_step_off_policy"
+ # Entrypoints
+ - ".github/workflows/e2e_one_step_off_policy.yml"
+ - "examples/data_preprocess/gsm8k.py"
+ - "tests/special_e2e/run_one_step_off_policy.sh"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare read-only permissions for repository contents.
+ permissions:
+ contents: read
+
+ env:
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
+
+ jobs:
+ setup:
+ if: github.repository_owner == 'verl-project'
+ runs-on: ubuntu-latest
+ outputs:
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
+ steps:
+ - uses: actions/checkout@v4
+ - id: create-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "create"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-image: "${{ env.IMAGE }}"
+
+ # Test FSDP2 strategy
+ e2e_one_step_off_policy_fsdp2:
+ needs: setup
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+ timeout-minutes: 10 # Increase timeout for async training
+ env:
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ ACTOR_STRATEGY: "fsdp2"
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ with:
+ fetch-depth: 0
+ - name: Install the current repository
+ run: |
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps -e .
+ pip3 install cupy-cuda12x==13.6.0
+ - name: Prepare GSM8K dataset
+ run: |
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
+ - name: Running the E2E test with one_step_off_policy algorithm (FSDP2)
+ run: |
+ ray stop --force
+ bash tests/special_e2e/run_one_step_off_policy.sh
+
+ # Test Megatron strategy
+ e2e_one_step_off_policy_megatron:
+ needs: setup
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+ timeout-minutes: 10 # Increase timeout for async training
+ env:
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ ACTOR_STRATEGY: "megatron"
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ with:
+ fetch-depth: 0
+ - name: Install the current repository
+ run: |
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps -e .
+ pip3 install cupy-cuda12x==13.6.0
+ - name: Prepare GSM8K dataset
+ run: |
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
+ - name: Running the E2E test with one_step_off_policy algorithm (Megatron)
+ run: |
+ ray stop --force
+ bash tests/special_e2e/run_one_step_off_policy.sh
+
+ cleanup:
+ runs-on: ubuntu-latest
+ needs:
+ [setup, e2e_one_step_off_policy_fsdp2, e2e_one_step_off_policy_megatron]
+ if: always()
+ steps:
+ - id: destroy-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "destroy"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
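The test jobs above select their runner with `${{ needs.setup.outputs.runner-label || 'L20x8' }}`, falling back to the static `L20x8` label when the dynamic runner was not created. The same fallback can be sketched with shell parameter expansion; `RUNNER_LABEL` is a hypothetical stand-in for the step output:

```shell
# Empty value simulates a failed dynamic-runner creation.
RUNNER_LABEL=""
EFFECTIVE_RUNNER="${RUNNER_LABEL:-L20x8}"
echo "$EFFECTIVE_RUNNER"   # prints: L20x8

# A non-empty label wins over the fallback.
RUNNER_LABEL="dynamic-runner-1"
EFFECTIVE_RUNNER="${RUNNER_LABEL:-L20x8}"
echo "$EFFECTIVE_RUNNER"   # prints: dynamic-runner-1
```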
.github/workflows/e2e_one_step_off_policy_ascend.yml ADDED
@@ -0,0 +1,169 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with a `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ # - `cpu_unit_tests.yml`, which runs pytest on all scripts matching `tests/**/test_*_on_cpu.py`
+ # - `gpu_unit_tests.yml`, which runs pytest on all test scripts without the `on_cpu.py` suffix
+ # - Since the cpu/gpu unit test workflows run all tests under `tests` by default, please make sure tests are manually excluded from them when
+ # - a new workflow yaml is added to `.github/workflows`
+ # - new tests are added to a workflow mentioned in 2.
+
+ name: e2e_one_step_off_policy_ascend
+
+ on:
+ # Trigger the workflow on push or pull request,
+ # but only for the main and v0.* branches
+ # For push, only anti-patterns are specified for now, so the filter is
+ # more conservative and achieves higher coverage.
+ push:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "**/*.py"
+ - "!**/*.md"
+ - "!**/*.sh"
+ # Other entrypoints
+ - "!examples/*trainer*"
+ - "!tests/**"
+ - "!verl/trainer/main_*.py"
+ - "!verl/trainer/fsdp_sft_trainer.py"
+ - "verl/experimental/one_step_off_policy"
+ pull_request:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "**/*.py"
+ - "!**/*.md"
+ - "!**/*.sh"
+ # Other entrypoints
+ - "!examples/**"
+ - "!tests/**"
+ - "!verl/trainer/main_*.py"
+ - "!verl/trainer/fsdp_sft_trainer.py"
+ # Home
+ - "verl/experimental/one_step_off_policy"
+ # Entrypoints
+ - ".github/workflows/e2e_one_step_off_policy_ascend.yml"
+ - "examples/data_preprocess/gsm8k.py"
+ - "tests/special_npu/run_one_step_off_policy.sh"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare read-only permissions for repository contents.
+ permissions:
+ contents: read
+
+ jobs:
+ # Test FSDP2 strategy
+ e2e_one_step_off_policy_fsdp2_ascend:
+ if: github.repository_owner == 'verl-project'
+ runs-on: linux-aarch64-a2b3-8
+ timeout-minutes: 60 # Increase this timeout value as needed
+ container:
+ image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+ options: >-
+ --shm-size 16g
+ env:
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ ACTOR_STRATEGY: "fsdp2"
+ steps:
+ - name: Check npu and CANN info
+ run: |
+ cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+ npu-smi info
+ - name: Check initial pip list from image
+ run: |
+ pip list
+ - name: Checkout verl-project/verl repo
+ uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+ clean: true
+ - name: Install the current repository
+ run: |
+ pip install -r requirements-npu.txt
+ pip install --no-deps -e .
+ - name: Check final pip list
+ run: |
+ pip list
+ - name: Prepare weights
+ run: |
+ ln -s /root/.cache/models ~/models
+ - name: Prepare GSM8K dataset
+ run: |
+ python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
+ - name: Running the E2E test with one_step_off_policy algorithm (FSDP2)
+ run: |
+ ray stop --force
+ bash tests/special_npu/run_one_step_off_policy.sh
+
+ # Test Megatron strategy
+ e2e_one_step_off_policy_megatron_ascend:
+ if: github.repository_owner == 'verl-project'
+ runs-on: linux-aarch64-a2b3-8
+ timeout-minutes: 60 # Increase this timeout value as needed
+ container:
+ image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+ options: >-
+ --shm-size 16g
+ env:
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ ACTOR_STRATEGY: "megatron"
+ steps:
+ - name: Check npu and CANN info
+ run: |
+ cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+ npu-smi info
+ - name: Check initial pip list from image
+ run: |
+ pip list
+ - name: Checkout verl-project/verl repo
+ uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+ clean: true
+ - name: Install the current repository
+ run: |
+ pip install -r requirements-npu.txt
+ pip install --no-deps -e .
+ - name: Check final pip list
+ run: |
+ pip list
+ - name: Prepare weights
+ run: |
+ ln -s /root/.cache/models ~/models
+ - name: Prepare GSM8K dataset
+ run: |
+ python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
+ - name: Running the E2E test with one_step_off_policy algorithm (Megatron)
+ run: |
+ ray stop --force
+ bash tests/special_npu/run_one_step_off_policy.sh
.github/workflows/e2e_ppo_grpo_trainer_trtllm.yml ADDED
@@ -0,0 +1,285 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of test that are designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
28
+ # - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
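The `on_cpu.py` suffix convention described above can be sketched in shell (the file names below are illustrative, not the CI's actual commands):

```shell
# Split a file list the way the unit-test workflows do: scripts ending in
# `_on_cpu.py` go to the CPU suite, everything else needs a GPU runner.
files="tests/trainer/test_config_on_cpu.py tests/models/test_engine.py"
cpu_files=""
gpu_files=""
for f in $files; do
  case "$f" in
    *_on_cpu.py) cpu_files="$cpu_files $f" ;;
    *)           gpu_files="$gpu_files $f" ;;
  esac
done
echo "cpu:$cpu_files"
echo "gpu:$gpu_files"
```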
31
+
32
+ name: e2e_ppo_trainer_megatron_trtllm
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main and v0.* branches.
37
+ # For push, for now only anti-patterns are specified so it is more conservative
38
+ # and achieves higher coverage.
39
+ push:
40
+ branches:
41
+ - main
42
+ - v0.*
43
+ paths:
44
+ - "**/*.py"
45
+ # Other entrypoints
46
+ - "!verl/trainer/fsdp_sft_trainer.py"
47
+ # Recipes
48
+ - "!recipe/**"
49
+ # FSDP
50
+ - "!verl/workers/**/*dp_*.py"
51
+ pull_request:
52
+ branches:
53
+ - main
54
+ - v0.*
55
+ paths:
56
+ - "**/*.py"
57
+ # Other entrypoints
58
+ - "!docker/**"
59
+ # Docs
60
+ - "!**/*.md"
61
+ - "!docs/**"
62
+ - "!examples/**"
63
+ - "!tests/**"
64
+ - "!verl/trainer/main_*.py"
65
+ - "!verl/trainer/fsdp_sft_trainer.py"
66
+ # Recipes
67
+ - "!recipe/**"
68
+ # FSDP
69
+ - "!verl/workers/**/*dp_*.py"
70
+ # Entrypoints
71
+ - "verl/workers/rollout/trtllm_rollout/**"
72
+ - "tests/workers/rollout/rollout_trtllm/**"
73
+ - ".github/workflows/e2e_ppo_grpo_trainer_trtllm.yml"
74
+ - "examples/data_preprocess/gsm8k.py"
75
+ - "examples/data_preprocess/geo3k.py"
76
+ - "examples/data_preprocess/dapo_multiturn_w_tool.py"
77
+ - "examples/data_preprocess/aime2024_multiturn_w_tool.py"
78
+ - "examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh"
79
+ - "examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh"
80
+ - "examples/grpo_trainer/run_qwen3-30b_dapo_megatron_fp8_trtllm.sh"
81
+ # add back when ppo flow is ready
82
+ # - "tests/special_e2e/run_ppo_trainer_megatron.sh"
83
+ # - "verl/trainer/main_ppo.py"
84
+ # - "verl/trainer/config/ppo_megatron_trainer.yaml"
85
+
86
+ # Cancel jobs on the same ref if a new one is triggered
87
+ concurrency:
88
+ group: ${{ github.workflow }}-${{ github.ref }}
89
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
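The `cancel-in-progress` expression above cancels in-flight runs on every ref except `refs/heads/main`, so pushes to main always run to completion; a minimal sketch (the ref value is illustrative):

```shell
# Evaluate the same condition as `github.ref != 'refs/heads/main'`.
ref="refs/heads/my-feature"
if [ "$ref" != "refs/heads/main" ]; then
  cancel_in_progress=true
else
  cancel_in_progress=false
fi
echo "cancel-in-progress=$cancel_in_progress"
```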
90
+
91
+ # Declare read-only permissions for repository contents.
92
+ permissions:
93
+ contents: read
94
+
95
+ env:
96
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:trtllm1.3.0rc4"
97
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
98
+
99
+ jobs:
100
+ setup:
101
+ if: github.repository_owner == 'verl-project'
102
+ runs-on: ubuntu-latest
103
+ outputs:
104
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
105
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
106
+ steps:
107
+ - uses: actions/checkout@v4
108
+ - id: create-runner
109
+ uses: volcengine/vemlp-github-runner@v1
110
+ with:
111
+ mode: "create"
112
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
113
+ mlp-image: "${{ env.IMAGE }}"
114
+
115
+ trtllm_unit_tests:
116
+ needs: setup
117
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
118
+ timeout-minutes: 30 # Increase this timeout value as needed
119
+ env:
120
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
121
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
122
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
123
+ HF_ENDPOINT: "https://hf-mirror.com"
124
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
125
+ steps:
126
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
127
+ with:
128
+ fetch-depth: 0
129
+ - name: Install the current repository
130
+ run: |
131
+ pip3 install pytest-asyncio
132
+ pip3 install -r requirements-test.txt
133
+ pip3 install --no-deps -e .
134
+ - name: Run TRTLLM unit tests
135
+ run: |
136
+ export TRTLLM_TEST_MODEL_PATH_ROOT="${HOME}/models"
137
+ ray stop --force
138
+ pytest -v -s \
139
+ tests/workers/rollout/rollout_trtllm/test_adapter.py \
140
+ tests/workers/rollout/rollout_trtllm/test_async_server.py \
141
+ tests/workers/rollout/rollout_trtllm/test_trtllm_rollout_utils.py
142
+
143
+ e2e_grpo_trainer_fsdp-qwen2:
144
+ needs: setup
145
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
146
+ timeout-minutes: 30 # Increase this timeout value as needed
147
+ env:
148
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
149
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
150
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
151
+ HF_ENDPOINT: "https://hf-mirror.com"
152
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
153
+ steps:
154
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
155
+ with:
156
+ fetch-depth: 0
157
+ - name: Install the current repository
158
+ run: |
159
+ pip3 install -r requirements-test.txt
160
+ pip3 install --no-deps -e .
161
+ - name: Prepare GSM8K dataset
162
+ run: |
163
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_save_dir ${PWD}/data/gsm8k
164
+ - name: Running GSM8K E2E training tests with FSDP on 8 L20 GPUs (Qwen)
165
+ run: |
166
+ ray stop --force
167
+ DATADIR=${HOME}/data \
168
+ bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 2 \
169
+ trainer.total_training_steps=1 \
170
+ data.train_files="['${PWD}/data/gsm8k/train.parquet']" \
171
+ data.val_files="['${PWD}/data/gsm8k/test.parquet']" \
172
+ trainer.logger='["console"]' \
173
+ actor_rollout_ref.model.path="${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct"
174
+ - name: clean up
175
+ run: |
176
+ rm -rf checkpoints
177
+
178
+ e2e_grpo_trainer_megatron-qwen2:
179
+ needs: setup
180
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
181
+ timeout-minutes: 30 # Increase this timeout value as needed
182
+ env:
183
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
184
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
185
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
186
+ HF_ENDPOINT: "https://hf-mirror.com"
187
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
188
+ steps:
189
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
190
+ with:
191
+ fetch-depth: 0
192
+ - name: Install the current repository
193
+ run: |
194
+ pip3 install -r requirements-test.txt
195
+ pip3 install --no-deps -e .
196
+ - name: Prepare GSM8K dataset
197
+ run: |
198
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_save_dir ${PWD}/data/gsm8k
199
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
200
+ run: |
201
+ ray stop --force
202
+ DATADIR=${HOME}/data \
203
+ ACTOR_TP=2 \
204
+ bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 2 \
205
+ trainer.total_training_steps=1 \
206
+ data.train_files="['${PWD}/data/gsm8k/train.parquet']" \
207
+ data.val_files="['${PWD}/data/gsm8k/test.parquet']" \
208
+ trainer.logger='["console"]' \
209
+ actor_rollout_ref.model.path="${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct"
210
+ - name: clean up
211
+ run: |
212
+ rm -rf checkpoints
213
+ e2e_grpo_trainer_fsdp-vlm:
214
+ needs: setup
215
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
216
+ timeout-minutes: 30 # Increase this timeout value as needed
217
+ env:
218
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
219
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
220
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
221
+ HF_ENDPOINT: "https://hf-mirror.com"
222
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
223
+ steps:
224
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
225
+ with:
226
+ fetch-depth: 0
227
+ - name: Install the current repository
228
+ run: |
229
+ pip3 install -r requirements-test.txt
230
+ pip3 install --no-deps -e .
231
+ - name: Prepare GEO3K dataset
232
+ run: |
233
+ python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/geo3k --local_save_dir ${PWD}/data/geo3k
234
+ - name: Running GEO3K E2E training tests with FSDP on 8 L20 GPUs (VLM)
235
+ run: |
236
+ ray stop --force
237
+ DATADIR=${HOME}/data \
238
+ bash examples/grpo_trainer/run_qwen2_5_vl_3b_trtllm.sh 2 \
239
+ trainer.total_training_steps=1 \
240
+ data.train_files="['${PWD}/data/geo3k/train.parquet']" \
241
+ data.val_files="['${PWD}/data/geo3k/test.parquet']" \
242
+ trainer.logger='["console"]' \
243
+ actor_rollout_ref.model.path="${HOME}/models/Qwen/Qwen3-VL-2B-Instruct"
244
+ - name: clean up
245
+ run: |
246
+ rm -rf checkpoints
247
+ - name: Prepare DAPO-Math-17k and AIME-2024 datasets (data_preprocess)
248
+ run: |
249
+ python3 examples/data_preprocess/dapo_multiturn_w_tool.py --local_save_dir ${PWD}/data/dapo-math-17k
250
+ python3 examples/data_preprocess/aime2024_multiturn_w_tool.py --local_save_dir ${PWD}/data/aime-2024
251
+ - name: Running DAPO E2E with FP8 TRT-LLM rollout (Qwen3-0.6B)
252
+ run: |
253
+ ray stop --force
254
+ export INFER_TP=2 ACTOR_TP=2 ACTOR_PP=2 ACTOR_VPP=2 ACTOR_EP=1 ACTOR_CP=2 REF_TP=2 REF_PP=2 REF_VPP=2 REF_EP=1 REF_CP=2 GEN_MOE_TP=null GEN_MOE_EP=null
255
+ export NNODES=1 GPUS_PER_NODE=8 TRTLLM_MOE_BACKEND=CUTLASS
256
+ export DATA_DIR=${PWD} DAPO_MATH_TRAIN=${PWD}/data/dapo-math-17k/train.parquet AIME_VAL=${PWD}/data/aime-2024/train.parquet MODEL_PATH=${HOME}/models/Qwen/Qwen3-0.6B
257
+ bash examples/grpo_trainer/run_qwen3-30b_dapo_megatron_fp8_trtllm.sh \
258
+ reward_model.reward_kwargs.overlong_buffer_cfg.len=258 \
259
+ reward_model.reward_kwargs.max_resp_len=512 \
260
+ data.max_prompt_length=512 \
261
+ data.max_response_length=512 \
262
+ data.train_batch_size=32 \
263
+ actor_rollout_ref.rollout.n=4 \
264
+ actor_rollout_ref.rollout.max_num_seqs=16 \
265
+ actor_rollout_ref.rollout.max_num_batched_tokens=1024 \
266
+ actor_rollout_ref.rollout.max_model_len=1024 \
267
+ actor_rollout_ref.actor.megatron.override_transformer_config.moe_grouped_gemm=False \
268
+ actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=False \
269
+ trainer.total_training_steps=1 \
270
+ trainer.logger='["console"]'
271
+ - name: clean up
272
+ run: |
273
+ rm -rf checkpoints
274
+
275
+ cleanup:
276
+ runs-on: ubuntu-latest
277
+ needs: [setup, trtllm_unit_tests, e2e_grpo_trainer_fsdp-qwen2, e2e_grpo_trainer_megatron-qwen2, e2e_grpo_trainer_fsdp-vlm]
278
+ if: always()
279
+ steps:
280
+ - id: destroy-runner
281
+ uses: volcengine/vemlp-github-runner@v1
282
+ with:
283
+ mode: "destroy"
284
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
285
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/e2e_ppo_trainer.yml ADDED
@@ -0,0 +1,78 @@
1
+ name: e2e_ppo_trainer
2
+
3
+ on:
4
+ # Trigger the workflow on push or pull request,
5
+ # but only for the main and v0.* branches
6
+ # For push, for now only anti-patterns are specified so it is more conservative
7
+ # and achieves higher coverage.
8
+ push:
9
+ branches:
10
+ - main
11
+ - v0.*
12
+ paths:
13
+ - "**/*.py"
14
+ # Other entrypoints
15
+ - "!verl/trainer/fsdp_sft_trainer.py"
16
+
17
+ # Megatron
18
+ - "!verl/workers/**/megatron_*.py"
19
+
20
+ pull_request:
21
+ branches:
22
+ - main
23
+ - v0.*
24
+ paths:
25
+ - "**/*.py"
26
+ # Other entrypoints
27
+ - "!**/*.md"
28
+ - "!docker/**"
29
+ - "!examples/**"
30
+ - "!tests/**"
31
+ - "!verl/trainer/main_*.py"
32
+ - "!verl/trainer/fsdp_sft_trainer.py"
33
+ # Docs
34
+ - "!docs/**"
35
+
36
+ # Megatron
37
+ - "!verl/workers/**/megatron_*.py"
38
+ # Entrypoints
39
+ - ".github/workflows/e2e_ppo_trainer.yml"
40
+ - "examples/data_preprocess/gsm8k.py"
41
+ - "examples/data_preprocess/geo3k.py"
42
+ - "tests/special_e2e/ppo_trainer"
43
+ - "verl/trainer/main_ppo.py"
44
+ - "verl/trainer/config/ppo_trainer.yaml"
45
+
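As a rough sketch of how the include and `!` exclude patterns above combine: GitHub evaluates `paths` filters in order, so a later negative pattern overrides an earlier match. The two-pattern re-implementation below is illustrative only:

```shell
# Decide whether a changed path triggers the workflow under two sample
# patterns: include "**/*.py", then exclude "!docs/**" (later pattern wins).
decide() {
  path="$1"
  verdict=excluded
  case "$path" in *.py)   verdict=included ;; esac  # "**/*.py"
  case "$path" in docs/*) verdict=excluded ;; esac  # "!docs/**", listed later
  echo "$verdict"
}
decide "verl/trainer/main_ppo.py"
decide "docs/conf.py"
```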
46
+ # Cancel jobs on the same ref if a new one is triggered
47
+ concurrency:
48
+ group: ${{ github.workflow }}-${{ github.ref }}
49
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
50
+
51
+ # Declare read-only permissions for repository contents.
52
+ permissions:
53
+ contents: read
54
+
55
+ jobs:
56
+ pre_commit_for_ppo:
57
+ runs-on: ubuntu-latest
58
+ strategy:
59
+ matrix:
60
+ python-version: ["3.12"]
61
+ steps:
62
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
63
+ - name: Set up Python ${{ matrix.python-version }}
64
+ uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
65
+ with:
66
+ python-version: ${{ matrix.python-version }}
67
+ - name: Install the current repository
68
+ run: |
69
+ pip install pre-commit hydra-core
70
+ pip3 install --no-deps -e .
71
+ - name: Set ruff --output-format=github
72
+ run: |
73
+ sed -i 's/--output-format=full/--output-format=github/' .pre-commit-config.yaml
74
+ git add .pre-commit-config.yaml
75
+ - uses: pre-commit/action@v3.0.1
76
+ with:
77
+ extra_args: "" # Overriding default "--all-files"
78
+
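The `sed` rewrite in the pre-commit step above swaps ruff's output-format flag inside the config; a minimal sketch on a sample line (the real target is `.pre-commit-config.yaml`):

```shell
# Replace the ruff output-format argument the same way the workflow step does.
cfg='args: [--fix, --output-format=full]'
new=$(printf '%s' "$cfg" | sed 's/--output-format=full/--output-format=github/')
echo "$new"
```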
.github/workflows/e2e_ppo_trainer_megatron_sglang.yml ADDED
@@ -0,0 +1,201 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default, tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
17
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
28
+ # - Since the CPU/GPU unit test workflows run all tests under `tests` by default, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: e2e_ppo_trainer_megatron_sglang
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main and v0.* branches.
37
+ # For push, for now only anti-patterns are specified so it is more conservative
38
+ # and achieves higher coverage.
39
+ push:
40
+ branches:
41
+ - main
42
+ - v0.*
43
+ paths:
44
+ - "**/*.py"
45
+ # Other entrypoints
46
+ - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
47
+ - "!verl/workers/**/*dp_*.py"
48
+ - "!verl/utils/fsdp_utils.py"
49
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
50
+ - "!verl/model_merger/fsdp_model_merger.py"
51
+ pull_request:
52
+ branches:
53
+ - main
54
+ - v0.*
55
+ paths:
56
+ - "**/*.py"
57
+ # Other entrypoints
58
+ - "!docker/**"
59
+ # Docs
60
+ - "!**/*.md"
61
+ - "!docs/**"
62
+ - "!examples/**"
63
+ - "!tests/**"
64
+ - "!verl/trainer/main_*.py"
65
+ - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
66
+ - "!verl/workers/**/*dp_*.py"
67
+ - "!verl/utils/fsdp_utils.py"
68
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
69
+ - "!verl/model_merger/fsdp_model_merger.py"
70
+ # Entrypoints
71
+ - "verl/workers/rollout/sglang_rollout/*"
72
+ - ".github/workflows/e2e_ppo_trainer_megatron_sglang.yml"
73
+ - "examples/data_preprocess/gsm8k.py"
74
+ - "examples/data_preprocess/geo3k.py"
75
+ - "tests/special_e2e/run_ppo_trainer_megatron.sh"
76
+ - "verl/trainer/main_ppo.py"
77
+ - "verl/trainer/config/ppo_megatron_trainer.yaml"
78
+
79
+ # Cancel jobs on the same ref if a new one is triggered
80
+ concurrency:
81
+ group: ${{ github.workflow }}-${{ github.ref }}
82
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
83
+
84
+ # Declare read-only permissions for repository contents.
85
+ permissions:
86
+ contents: read
87
+
88
+ env:
89
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
90
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
91
+
92
+ jobs:
93
+ setup:
94
+ if: github.repository_owner == 'verl-project'
95
+ runs-on: ubuntu-latest
96
+ outputs:
97
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
98
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
99
+ steps:
100
+ - uses: actions/checkout@v4
101
+ - id: create-runner
102
+ uses: volcengine/vemlp-github-runner@v1
103
+ with:
104
+ mode: "create"
105
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
106
+ mlp-image: "${{ env.IMAGE }}"
107
+
108
+ e2e_ppo_trainer_megatron-deepseek:
109
+ needs: setup
110
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
111
+ timeout-minutes: 60 # Increase this timeout value as needed
112
+ env:
113
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
114
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
115
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
116
+ HF_ENDPOINT: "https://hf-mirror.com"
117
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
118
+ ENGINE: sglang
119
+ steps:
120
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
121
+ with:
122
+ fetch-depth: 0
123
+ - name: Install the current repository
124
+ run: |
125
+ pip3 install -r requirements-test.txt
126
+ pip3 install --no-deps -e .
127
+ - name: Prepare GSM8K dataset
128
+ run: |
129
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
130
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
131
+ run: |
132
+ ray stop --force
133
+ OPTIM_MEMORY_EFFICIENT=True ENGINE=sglang SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
134
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek, async rollout)
135
+ run: |
136
+ ray stop --force
137
+ export VLLM_USE_V1=1
138
+ ray start --head
139
+ ENGINE=sglang MODE=async RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct TOTAL_TRAIN_STEPS=2 bash tests/special_e2e/run_ppo_trainer_megatron.sh
140
+ - name: Profiling GRPO GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
141
+ run: |
142
+ ray stop --force
143
+ PROFILE_ENABLE=True ENGINE=sglang ADV_ESTIMATOR=grpo USE_DYNAMIC_BSZ=False MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
144
+ if [ -z "$( ls -A '/tmp/ray/session_latest/logs/nsight/' )" ]; then
145
+ echo "[ERROR] not found any profiling files"
146
+ exit 1
147
+ else
148
+ echo "[SUCCESS] profile success"
149
+ fi
150
+ - name: clean up
151
+ run: |
152
+ rm -rf checkpoints
153
+
154
+ # Qwen3-0.6B: dense, tie_word_embeddings=True
155
+ e2e_ppo_trainer_megatron-qwen3:
156
+ needs: setup
157
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
158
+ timeout-minutes: 60 # Increase this timeout value as needed
159
+ env:
160
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
161
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
162
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
163
+ HF_ENDPOINT: "https://hf-mirror.com"
164
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
165
+ ENGINE: sglang
166
+ steps:
167
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
168
+ with:
169
+ fetch-depth: 0
170
+ - name: Install the current repository
171
+ run: |
172
+ pip3 install -r requirements-test.txt
173
+ pip3 install --no-deps -e .
174
+ - name: Prepare GSM8K dataset
175
+ run: |
176
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
177
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) testing learning rate scheduler
178
+ run: |
179
+ ray stop --force
180
+ ALL_OFFLOAD=True VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 LR_WARMUP_STEPS=1 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
181
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with FP8 rollout
182
+ run: |
183
+ ray stop --force
184
+ export VLLM_USE_V1=1
185
+ ROLLOUT_QUANTIZATION=fp8 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
186
+ - name: clean up
187
+ run: |
188
+ rm -rf checkpoints
189
+
190
+ cleanup:
191
+ runs-on: ubuntu-latest
192
+ needs:
193
+ [setup, e2e_ppo_trainer_megatron-deepseek, e2e_ppo_trainer_megatron-qwen3]
194
+ if: always()
195
+ steps:
196
+ - id: destroy-runner
197
+ uses: volcengine/vemlp-github-runner@v1
198
+ with:
199
+ mode: "destroy"
200
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
201
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/e2e_ppo_trainer_megatron_sglang_2.yml ADDED
@@ -0,0 +1,201 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default, tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
17
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
28
+ # - Since the CPU/GPU unit test workflows run all tests under `tests` by default, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: e2e_ppo_trainer_megatron_sglang_2
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main and v0.* branches.
37
+ # For push, for now only anti-patterns are specified so it is more conservative
38
+ # and achieves higher coverage.
39
+ push:
40
+ branches:
41
+ - main
42
+ - v0.*
43
+ paths:
44
+ - "**/*.py"
45
+ # Other entrypoints
46
+ - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
47
+ - "!verl/workers/**/*dp_*.py"
48
+ - "!verl/utils/fsdp_utils.py"
49
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
50
+ - "!verl/model_merger/fsdp_model_merger.py"
51
+ pull_request:
52
+ branches:
53
+ - main
54
+ - v0.*
55
+ paths:
56
+ - "**/*.py"
57
+ # Other entrypoints
58
+ - "!docker/**"
59
+ # Docs
60
+ - "!**/*.md"
61
+ - "!docs/**"
62
+ - "!examples/**"
63
+ - "!tests/**"
64
+ - "!verl/trainer/main_*.py"
65
+ - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
66
+ - "!verl/workers/**/*dp_*.py"
67
+ - "!verl/utils/fsdp_utils.py"
68
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
69
+ - "!verl/model_merger/fsdp_model_merger.py"
70
+ # Entrypoints
71
+ - "verl/workers/rollout/sglang_rollout/*"
72
+ - ".github/workflows/e2e_ppo_trainer_megatron_sglang_2.yml"
73
+ - "examples/data_preprocess/gsm8k.py"
74
+ - "examples/data_preprocess/geo3k.py"
75
+ - "tests/special_e2e/run_ppo_trainer_megatron.sh"
76
+ - "verl/trainer/main_ppo.py"
77
+ - "verl/trainer/config/ppo_megatron_trainer.yaml"
78
+
79
+ # Cancel jobs on the same ref if a new one is triggered
80
+ concurrency:
81
+ group: ${{ github.workflow }}-${{ github.ref }}
82
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
83
+
84
+ # Declare read-only permissions for repository contents.
85
+ permissions:
86
+ contents: read
87
+
88
+ env:
89
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
90
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
91
+
92
+ jobs:
93
+ setup:
94
+ if: github.repository_owner == 'verl-project'
95
+ runs-on: ubuntu-latest
96
+ outputs:
97
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
98
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
99
+ steps:
100
+ - uses: actions/checkout@v4
101
+ - id: create-runner
102
+ uses: volcengine/vemlp-github-runner@v1
103
+ with:
104
+ mode: "create"
105
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
106
+ mlp-image: "${{ env.IMAGE }}"
107
+
108
+ e2e_ppo_trainer_fsdp_sglang:
109
+ needs: setup
110
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
111
+ timeout-minutes: 40 # Increase this timeout value as needed
112
+ env:
113
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
114
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
115
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
116
+ HF_ENDPOINT: "https://hf-mirror.com"
117
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
118
+ steps:
119
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
120
+ with:
121
+ fetch-depth: 0
122
+ - name: Install the current repository
123
+ run: |
124
+ pip3 install -r requirements-test.txt
125
+ pip3 install --no-deps -e .
126
+ - name: Prepare gsm8k dataset
127
+ run: |
128
+ ray stop --force
129
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
130
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm and save ckpt
131
+ run: |
132
+ ray stop --force
133
+ ENGINE=sglang bash tests/special_e2e/ppo_trainer/run_function_reward.sh
134
+
135
+ e2e_ppo_trainer_fsdp-qwen2_5vl-3b:
136
+ needs: setup
137
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
138
+ timeout-minutes: 60 # Increase this timeout value as needed
139
+ env:
140
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
141
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
142
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
143
+ HF_ENDPOINT: "https://hf-mirror.com"
144
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
145
+ steps:
146
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
147
+ with:
148
+ fetch-depth: 0
149
+ - name: Install the current repository
150
+ run: |
151
+ pip3 install -r requirements-test.txt
152
+ pip3 install --no-deps -e .
153
+ # Geo3k
154
+ - name: Prepare GEO3K dataset
155
+ run: |
156
+ ray stop --force
157
+ python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/hiyouga/geometry3k/
158
+ - name: Running GEO3K VLM E2E training tests on 8 L20 GPUs with rmpad using function rm
159
+ run: |
160
+ ray stop --force
161
+ TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
162
+ MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
163
+ MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
164
+ ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
165
+ ENGINE=sglang ROLLOUT_MODE=async GPU_MEMORY_UTILIZATION=0.6 ACTOR_FSDP_PARAM_OFFLOAD=True \
166
+ ACTOR_FSDP_OPTIMIZER_OFFLOAD=True REF_FSDP_PARAM_OFFLOAD=True \
167
+ bash tests/special_e2e/ppo_trainer/run_function_reward.sh
168
+ - name: Running GEO3K VLM E2E with rmpad using torch fused kernel (Qwen2.5-VL)
169
+ run: |
170
+ ray stop --force
171
+ FUSED_KERNELS=True TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
172
+ MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
173
+ MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
174
+ ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
175
+ ENGINE=sglang ROLLOUT_MODE=async GPU_MEMORY_UTILIZATION=0.6 ACTOR_FSDP_PARAM_OFFLOAD=True \
176
+ ACTOR_FSDP_OPTIMIZER_OFFLOAD=True REF_FSDP_PARAM_OFFLOAD=True \
177
+ bash tests/special_e2e/ppo_trainer/run_function_reward.sh
178
+ - name: Running GEO3K VLM E2E with rmpad using triton fused kernel (Qwen2.5-VL)
179
+ run: |
180
+ ray stop --force
181
+ FUSED_KERNELS=True FUSED_KERNEL_BACKEND=triton \
182
+ TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
183
+ MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
184
+ MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
185
+ ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
186
+ ENGINE=sglang ROLLOUT_MODE=async GPU_MEMORY_UTILIZATION=0.6 ACTOR_FSDP_PARAM_OFFLOAD=True \
187
+ ACTOR_FSDP_OPTIMIZER_OFFLOAD=True REF_FSDP_PARAM_OFFLOAD=True \
188
+ bash tests/special_e2e/ppo_trainer/run_function_reward.sh
189
+
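The steps above configure the shared driver script entirely through environment variables; a hedged sketch of that override pattern (the variable names and defaults below are illustrative, not the script's actual ones):

```shell
# Pass settings through the environment, with defaults applied inside the
# callee, mirroring `ENGINE=sglang ... bash run_function_reward.sh`.
run_with_overrides() {
  engine="${ENGINE:-vllm}"        # default unless the caller overrides it
  adv="${ADV_ESTIMATOR:-gae}"
  echo "engine=$engine adv=$adv"
}
out=$(ENGINE=sglang ADV_ESTIMATOR=grpo run_with_overrides)
echo "$out"
```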
190
+ cleanup:
191
+ runs-on: ubuntu-latest
192
+ needs:
193
+ [setup, e2e_ppo_trainer_fsdp-qwen2_5vl-3b, e2e_ppo_trainer_fsdp_sglang]
194
+ if: always()
195
+ steps:
196
+ - id: destroy-runner
197
+ uses: volcengine/vemlp-github-runner@v1
198
+ with:
199
+ mode: "destroy"
200
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
201
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/e2e_ppo_trainer_megatron_vllm.yml ADDED
@@ -0,0 +1,212 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
28
+ # - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: e2e_ppo_trainer_megatron_vllm
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main branch.
37
+ # For push, for now only anti-patterns are specified so it is more conservative
38
+ # and achieves higher coverage.
39
+ push:
40
+ branches:
41
+ - main
42
+ - v0.*
43
+ paths:
44
+ - "**/*.py"
45
+ # Other entrypoints
46
+ - "!verl/trainer/fsdp_sft_trainer.py"
47
+ # FSDP
48
+ - "!verl/workers/**/*dp_*.py"
49
+ - "!verl/utils/fsdp_utils.py"
50
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
51
+ - "!verl/model_merger/fsdp_model_merger.py"
52
+ pull_request:
53
+ branches:
54
+ - main
55
+ - v0.*
56
+ paths:
57
+ - "**/*.py"
58
+ # Other entrypoints
59
+ - "!docker/**"
60
+ # Docs
61
+ - "!**/*.md"
62
+ - "!docs/**"
63
+ - "!examples/**"
64
+ - "!tests/**"
65
+ - "!verl/trainer/main_*.py"
66
+ - "!verl/trainer/fsdp_sft_trainer.py"
67
+ # FSDP
68
+ - "!verl/workers/**/*dp_*.py"
69
+ - "!verl/utils/fsdp_utils.py"
70
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
71
+ - "!verl/model_merger/fsdp_model_merger.py"
72
+ # Entrypoints
73
+ - ".github/workflows/e2e_ppo_trainer_megatron_vllm.yml"
74
+ - "examples/data_preprocess/gsm8k.py"
75
+ - "examples/data_preprocess/geo3k.py"
76
+ - "tests/special_e2e/run_ppo_trainer_megatron.sh"
77
+ - "verl/trainer/main_ppo.py"
78
+ - "verl/trainer/config/ppo_megatron_trainer.yaml"
79
+
80
+ # Cancel jobs on the same ref if a new one is triggered
81
+ concurrency:
82
+ group: ${{ github.workflow }}-${{ github.ref }}
83
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
84
+
85
+ # Declare permissions just read content.
86
+ permissions:
87
+ contents: read
88
+
89
+ env:
90
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
91
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
92
+
93
+ jobs:
94
+ setup:
95
+ if: github.repository_owner == 'verl-project'
96
+ runs-on: ubuntu-latest
97
+ outputs:
98
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
99
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
100
+ steps:
101
+ - uses: actions/checkout@v4
102
+ - id: create-runner
103
+ uses: volcengine/vemlp-github-runner@v1
104
+ with:
105
+ mode: "create"
106
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
107
+ mlp-image: "${{ env.IMAGE }}"
108
+
109
+ # deepseek-ai/deepseek-coder-1.3b-instruct: dense, tie_word_embeddings=False
110
+ e2e_ppo_trainer_megatron-deepseek:
111
+ needs: setup
112
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
113
+ timeout-minutes: 60 # Increase this timeout value as needed
114
+ env:
115
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
116
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
117
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
118
+ HF_ENDPOINT: "https://hf-mirror.com"
119
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
120
+ steps:
121
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
122
+ with:
123
+ fetch-depth: 0
124
+ - name: Install the current repository
125
+ run: |
126
+ pip3 install -r requirements-test.txt
127
+ pip3 install --no-deps --force-reinstall .
128
+ pip3 install mbridge
129
+ pip3 install math-verify
130
+ - name: Prepare GSM8K dataset
131
+ run: |
132
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
133
+ # Full training save&load
134
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron, use mbridge e2e to pre-load and save (Deepseek)
135
+ run: |
136
+ ray stop --force
137
+ ALL_OFFLOAD=True SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 USE_MBRIDGE=True USE_DIST_CKPT=False \
138
+ bash tests/special_e2e/run_ppo_trainer_megatron.sh
139
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron, use mbridge e2e to pre-load and save (Deepseek)
140
+ run: |
141
+ ray stop --force
142
+ RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct TOTAL_TRAIN_STEPS=2 SAVE_FREQ=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 USE_MBRIDGE=True USE_DIST_CKPT=False \
143
+ bash tests/special_e2e/run_ppo_trainer_megatron.sh
144
+ # LoRA training save&load
145
+ - name: clean up and install Megatron-Bridge
146
+ run: |
147
+ rm -rf checkpoints
148
+ pip3 install git+https://github.com/NVIDIA-NeMo/Megatron-Bridge.git@83a7c11 --no-deps --no-build-isolation
149
+ pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@5455f0a --no-deps --no-build-isolation
150
+ pip3 install "nvidia-modelopt[torch]>=0.37.0" transformers==4.57.1
151
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron, use Megatron-Bridge LoRA e2e to pre-load and save (Deepseek)
152
+ run: |
153
+ ray stop --force
154
+ ALL_OFFLOAD=True SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct COMMON_PP=4 LORA_RANK=8 COMMON_VPP=null COMMON_CP=1 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False USE_DIST_CKPT=False \
155
+ bash tests/special_e2e/run_ppo_trainer_megatron.sh
156
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron, use Megatron-Bridge LoRA e2e to pre-load and save (Deepseek)
157
+ run: |
158
+ ray stop --force
159
+ RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct TOTAL_TRAIN_STEPS=2 SAVE_FREQ=1 COMMON_PP=4 LORA_RANK=8 COMMON_VPP=null COMMON_CP=1 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False USE_DIST_CKPT=False \
160
+ bash tests/special_e2e/run_ppo_trainer_megatron.sh
161
+ - name: clean up
162
+ run: |
163
+ rm -rf checkpoints
164
+
165
+ # Qwen3-0.6B: dense, tie_word_embeddings=True
166
+ e2e_ppo_trainer_megatron-qwen3:
167
+ needs: setup
168
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
169
+ timeout-minutes: 60 # Increase this timeout value as needed
170
+ env:
171
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
172
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
173
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
174
+ HF_ENDPOINT: "https://hf-mirror.com"
175
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
176
+ steps:
177
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
178
+ with:
179
+ fetch-depth: 0
180
+ - name: Install the current repository
181
+ run: |
182
+ pip3 install -r requirements-test.txt
183
+ pip3 install --no-deps -e .
184
+ pip3 install math-verify
185
+ - name: Prepare GSM8K dataset
186
+ run: |
187
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
188
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) testing learning rate scheduler
189
+ run: |
190
+ ray stop --force
191
+ ALL_OFFLOAD=True VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 LR_WARMUP_STEPS=1 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
192
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with FP8 rollout
193
+ run: |
194
+ ray stop --force
195
+ export VLLM_USE_V1=1
196
+ ROLLOUT_QUANTIZATION=fp8 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
197
+ - name: clean up
198
+ run: |
199
+ rm -rf checkpoints
200
+
201
+ cleanup:
202
+ runs-on: ubuntu-latest
203
+ needs:
204
+ [setup, e2e_ppo_trainer_megatron-deepseek, e2e_ppo_trainer_megatron-qwen3]
205
+ if: always()
206
+ steps:
207
+ - id: destroy-runner
208
+ uses: volcengine/vemlp-github-runner@v1
209
+ with:
210
+ mode: "destroy"
211
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
212
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
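The test steps in this workflow all drive a single entry script (`tests/special_e2e/run_ppo_trainer_megatron.sh`) and configure it purely through inline environment variables. A minimal sketch of that override pattern, with variable names taken from the steps above but default values that are illustrative placeholders, not the script's actual defaults:

```shell
# Inside a script like run_ppo_trainer_megatron.sh, each knob typically
# falls back to a default when the CI step does not set it.
# NOTE: these defaults are invented for illustration.
MODEL_ID=${MODEL_ID:-Qwen/Qwen2.5-0.5B}
TOTAL_TRAIN_STEPS=${TOTAL_TRAIN_STEPS:-1}
SAVE_FREQ=${SAVE_FREQ:--1}
echo "model=${MODEL_ID} steps=${TOTAL_TRAIN_STEPS} save_freq=${SAVE_FREQ}"
# A CI step overrides any subset inline, e.g.:
#   RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct \
#     bash tests/special_e2e/run_ppo_trainer_megatron.sh
```

This is why the workflow can reuse one script across the Deepseek, Qwen3, FP8, and LoRA variants: each step is just a different set of exported variables.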
.github/workflows/e2e_ppo_trainer_megatron_vllm_2.yml ADDED
@@ -0,0 +1,318 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with the `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ # - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+ # - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
+ # - Since CPU/GPU unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
+ # - a new workflow yaml is added to `.github/workflows`
+ # - new tests are added to the workflows mentioned in 2.
+
+ name: e2e_ppo_trainer_megatron_vllm_2
+
+ on:
+ # Trigger the workflow on push or pull request,
+ # but only for the main and v0.* branches.
+ # For push, for now only anti-patterns are specified, so it is more conservative
+ # and achieves higher coverage.
+ push:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "**/*.py"
+ # Other entrypoints
+ - "!verl/trainer/fsdp_sft_trainer.py"
+ # FSDP
+ - "!verl/workers/**/*dp_*.py"
+ - "!verl/utils/fsdp_utils.py"
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
+ - "!verl/model_merger/fsdp_model_merger.py"
+ pull_request:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "**/*.py"
+ # Other entrypoints
+ - "!docker/**"
+ # Docs
+ - "!**/*.md"
+ - "!docs/**"
+ - "!examples/**"
+ - "!tests/**"
+ - "!verl/trainer/main_*.py"
+ - "!verl/trainer/fsdp_sft_trainer.py"
+ # FSDP
+ - "!verl/workers/**/*dp_*.py"
+ - "!verl/utils/fsdp_utils.py"
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
+ - "!verl/model_merger/fsdp_model_merger.py"
+ # Entrypoints
+ - ".github/workflows/e2e_ppo_trainer_megatron_vllm_2.yml"
+ - "examples/data_preprocess/gsm8k.py"
+ - "examples/data_preprocess/geo3k.py"
+ - "tests/special_e2e/run_ppo_trainer_megatron.sh"
+ - "verl/trainer/main_ppo.py"
+ - "verl/trainer/config/ppo_megatron_trainer.yaml"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare permissions: read repository contents only.
+ permissions:
+ contents: read
+
+ env:
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
+
+ jobs:
+ setup:
+ if: github.repository_owner == 'verl-project'
+ runs-on: ubuntu-latest
+ outputs:
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
+ steps:
+ - uses: actions/checkout@v4
+ - id: create-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "create"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-image: "${{ env.IMAGE }}"
+
+ e2e_ppo_trainer_megatron-moe-expert-parallel:
+ needs: setup
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+ timeout-minutes: 60 # Increase this timeout value as needed
+ env:
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ with:
+ fetch-depth: 0
+ - name: Install the current repository
+ run: |
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps --force-reinstall .
+ pip3 install git+https://github.com/NVIDIA-NeMo/Megatron-Bridge.git@83a7c11 --no-deps --no-build-isolation
+ pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@5455f0a --no-deps --no-build-isolation
+ pip3 install "nvidia-modelopt[torch]>=0.37.0" transformers==4.57.1
+ - name: Prepare GSM8K dataset
+ run: |
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron-Bridge (Qwen3-30B-A3B-Instruct-2507)
+ run: |
+ ray stop --force
+ ADV_ESTIMATOR=grpo USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen2moe_minimal.json \
+ PPO_MAX_TOKEN_LEN=1024 FWD_MAX_TOKEN_LEN=1024 \
+ MAX_PROMPT_LENGTH=512 MAX_RESPONSE_LENGTH=512 \
+ MODEL_ID=Qwen/Qwen3-30B-A3B-Instruct-2507 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False \
+ COMMON_PP=2 COMMON_VPP=null COMMON_CP=1 COMMON_TP=4 COMMON_EP=4 COMMON_ETP=1 INFER_TP=8 \
+ USE_DIST_CKPT=True ALL_OFFLOAD=True SKIP_SAVE_HF_MODEL=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh
+ - name: Running GSM8K E2E training tests with 3D parallelism with FP8 rollout on 8 L20 GPUs with Megatron-Bridge (Qwen3-30B-A3B-Instruct-2507)
+ run: |
+ ray stop --force
+ ADV_ESTIMATOR=grpo USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen2moe_minimal.json \
+ PPO_MAX_TOKEN_LEN=1024 FWD_MAX_TOKEN_LEN=1024 \
+ MAX_PROMPT_LENGTH=512 MAX_RESPONSE_LENGTH=512 \
+ MODEL_ID=Qwen/Qwen3-30B-A3B-Instruct-2507 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False \
+ COMMON_PP=2 COMMON_VPP=null COMMON_CP=1 COMMON_TP=4 COMMON_EP=4 COMMON_ETP=1 INFER_TP=2 \
+ USE_DIST_CKPT=True ALL_OFFLOAD=True SKIP_SAVE_HF_MODEL=1 ROLLOUT_QUANTIZATION=fp8 bash tests/special_e2e/run_ppo_trainer_megatron.sh
+ - name: clean up
+ run: |
+ rm -rf checkpoints
+ - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron-Bridge LoRA (Qwen3-30B-A3B-Instruct-2507)
+ run: |
+ ray stop --force
+ ADV_ESTIMATOR=grpo USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen2moe_minimal.json \
+ PPO_MAX_TOKEN_LEN=1024 FWD_MAX_TOKEN_LEN=1024 \
+ MAX_PROMPT_LENGTH=512 MAX_RESPONSE_LENGTH=512 LORA_RANK=8 CRITIC_LORA_RANK=8 \
+ MODEL_ID=Qwen/Qwen3-30B-A3B-Instruct-2507 USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False \
+ COMMON_PP=2 COMMON_VPP=null COMMON_CP=1 COMMON_TP=4 COMMON_EP=2 COMMON_ETP=1 INFER_TP=8 \
+ USE_DIST_CKPT=False LORA_MERGE=True ALL_OFFLOAD=True SKIP_SAVE_HF_MODEL=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh
+ - name: clean up LoRA checkpoints
+ run: |
+ rm -rf checkpoints
+
+ e2e_ppo_trainer_fsdp_vllm:
+ needs: setup
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+ timeout-minutes: 60 # Increase this timeout value as needed
+ env:
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ with:
+ fetch-depth: 0
+ - name: Install the current repository
+ run: |
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps -e .
+ - name: Prepare GSM8K dataset
+ run: |
+ ray stop --force
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
+ # Function RM
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm with validation and saving (FSDP_SIZE=8)
+ run: |
+ ray stop --force
+ VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-fsdp-size8" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm after resuming
+ run: |
+ ray stop --force
+ RESUME_MODE=auto VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-fsdp-size8" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ - name: Test merging FSDP checkpoints (Qwen Actor)
+ run: |
+ exp_name="qwen2.5-0.5b-function-reward-minimal-fsdp-size8"
+ python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm with validation and saving (DDP_SIZE=2, FSDP_SIZE=4)
+ run: |
+ ray stop --force
+ VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True FSDP_SIZE=4 USE_KL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-ddp-size2-fsdp-size4" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ - name: Test merging DDP+FSDP checkpoints (Qwen Actor)
+ run: |
+ exp_name="qwen2.5-0.5b-function-reward-minimal-ddp-size2-fsdp-size4"
+ python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm with validation and saving (FSDP2)
+ run: |
+ ray stop --force
+ VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-fsdp2-size8" STRATEGY=fsdp2 bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ - name: Test merging FSDP2 checkpoints (Qwen Actor)
+ run: |
+ exp_name="qwen2.5-0.5b-function-reward-minimal-fsdp2-size8"
+ python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
+ - name: Running GSM8K E2E without rmpad using function rm
+ run: |
+ ray stop --force
+ RM_PAD=False bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm (GRPO)
+ run: |
+ ray stop --force
+ CUSTOM_REWARD_FN=True ADV_ESTIMATOR=grpo USE_KL=True bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ # - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm (ReMax)
+ # run: |
+ # ray stop --force
+ # ADV_ESTIMATOR=remax USE_KL=True bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ # LoRA tests
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with grpo lora using function rm with use_shm
+ run: |
+ ray stop --force
+ ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with grpo lora using function rm with use_shm and layered_summon
+ run: |
+ ray stop --force
+ ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors LAYERED_SUMMON=True TOTAL_TRAIN_STEPS=1 SAVE_FREQ=1 FSDP_SIZE=4 VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ - name: Test GRPO LoRA checkpoints merging function
+ run: |
+ export EXP_NAME="qwen2.5-0.5b-function-reward-minimal"
+ ls checkpoints/verl-test/${EXP_NAME}/global_step_1/actor
+ cat checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/huggingface/config.json
+ python3 -m verl.model_merger merge --backend fsdp --local_dir checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/ --target_dir checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/huggingface
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with grpo lora using function rm with use_shm and layered_summon with fsdp2
+ run: |
+ ray stop --force
+ ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors LAYERED_SUMMON=True STRATEGY=fsdp2 bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+
+ e2e_ppo_trainer_fsdp-qwen2_5vl-3b:
+ needs: setup
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+ timeout-minutes: 40 # Increase this timeout value as needed
+ env:
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ with:
+ fetch-depth: 0
+ - name: Install the current repository
+ run: |
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps -e .
+ # Geo3k
+ - name: Prepare GEO3K dataset
+ run: |
+ python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/hiyouga/geometry3k/
+ - name: Running GEO3K VLM GRPO E2E training tests on 8 L20 GPUs with rmpad using function rm
+ run: |
+ ray stop --force
+ TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
+ MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
+ MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
+ ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
+ SP_SIZE=2 \
+ bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+
+ - name: Running GEO3K VLM PPO E2E training tests on 8 L20 GPUs with rmpad using function rm
+ run: |
+ ray stop --force
+ TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
+ MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
+ MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
+ ADV_ESTIMATOR=gae RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
+ SP_SIZE=2 \
+ bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+ - name: Running GEO3K VLM GRPO E2E lora training tests on 8 L20 GPUs with rmpad using function rm
+ run: |
+ ray stop --force
+ TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
+ MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
+ MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
+ ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
+ SP_SIZE=2 \
+ LORA_RANK=32 LORA_EXCLUDE=".*visual.*" \
+ bash tests/special_e2e/ppo_trainer/run_function_reward.sh
+
+ cleanup:
+ runs-on: ubuntu-latest
+ needs:
+ [
+ setup,
+ e2e_ppo_trainer_megatron-moe-expert-parallel,
+ e2e_ppo_trainer_fsdp-qwen2_5vl-3b,
+ e2e_ppo_trainer_fsdp_vllm,
+ ]
+ if: always()
+ steps:
+ - id: destroy-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "destroy"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
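The checkpoint-merge and verification steps in this workflow all address one on-disk layout. A hedged sketch of how those paths are assembled, with the scheme read off the `verl.model_merger` commands above and the `exp_name`/`step` values here being just examples:

```shell
# Checkpoint layout used by the verl.model_merger steps in this workflow:
#   checkpoints/verl-test/<exp_name>/global_step_<n>/actor[/huggingface]
exp_name="qwen2.5-0.5b-function-reward-minimal-fsdp-size8"
step=1
local_dir="checkpoints/verl-test/${exp_name}/global_step_${step}/actor"
hf_dir="${local_dir}/huggingface"
echo "$local_dir"
echo "$hf_dir"
```

In the workflow, `local_dir` is passed to `python -m verl.model_merger test --backend fsdp --local_dir ...` and `hf_dir` to `--test_hf_dir ...`, which is why every training step that saves a checkpoint sets `VERL_EXP_NAME` to a unique experiment name.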
.github/workflows/e2e_ppo_trainer_megatron_vllm_2_ascend.yml ADDED
@@ -0,0 +1,233 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of test that are designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
28
+ # - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: e2e_ppo_trainer_megatron_vllm_2_ascend
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main branch.
37
+ # For push, for now only anti-patterns are specified so it is more conservative
38
+ # and achieves higher coverage.
39
+ push:
40
+ branches:
41
+ - main
42
+ - v0.*
43
+ paths:
44
+ - "**/*.py"
45
+ # Other entrypoints
46
+ - "!verl/trainer/fsdp_sft_trainer.py"
47
+ # FSDP
48
+ - "!verl/workers/**/*dp_*.py"
49
+ - "!verl/utils/fsdp_utils.py"
50
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
51
+ - "!verl/model_merger/fsdp_model_merger.py"
52
+ pull_request:
53
+ branches:
54
+ - main
55
+ - v0.*
56
+ paths:
57
+ - "**/*.py"
58
+ # Other entrypoints
59
+ - "!docker/**"
60
+ # Docs
61
+ - "!**/*.md"
62
+ - "!docs/**"
63
+ - "!examples/**"
64
+ - "!tests/**"
65
+ - "!verl/trainer/main_*.py"
66
+ - "!verl/trainer/fsdp_sft_trainer.py"
67
+ # FSDP
68
+ - "!verl/workers/**/*dp_*.py"
69
+ - "!verl/utils/fsdp_utils.py"
70
+ - "!verl/utils/checkpoint/fsdp_checkpoint_manager.py"
71
+ - "!verl/model_merger/fsdp_model_merger.py"
72
+ # Entrypoints
73
+ - ".github/workflows/e2e_ppo_trainer_megatron_vllm_2_ascend.yml"
74
+ - "examples/data_preprocess/gsm8k.py"
75
+ - "examples/data_preprocess/geo3k.py"
76
+ - "tests/special_e2e/run_ppo_trainer_megatron.sh"
77
+ - "verl/trainer/main_ppo.py"
78
+ - "verl/trainer/config/ppo_megatron_trainer.yaml"
79
+
80
+ # Cancel jobs on the same ref if a new one is triggered
81
+ concurrency:
82
+ group: ${{ github.workflow }}-${{ github.ref }}
83
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
84
+
85
+ # Declare permissions just read content.
86
+ permissions:
87
+ contents: read
88
+
89
+ jobs:
90
+ e2e_ppo_trainer_fsdp_vllm_ascend:
91
+ if: github.repository_owner == 'verl-project'
92
+ runs-on: linux-aarch64-a2b3-8
93
+ timeout-minutes: 90 # Increase this timeout value as needed
94
+ container:
95
+ image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
96
+ options: >-
97
+ --shm-size 16g
98
+ env:
99
+ HF_ENDPOINT: "https://hf-mirror.com"
100
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
101
+ steps:
102
+ - name: Check npu and CANN info
103
+ run: |
104
+ cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
105
+ npu-smi info
106
+ - name: Check initial pip list from image
107
+ run: |
108
+ pip list
109
+ - name: Checkout verl-project/verl repo
110
+ uses: actions/checkout@v4
111
+ with:
112
+ fetch-depth: 0
113
+ clean: true
114
+ - name: Install the current repository
115
+ run: |
116
+ pip install -r requirements-npu.txt
117
+ pip install --no-deps -e .
118
+ - name: Check final pip list
119
+ run: |
120
+ pip list
121
+ - name: Prepare weights
122
+ run: |
123
+ ln -s /root/.cache/models ~/models
124
+ - name: Prepare GSM8K dataset
125
+ run: |
126
+ python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
127
+ # Function RM
128
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm with validation and saving (DDP_SIZE=2, FSDP_SIZE=4)
129
+ run: |
130
+ ray stop --force
131
+ VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True FSDP_SIZE=4 USE_KL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-ddp-size2-fsdp-size4" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
132
+ - name: Test merging DDP+FSDP checkpoints (Qwen Actor)
133
+ run: |
134
+ exp_name="qwen2.5-0.5b-function-reward-minimal-ddp-size2-fsdp-size4"
135
+ python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
136
+ - name: Running GSM8K E2E training tests on 8 NPUs with rmpad using function rm with validation and saving (FSDP2)
137
+ run: |
138
+ ray stop --force
139
+ VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 SAVE_HF_MODEL=True VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal-fsdp2-size8" STRATEGY=fsdp2 bash tests/special_e2e/ppo_trainer/run_function_reward.sh
140
+ - name: Test merging FSDP2 checkpoints (Qwen Actor)
141
+ run: |
142
+ exp_name="qwen2.5-0.5b-function-reward-minimal-fsdp2-size8"
143
+ python -m verl.model_merger test --backend fsdp --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
144
+ - name: Running GSM8K E2E without rmpad using function rm
145
+ run: |
146
+ ray stop --force
147
+ RM_PAD=False bash tests/special_e2e/ppo_trainer/run_function_reward.sh
148
+ - name: Running GSM8K E2E training tests on 8 NPUs with rmpad using function rm (GRPO)
149
+ run: |
150
+ ray stop --force
151
+ CUSTOM_REWARD_FN=True ADV_ESTIMATOR=grpo USE_KL=True bash tests/special_e2e/ppo_trainer/run_function_reward.sh
152
+ - name: Running GSM8K E2E training tests on 8 NPUs with grpo lora using function rm with use_shm and layered_summon
153
+ run: |
154
+ ray stop --force
155
+ ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors LAYERED_SUMMON=True TOTAL_TRAIN_STEPS=1 SAVE_FREQ=1 FSDP_SIZE=4 VERL_EXP_NAME="qwen2.5-0.5b-function-reward-minimal" bash tests/special_e2e/ppo_trainer/run_function_reward.sh
156
+ - name: Test GRPO LoRA checkpoints merging function
157
+ run: |
158
+ export EXP_NAME="qwen2.5-0.5b-function-reward-minimal"
159
+ ls checkpoints/verl-test/${EXP_NAME}/global_step_1/actor
160
+ cat checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/huggingface/config.json
161
+ python3 -m verl.model_merger merge --backend fsdp --local_dir checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/ --target_dir checkpoints/verl-test/${EXP_NAME}/global_step_1/actor/huggingface
162
+ - name: Running GSM8K E2E training tests on 8 NPUs with grpo lora using function rm with use_shm and layered_summon with fsdp2
163
+ run: |
164
+ ray stop --force
165
+ ADV_ESTIMATOR=grpo USE_SHM=True LORA_RANK=32 LOAD_FORMAT=safetensors LAYERED_SUMMON=True STRATEGY=fsdp2 bash tests/special_e2e/ppo_trainer/run_function_reward.sh
166
+
167
+ e2e_ppo_trainer_fsdp-qwen2_5vl-3b_ascend:
168
+ if: github.repository_owner == 'verl-project'
169
+ runs-on: linux-aarch64-a2b3-8
170
+ timeout-minutes: 60 # Increase this timeout value as needed
171
+ container:
172
+ image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
173
+ options: >-
174
+ --shm-size 16g
175
+ env:
176
+ HF_ENDPOINT: "https://hf-mirror.com"
177
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
178
+ steps:
179
+ - name: Check npu and CANN info
180
+ run: |
181
+ cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
182
+ npu-smi info
183
+ - name: Check initial pip list from image
184
+ run: |
185
+ pip list
186
+ - name: Checkout verl-project/verl repo
187
+ uses: actions/checkout@v4
188
+ with:
189
+ fetch-depth: 0
190
+ clean: true
191
+ - name: Install the current repository
192
+ run: |
193
+ pip install -r requirements-npu.txt
194
+ pip install --no-deps -e .
195
+ pip install trl==0.26.0
196
+ - name: Check final pip list
197
+ run: |
198
+ pip list
199
+ - name: Prepare weights
200
+ run: |
201
+ ln -s /root/.cache/models ~/models
202
+ # Geo3k
203
+ - name: Prepare GEO3K dataset
204
+ run: |
205
+ python examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/.cache/datasets/hiyouga/geometry3k
206
+ - name: Running GEO3K VLM GRPO E2E training tests on 8 NPUs with rmpad using function rm
207
+ run: |
208
+ ray stop --force
209
+ TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
210
+ MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
211
+ MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
212
+ ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
213
+ SP_SIZE=2 \
214
+ bash tests/special_e2e/ppo_trainer/run_function_reward.sh
215
+ - name: Running GEO3K VLM PPO E2E training tests on 8 NPUs with rmpad using function rm
216
+ run: |
217
+ ray stop --force
218
+ TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
219
+ MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
220
+ MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
221
+ ADV_ESTIMATOR=gae RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
222
+ SP_SIZE=2 \
223
+ bash tests/special_e2e/ppo_trainer/run_function_reward.sh
224
+ - name: Running GEO3K VLM GRPO E2E lora training tests on 8 NPUs with rmpad using function rm
225
+ run: |
226
+ ray stop --force
227
+ TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
228
+ MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
229
+ MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct \
230
+ ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
231
+ SP_SIZE=2 \
232
+ LORA_RANK=32 LORA_EXCLUDE=".*visual.*" \
233
+ bash tests/special_e2e/ppo_trainer/run_function_reward.sh
.github/workflows/e2e_ppo_trainer_veomni_vllm.yml ADDED
@@ -0,0 +1,153 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, runs pytest on all test scripts without the `on_cpu.py` suffix.
28
+ # - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: e2e_ppo_trainer_veomni_vllm
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main branch.
37
+ # For push, for now only anti-patterns are specified so it is more conservative
38
+ # and achieves higher coverage.
39
+ push:
40
+ branches:
41
+ - main
42
+ - v0.*
43
+ paths:
44
+ - "**/*.py"
45
+ # Other entrypoints
46
+ - "!verl/trainer/fsdp_sft_trainer.py"
47
+ # Megatron
48
+ - "!verl/workers/**/megatron_*.py"
49
+ pull_request:
50
+ branches:
51
+ - main
52
+ - v0.*
53
+ paths:
54
+ - "**/*.py"
55
+ # Other entrypoints
56
+ - "!docker/**"
57
+ # Docs
58
+ - "!**/*.md"
59
+ - "!docs/**"
60
+ - "!examples/**"
61
+ - "!tests/**"
62
+ - "!verl/trainer/main_*.py"
63
+ - "!verl/trainer/fsdp_sft_trainer.py"
64
+ # Megatron
65
+ - "!verl/workers/**/megatron_*.py"
66
+ # Entrypoints
67
+ - ".github/workflows/e2e_ppo_trainer_veomni_vllm.yml"
68
+ - "examples/data_preprocess/gsm8k.py"
69
+ - "examples/data_preprocess/geo3k.py"
70
+ - "tests/special_e2e/run_ppo_trainer_veomni.sh"
71
+ - "verl/trainer/main_ppo.py"
72
+ - "verl/trainer/config/ppo_trainer.yaml"
73
+
74
+ # Cancel jobs on the same ref if a new one is triggered
75
+ concurrency:
76
+ group: ${{ github.workflow }}-${{ github.ref }}
77
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
78
+
79
+ # Declare read-only permission for repository contents.
80
+ permissions:
81
+ contents: read
82
+
83
+ env:
84
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
85
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
86
+
87
+ jobs:
88
+ setup:
89
+ if: github.repository_owner == 'verl-project'
90
+ runs-on: ubuntu-latest
91
+ outputs:
92
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
93
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
94
+ steps:
95
+ - uses: actions/checkout@v4
96
+ - id: create-runner
97
+ uses: volcengine/vemlp-github-runner@v1
98
+ with:
99
+ mode: "create"
100
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
101
+ mlp-image: "${{ env.IMAGE }}"
102
+
103
+ e2e_ppo_trainer_veomni_vllm:
104
+ needs: setup
105
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
106
+ timeout-minutes: 60 # Increase this timeout value as needed
107
+ env:
108
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
109
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
110
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
111
+ HF_ENDPOINT: "https://hf-mirror.com"
112
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
113
+ steps:
114
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
115
+ with:
116
+ fetch-depth: 0
117
+ - name: Install the current repository
118
+ run: |
119
+ pip3 install -r requirements-test.txt
120
+ pip3 install --no-deps -e .
121
+ pip3 install git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.4
122
+ - name: Prepare GSM8K dataset
123
+ run: |
124
+ ray stop --force
125
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
126
+ - name: Prepare GEO3K dataset
127
+ run: |
128
+ ray stop --force
129
+ python3 examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/models/hf_data/hiyouga/geometry3k/
130
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with veomni engine (FSDP_SIZE=4, USP=2)
131
+ run: |
132
+ ray stop --force
133
+ FSDP_SIZE=4 SP_SIZE=2 bash tests/special_e2e/run_ppo_trainer_veomni.sh
134
+ - name: Running GEO3K E2E training tests on 8 L20 GPUs with veomni engine (FSDP_SIZE=8, USP=1)
135
+ run: |
136
+ ray stop --force
137
+ MODEL_ID=Qwen/Qwen3-VL-2B-Instruct TRAIN_FILES=${HOME}/data/geo3k/train.parquet VAL_FILES=${HOME}/data/gsm8k/test.parquet FSDP_SIZE=8 SP_SIZE=1 bash tests/special_e2e/run_ppo_trainer_veomni.sh
138
+
139
+ cleanup:
140
+ runs-on: ubuntu-latest
141
+ needs:
142
+ [
143
+ setup,
144
+ e2e_ppo_trainer_veomni_vllm,
145
+ ]
146
+ if: always()
147
+ steps:
148
+ - id: destroy-runner
149
+ uses: volcengine/vemlp-github-runner@v1
150
+ with:
151
+ mode: "destroy"
152
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
153
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/e2e_sft_llm.yml ADDED
@@ -0,0 +1,153 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, runs pytest on all test scripts without the `on_cpu.py` suffix.
28
+ # - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: e2e_sft_llm
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main branch
37
+ push:
38
+ branches:
39
+ - main
40
+ - v0.*
41
+ pull_request:
42
+ branches:
43
+ - main
44
+ - v0.*
45
+ paths:
46
+ - "**/*.py"
47
+ # Other entrypoints
48
+ - "!examples/**"
49
+ - "!tests/**"
50
+ - "!verl/trainer/main_*.py"
51
+ - "!verl/trainer/fsdp_sft_trainer.py"
52
+
53
+ # Megatron
54
+ - "!verl/workers/**/megatron_*.py"
55
+ # Entrypoints
56
+ - ".github/workflows/e2e_sft_llm.yml"
57
+ - "examples/data_preprocess/gsm8k.py"
58
+ - "tests/special_e2e/sft"
59
+ - "verl/trainer/fsdp_sft_trainer.py"
60
+ - "verl/trainer/config/sft_trainer.yaml"
61
+
62
+ # Cancel jobs on the same ref if a new one is triggered
63
+ concurrency:
64
+ group: ${{ github.workflow }}-${{ github.ref }}
65
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
66
+
67
+ # Declare read-only permission for repository contents.
68
+ permissions:
69
+ contents: read
70
+
71
+ env:
72
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
73
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
74
+
75
+ jobs:
76
+ setup:
77
+ if: github.repository_owner == 'verl-project'
78
+ runs-on: ubuntu-latest
79
+ outputs:
80
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
81
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
82
+ steps:
83
+ - uses: actions/checkout@v4
84
+ - id: create-runner
85
+ uses: volcengine/vemlp-github-runner@v1
86
+ with:
87
+ mode: "create"
88
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
89
+ mlp-image: "${{ env.IMAGE }}"
90
+ e2e_sft_llm:
91
+ needs: setup
92
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
93
+ timeout-minutes: 30 # Increase this timeout value as needed
94
+ env:
95
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
96
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
97
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
98
+ HF_ENDPOINT: "https://hf-mirror.com"
99
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
100
+ steps:
101
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
102
+ with:
103
+ fetch-depth: 0
104
+ - name: Install the current repository
105
+ run: |
106
+ pip3 install peft
107
+ pip3 install -r requirements-test.txt
108
+ pip3 install --no-deps -e .
109
+ pip3 install git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.4
110
+ - name: Prepare gsm8k dataset
111
+ run: |
112
+ ray stop --force
113
+ python3 examples/data_preprocess/gsm8k_multiturn_sft.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
114
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with rmpad using function rm
115
+ run: |
116
+ ray stop --force
117
+ bash tests/special_e2e/sft/run_sft.sh
118
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs w/o rmpad using function rm
119
+ run: |
120
+ ray stop --force
121
+ RM_PAD=False bash tests/special_e2e/sft/run_sft.sh
122
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with sequence parallelism
123
+ run: |
124
+ ray stop --force
125
+ SP_SIZE=2 bash tests/special_e2e/sft/run_sft.sh
126
+ - name: Running GSM8K E2E training tests on 8 L20 GPUs with sequence parallelism and liger
127
+ run: |
128
+ ray stop --force
129
+ SP_SIZE=2 LIGER=True bash tests/special_e2e/sft/run_sft.sh
130
+ - name: Running GSM8K E2E training tests with LoRA
131
+ run: |
132
+ ray stop --force
133
+ LORA_RANK=32 bash tests/special_e2e/sft/run_sft.sh
134
+ - name: Run GSM8K E2E training and test resuming from the checkpoint manager
135
+ run: |
136
+ ray stop --force
137
+ LORA_RANK=32 RESUME_MODE=auto TOTAL_TRAIN_STEP=2 bash tests/special_e2e/sft/run_sft.sh
138
+ # TODO: multiturn
139
+ - name: Running GSM8K E2E training tests with multiturn and various configs and compare results
140
+ run: |
141
+ bash tests/special_e2e/sft/test_sft_engine_all.sh
142
+
143
+ cleanup:
144
+ runs-on: ubuntu-latest
145
+ needs: [setup, e2e_sft_llm]
146
+ if: always()
147
+ steps:
148
+ - id: destroy-runner
149
+ uses: volcengine/vemlp-github-runner@v1
150
+ with:
151
+ mode: "destroy"
152
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
153
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/e2e_sft_llm_ascend.yml ADDED
@@ -0,0 +1,160 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of test that are designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
28
+ # - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: e2e_sft_llm_ascend
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main branch
37
+ push:
38
+ branches:
39
+ - main
40
+ - v0.*
41
+ pull_request:
42
+ branches:
43
+ - main
44
+ - v0.*
45
+ paths:
46
+ - "**/*.py"
47
+ # Other entrypoints
48
+ - "!examples/**"
49
+ - "!tests/**"
50
+ - "!verl/trainer/main_*.py"
51
+ - "!verl/trainer/fsdp_sft_trainer.py"
52
+
53
+ # Megatron
54
+ - "!verl/workers/**/megatron_*.py"
55
+ # Entrypoints
56
+ - ".github/workflows/e2e_sft_llm_ascend.yml"
57
+ - "examples/data_preprocess/gsm8k.py"
58
+ - "tests/special_e2e/sft"
59
+ - "verl/trainer/fsdp_sft_trainer.py"
60
+ - "verl/trainer/config/sft_trainer.yaml"
61
+
62
+ # Cancel jobs on the same ref if a new one is triggered
63
+ concurrency:
64
+ group: ${{ github.workflow }}-${{ github.ref }}
65
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
66
+
67
+ # Declare permissions just read content.
68
+ permissions:
69
+ contents: read
70
+
71
+ jobs:
72
+ e2e_sft_llm_ascend:
73
+ if: github.repository_owner == 'verl-project'
74
+ runs-on: linux-aarch64-a2b3-8
75
+ timeout-minutes: 90 # Increase this timeout value as needed
76
+ container:
77
+ image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
78
+ options: >-
79
+ --shm-size 16g
80
+ env:
81
+ HF_ENDPOINT: "https://hf-mirror.com"
82
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
83
+ steps:
84
+ - name: Check npu and CANN info
85
+ run: |
86
+ cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
87
+ npu-smi info
88
+ - name: Check initial pip list from image
89
+ run: |
90
+ pip list
91
+ - name: Checkout verl-project/verl repo
92
+ uses: actions/checkout@v4
93
+ with:
94
+ fetch-depth: 0
95
+ clean: true
96
+ - name: Install the current repository
97
+ run: |
98
+ pip install -r requirements-npu.txt
99
+ pip install -e .
100
+ pip install git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.4
101
+ pip install pandas==2.3.3
102
+ pip uninstall -y mbridge
103
+ pip install git+https://github.com/ISEEKYAN/mbridge.git@89eb10
104
+ - name: Check final pip list
105
+ run: |
106
+ pip list
107
+ - name: Prepare weights
108
+ run: |
109
+ ln -s /root/.cache/models ~/models
110
+ - name: Prepare gsm8k dataset
111
+ run: |
112
+ python3 examples/data_preprocess/gsm8k_multiturn_sft.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
113
+ - name: Running GSM8K E2E training tests on 8 NPUs with rmpad using function rm
114
+ run: |
115
+ ray stop --force
116
+ bash tests/special_e2e/sft/run_sft.sh
117
+ - name: Running GSM8K E2E training tests on 8 NPUs w/o rmpad using function rm
118
+ run: |
119
+ ray stop --force
120
+ RM_PAD=False bash tests/special_e2e/sft/run_sft.sh
121
+ - name: Running GSM8K E2E training tests on 8 NPUs with sequence parallism
122
+ run: |
123
+ ray stop --force
124
+ SP_SIZE=2 bash tests/special_e2e/sft/run_sft.sh
125
+ - name: Running GSM8K E2E training tests with LoRA
126
+ run: |
127
+ ray stop --force
128
+ LORA_RANK=32 bash tests/special_e2e/sft/run_sft.sh
129
+ - name: Run GSM8K E2E training and resume tests resuming from the checkpoint manager
130
+ run: |
131
+ ray stop --force
132
+ LORA_RANK=32 RESUME_MODE=auto TOTAL_TRAIN_STEP=2 bash tests/special_e2e/sft/run_sft.sh
133
+ - name: Running GSM8K E2E training tests with multiturn and various configs and compare results
134
+ run: |
135
+ ray stop --force
136
+ rm -rf ~/verl/test/log
137
+ mkdir -p ~/verl/test/log
138
+ export VERL_FILE_LOGGER_ROOT=~/verl/test/log
139
+ # test with single gpu as golden
140
+ echo "run with single gpu as golden"
141
+ BACKEND=fsdp SP_SIZE=1 FSDP_SIZE=1 NUM_GPUS=1 FSDP_STRATEGY=fsdp VERL_FILE_LOGGER_PATH=~/verl/test/log/golden.jsonl bash tests/special_e2e/sft/run_sft_engine.sh
142
+ # test with fsdp 1
143
+ echo "run with sp2 fsdp_size2 num_gpus8 fsdp_strategy fsdp pad_mode no_padding"
144
+ BACKEND=fsdp SP_SIZE=2 FSDP_SIZE=2 NUM_GPUS=8 FSDP_STRATEGY=fsdp PAD_MODE=no_padding bash tests/special_e2e/sft/run_sft_engine.sh
145
+ # test with fsdp 1 use_remove_padding and pad_mode no_padding
146
+ echo "run with sp1 fsdp_size-1 num_gpus8 fsdp_strategy fsdp pad_mode no_padding use_remove_padding False"
147
+ BACKEND=fsdp SP_SIZE=1 FSDP_SIZE=-1 NUM_GPUS=8 FSDP_STRATEGY=fsdp PAD_MODE=no_padding USE_REMOVE_PADDING=False bash tests/special_e2e/sft/run_sft_engine.sh
148
+ # test with fsdp 2
149
+ echo "run with sp2 fsdp_size2 num_gpus8 fsdp_strategy fsdp2"
150
+ BACKEND=fsdp SP_SIZE=2 FSDP_SIZE=2 NUM_GPUS=8 FSDP_STRATEGY=fsdp2 bash tests/special_e2e/sft/run_sft_engine.sh
151
+ # test with veomni
152
+ echo "run with sp2 fsdp_size4 num_gpus8 fsdp_strategy fsdp2"
153
+ BACKEND=veomni SP_SIZE=2 FSDP_SIZE=4 NUM_GPUS=8 FSDP_STRATEGY=fsdp2 bash tests/special_e2e/sft/run_sft_engine.sh
154
+ # test with megatron
155
+ echo "run with tp2 pp2 vpp NULL cp2 num_gpus8"
156
+ BACKEND=megatron TP_SIZE=2 PP_SIZE=2 VPP_SIZE=NULL CP_SIZE=2 NUM_GPUS=8 bash tests/special_e2e/sft/run_sft_engine.sh
157
+ # test with cp in ray
158
+ echo "run with tp2 pp2 vpp NULL cp2 num_gpus8 mode=ray"
159
+ BACKEND=megatron TP_SIZE=2 PP_SIZE=2 VPP_SIZE=NULL CP_SIZE=2 NUM_GPUS=8 mode=ray bash tests/special_e2e/sft/run_sft_engine.sh
160
+ rm -rf ~/verl/test/log
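The multi-config steps above all log per-step metrics to JSONL files under `VERL_FILE_LOGGER_ROOT`, with the single-GPU run written to `golden.jsonl` as the reference. As a rough illustration of that golden-comparison idea (not verl's actual comparison code; the `loss` field name and the tolerance are assumptions), such a check could look like:

```python
# Hypothetical sketch: compare a run's JSONL metric log against a
# single-GPU "golden" log within a relative tolerance.
import json


def load_metrics(path, key="loss"):
    """Read one metric value per JSON line from a file-logger output."""
    with open(path) as f:
        return [json.loads(line)[key] for line in f if line.strip()]


def matches_golden(golden_path, run_path, key="loss", rtol=1e-2):
    """True if every step's metric is within relative tolerance of golden."""
    golden = load_metrics(golden_path, key)
    run = load_metrics(run_path, key)
    if len(golden) != len(run):
        return False  # runs must log the same number of steps
    return all(
        abs(g - r) <= rtol * max(abs(g), 1e-8)
        for g, r in zip(golden, run)
    )
```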
.github/workflows/e2e_sft_vlm.yml ADDED
@@ -0,0 +1,128 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
28
+ # - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: e2e_sft_vlm
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main branch
37
+ push:
38
+ branches:
39
+ - main
40
+ - v0.*
41
+ pull_request:
42
+ branches:
43
+ - main
44
+ - v0.*
45
+ paths:
46
+ - "**/*.py"
47
+ # Other entrypoints
48
+ - "!examples/**"
49
+ - "!tests/**"
50
+ - "!verl/trainer/main_*.py"
51
+ - "!verl/trainer/fsdp_sft_trainer.py"
52
+
53
+ # Megatron
54
+ - "!verl/workers/**/megatron_*.py"
55
+ # Entrypoints
56
+ - ".github/workflows/e2e_sft_vlm.yml"
57
+ - "examples/data_preprocess/gsm8k.py"
58
+ - "tests/special_e2e/sft"
59
+ - "verl/trainer/fsdp_sft_trainer.py"
60
+ - "verl/trainer/config/sft_trainer.yaml"
61
+
62
+ # Cancel jobs on the same ref if a new one is triggered
63
+ concurrency:
64
+ group: ${{ github.workflow }}-${{ github.ref }}
65
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
66
+
67
+ # Declare read-only permission for repository contents.
68
+ permissions:
69
+ contents: read
70
+
71
+ env:
72
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
73
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
74
+
75
+ jobs:
76
+ setup:
77
+ if: github.repository_owner == 'verl-project'
78
+ runs-on: ubuntu-latest
79
+ outputs:
80
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
81
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
82
+ steps:
83
+ - uses: actions/checkout@v4
84
+ - id: create-runner
85
+ uses: volcengine/vemlp-github-runner@v1
86
+ with:
87
+ mode: "create"
88
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
89
+ mlp-image: "${{ env.IMAGE }}"
90
+ e2e_sft_vlm:
91
+ needs: setup
92
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
93
+ timeout-minutes: 30 # Increase this timeout value as needed
94
+ env:
95
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
96
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
97
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
98
+ HF_ENDPOINT: "https://hf-mirror.com"
99
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
100
+ steps:
101
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
102
+ with:
103
+ fetch-depth: 0
104
+ - name: Install the current repository
105
+ run: |
106
+ pip3 install peft
107
+ pip3 install -r requirements-test.txt
108
+ pip3 install --no-deps -e .
109
+ pip3 install git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.4
110
+ - name: Prepare pokemon-gpt4o-captions dataset
111
+ run: |
112
+ ray stop --force
113
+ python3 examples/data_preprocess/pokemon.py --local_dataset_path ${HOME}/models/hf_data/pokemon-gpt4o-captions
114
+ - name: Running Pokemon E2E training tests with multiturn and various configs and compare results
115
+ run: |
116
+ MODEL_ID=Qwen/Qwen3-VL-2B-Instruct DATASET_DIR=~/data/pokemon-gpt4o-captions VPP_SIZE=null bash tests/special_e2e/sft/test_sft_engine_all.sh
117
+
118
+ cleanup:
119
+ runs-on: ubuntu-latest
120
+ needs: [setup, e2e_sft_vlm]
121
+ if: always()
122
+ steps:
123
+ - id: destroy-runner
124
+ uses: volcengine/vemlp-github-runner@v1
125
+ with:
126
+ mode: "destroy"
127
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
128
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/gpu_unit_tests.yml ADDED
@@ -0,0 +1,137 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of test that are designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
28
+ # - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: GPU unit tests
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main branch
37
+ push:
38
+ branches:
39
+ - main
40
+ - v0.4.x
41
+ paths:
42
+ - "**/*.py"
43
+ - .github/workflows/gpu_unit_tests.yml
44
+ pull_request:
45
+ branches:
46
+ - main
47
+ - v0.4.x
48
+ paths:
49
+ # The order that you define paths patterns matters:
50
+ # A matching negative pattern (prefixed with !) after a positive match will exclude the path.
51
+ # A matching positive pattern after a negative match will include the path again.
52
+ - "**/*.py"
53
+ # Other entrypoints
54
+ - "!examples/**"
55
+ - "!verl/trainer/main_*.py"
56
+ - "!verl/trainer/fsdp_sft_trainer.py"
57
+ # Entrypoints
58
+ - .github/workflows/gpu_unit_tests.yml
59
+ - "tests/**test_*.py"
60
+ # Ignore CPU tests
61
+ - "!tests/*_on_cpu.py"
62
+
63
+ # Cancel jobs on the same ref if a new one is triggered
64
+ concurrency:
65
+ group: ${{ github.workflow }}-${{ github.ref }}
66
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
67
+
68
+ # Declare permissions just read content.
69
+ permissions:
70
+ contents: read
71
+
72
+ env:
73
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
74
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
75
+
76
+ jobs:
77
+ setup:
78
+ if: github.repository_owner == 'verl-project'
79
+ runs-on: ubuntu-latest
80
+ outputs:
81
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
82
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
83
+ steps:
84
+ - uses: actions/checkout@v4
85
+ - id: create-runner
86
+ uses: volcengine/vemlp-github-runner@v1
87
+ with:
88
+ mode: "create"
89
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
90
+ mlp-image: "${{ env.IMAGE }}"
91
+
92
+ gpu_unit_tests:
93
+ if: github.repository_owner == 'verl-project'
94
+ needs: setup
95
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
96
+ timeout-minutes: 60 # Increase this timeout value as needed
97
+ env:
98
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
99
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
100
+ NO_PROXY: "localhost,127.0.0.1"
101
+ HF_HUB_ENABLE_HF_TRANSFER: 1
102
+ steps:
103
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
104
+ with:
105
+ fetch-depth: 0
106
+ - name: Install the current repository
107
+ run: |
108
+ pip3 install hf_transfer
109
+ pip3 install -r requirements-test.txt
110
+ pip3 install --no-deps -e .
111
+ pip3 install cupy-cuda12x==13.6.0 pytest-asyncio
112
+ pip3 install --ignore-installed blinker
113
+ pip3 install --ignore-installed mlflow "numpy<2.0"
114
+ - name: Run all GPU unit tests
115
+ run: |
116
+ pytest -s -x --ignore-glob="*on_npu.py" --ignore-glob="*test_special_*.py" --ignore-glob='*on_cpu.py' --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob='tests/special*' --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" --ignore-glob="*test_shared_memory*" --ignore-glob="tests/workers/rollout/rollout_trtllm" --ignore-glob="*test_bucketed_weight_transfer*" tests/
117
+ - name: Testing LinearCrossEntropyTP Correctness, Computation Time and Memory Consumption
118
+ run: |
119
+ LOW_MEMORY=True torchrun --standalone --nnodes=1 --nproc-per-node=8 tests/utils/test_special_linear_cross_entropy_tp.py
120
+ - name: Testing FSDP2 actor functionality
121
+ run: |
122
+ torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/actor/test_special_dp_actor.py
123
+ - name: Testing FSDP2 critic functionality
124
+ run: |
125
+ torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/critic/test_special_dp_critic.py
126
+
127
+ cleanup:
128
+ runs-on: ubuntu-latest
129
+ needs: [setup, gpu_unit_tests]
130
+ if: always()
131
+ steps:
132
+ - id: destroy-runner
133
+ uses: volcengine/vemlp-github-runner@v1
134
+ with:
135
+ mode: "destroy"
136
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
137
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
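The CPU/GPU split described in the workflow comments above is driven purely by file naming. A minimal sketch of that selection rule (the helper and the file list below are illustrative, not part of the repo):

```python
import fnmatch

def split_tests(files):
    """Partition test files per the naming convention: an `on_cpu.py` suffix
    means CPU-only; everything else (outside the special_* suites, which run
    in their own workflows) needs a GPU."""
    cpu = [f for f in files if f.endswith("on_cpu.py")]
    gpu = [
        f
        for f in files
        if not f.endswith("on_cpu.py")
        and not fnmatch.fnmatch(f, "tests/special_*")
    ]
    return cpu, gpu

files = [
    "tests/trainer/test_config_on_cpu.py",   # CPU sanity test
    "tests/models/test_transformer.py",      # GPU unit test
    "tests/special_e2e/test_sft_engine.py",  # excluded: has a dedicated workflow
]
cpu, gpu = split_tests(files)
print(cpu)  # ['tests/trainer/test_config_on_cpu.py']
print(gpu)  # ['tests/models/test_transformer.py']
```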
.github/workflows/model.yml ADDED
@@ -0,0 +1,184 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with a `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ #    - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+ #    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix.
+ #    - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
+ #      - a new workflow yaml is added to `.github/workflows`
+ #      - new tests are added to the workflows mentioned in 2.
+
+ name: model
+
+ on:
+   # Trigger the workflow on push or pull request,
+   # but only for the main branch
+   push:
+     branches:
+       - main
+       - v0.*
+   pull_request:
+     branches:
+       - main
+       - v0.*
+     paths:
+       - "verl/**/*.py"
+       # Entrypoints
+       - ".github/workflows/model.yml"
+       - "tests/special_distributed/test_fsdp_ckpt.py"
+       - "tests/special_distributed/test_tensor_dict.py"
+       - "tests/models/**"
+       - "tests/special_distributed/run_all.sh"
+
+ # Declare permissions: just read contents.
+ permissions:
+   contents: read
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+   group: ${{ github.workflow }}-${{ github.ref }}
+   cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ env:
+   IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
+   DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
+
+ jobs:
+   setup:
+     if: github.repository_owner == 'verl-project'
+     runs-on: ubuntu-latest
+     outputs:
+       runner-label: ${{ steps.create-runner.outputs.runner-label }}
+       mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
+     steps:
+       - uses: actions/checkout@v4
+       - id: create-runner
+         uses: volcengine/vemlp-github-runner@v1
+         with:
+           mode: "create"
+           faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+           mlp-image: "${{ env.IMAGE }}"
+
+   model_rmpad:
+     needs: setup
+     runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+     timeout-minutes: 20 # Increase this timeout value as needed
+     env:
+       HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+       HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+       NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+       HF_ENDPOINT: "https://hf-mirror.com"
+       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+     steps:
+       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+         with:
+           fetch-depth: 0
+       - name: Install the current repository and upgrade to latest transformers (<5.0.0)/flash_attn; transformers 4.55.0 has strange behavior with model backward
+         run: |
+           pip3 install -r requirements-test.txt
+           pip3 install --no-deps -e .
+           pip3 install --upgrade "transformers<5.0.0"
+       - name: Running rmpad model tests on 8 L20 GPUs + flash_attn 2.5.8
+         run: |
+           pytest -s tests/models/test_transformer.py
+       - name: Running rmpad model tests on 8 L20 GPUs + latest flash_attn
+         run: |
+           pytest -s tests/models/test_transformer.py
+       - name: Running FSDP rmpad model tests on 8 L20 GPUs + latest flash_attn
+         run: |
+           STRATEGY=fsdp torchrun --nproc_per_node=8 tests/special_distributed/test_fsdp_ckpt.py
+       - name: Running transformers ulysses tests on 8 L20 GPUs + latest transformers
+         run: |
+           torchrun --nproc_per_node=8 -m pytest tests/models/test_transformers_ulysses.py
+       - name: Running transformers ulysses tests on 8 L20 GPUs + transformers 4.54.1
+         run: |
+           pip3 install transformers==4.54.1
+           torchrun --nproc_per_node=8 -m pytest tests/models/test_transformers_ulysses.py
+       - name: Run distributed test
+         run: |
+           bash tests/special_distributed/run_all.sh
+
+   # TODO: Move this back to model_rmpad once FSDP2 is stable.
+   # NOTE: Listed as an independent job to make reruns easier.
+   model_rmpad_fsdp2_unstable:
+     needs: setup
+     runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+     timeout-minutes: 20 # Increase this timeout value as needed
+     env:
+       HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+       HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+       NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+       HF_ENDPOINT: "https://hf-mirror.com"
+       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+     steps:
+       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+         with:
+           fetch-depth: 0
+       - name: Install the current repository and upgrade to latest transformers/flash_attn
+         run: |
+           pip3 install -r requirements-test.txt
+           pip3 install --no-deps -e .
+       - name: Running FSDP2 rmpad model tests on 8 L20 GPUs + latest flash_attn
+         run: |
+           STRATEGY=fsdp2 torchrun --nproc_per_node=8 tests/special_distributed/test_fsdp_ckpt.py
+
+   model_engine:
+     needs: setup
+     runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+     timeout-minutes: 20 # Increase this timeout value as needed
+     env:
+       HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+       HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+       NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+       HF_ENDPOINT: "https://hf-mirror.com"
+       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+     steps:
+       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+         with:
+           fetch-depth: 0
+       - name: Install the current repository
+         run: |
+           pip3 install -r requirements-test.txt
+           pip3 install --no-deps -e .
+       - name: Download model config files
+         run: |
+           hf download Qwen/Qwen2.5-0.5B-Instruct --local-dir $HOME/models/Qwen/Qwen2.5-0.5B-Instruct
+       - name: Running mcore engine tests on 8 L20 GPUs
+         run: |
+           ray stop --force
+           pytest -s -x tests/models/test_engine.py
+
+   cleanup:
+     runs-on: ubuntu-latest
+     needs: [setup, model_rmpad, model_rmpad_fsdp2_unstable, model_engine]
+     if: always()
+     steps:
+       - id: destroy-runner
+         uses: volcengine/vemlp-github-runner@v1
+         with:
+           mode: "destroy"
+           faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+           mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/model_ascend.yml ADDED
@@ -0,0 +1,137 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with a `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ #    - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+ #    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix.
+ #    - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
+ #      - a new workflow yaml is added to `.github/workflows`
+ #      - new tests are added to the workflows mentioned in 2.
+
+ name: model_ascend
+
+ on:
+   # Trigger the workflow on push or pull request,
+   # but only for the main branch
+   push:
+     branches:
+       - main
+       - v0.*
+   pull_request:
+     branches:
+       - main
+       - v0.*
+     paths:
+       - "verl/**/*.py"
+       # Entrypoints
+       - ".github/workflows/model_ascend.yml"
+       - "tests/special_distributed/test_fsdp_ckpt.py"
+       - "tests/special_distributed/test_tensor_dict.py"
+       - "tests/models/**"
+       - "tests/special_distributed/run_all.sh"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+   group: ${{ github.workflow }}-${{ github.ref }}
+   cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ permissions:
+   contents: read
+
+ jobs:
+   model_rmpad_ascend:
+     if: github.repository_owner == 'verl-project'
+     runs-on: linux-aarch64-a2b3-8
+     timeout-minutes: 60 # Increase this timeout value as needed
+     container:
+       image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+       options: >-
+         --shm-size 16g
+     env:
+       HF_ENDPOINT: "https://hf-mirror.com"
+       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+     steps:
+       - name: Check npu and CANN info
+         run: |
+           cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+           npu-smi info
+       - name: Check initial pip list from image
+         run: |
+           pip list
+       - name: Checkout verl-project/verl repo
+         uses: actions/checkout@v4
+         with:
+           fetch-depth: 0
+           clean: true
+       - name: Install the current repository
+         run: |
+           pip install -r requirements-npu.txt
+           pip install --no-deps -e .[test]
+       - name: Check final pip list
+         run: |
+           pip list
+       - name: Prepare weights
+         run: |
+           ln -s /root/.cache/models ~/models
+       - name: Running rmpad model tests on 8 NPUs
+         run: |
+           pytest -s tests/models/test_transformer.py
+       - name: Running FSDP rmpad model tests on 8 NPUs
+         run: |
+           STRATEGY=fsdp torchrun --nproc_per_node=8 tests/special_distributed/test_fsdp_ckpt.py
+       - name: Running transformers ulysses tests on 8 NPUs
+         run: |
+           torchrun --nproc_per_node=8 -m pytest tests/models/test_transformers_ulysses.py
+       - name: Run distributed test
+         run: |
+           bash tests/special_distributed/run_all.sh
+
+   # TODO: Move this back to model_rmpad once FSDP2 is stable.
+   # NOTE: Listed as an independent job to make reruns easier.
+   model_rmpad_fsdp2_unstable_ascend:
+     if: github.repository_owner == 'verl-project'
+     runs-on: linux-aarch64-a2b3-8
+     timeout-minutes: 60
+     container:
+       image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+       options: >-
+         --shm-size 16g
+     env:
+       HF_ENDPOINT: "https://hf-mirror.com"
+       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+     steps:
+       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+         with:
+           fetch-depth: 0
+       - name: Install the current repository
+         run: |
+           pip install -r requirements-npu.txt
+           pip install --no-deps -e .[test]
+       - name: Prepare weights
+         run: |
+           ln -s /root/.cache/models ~/models
+       - name: Running FSDP2 rmpad model tests on 8 NPUs
+         run: |
+           STRATEGY=fsdp2 torchrun --nproc_per_node=8 tests/special_distributed/test_fsdp_ckpt.py
.github/workflows/nightly_ascend.yml ADDED
@@ -0,0 +1,174 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with a `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with a GPU available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` suffix are run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ #    - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+ #    - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix.
+ #    - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
+ #      - a new workflow yaml is added to `.github/workflows`
+ #      - new tests are added to the workflows mentioned in 2.
+
+ name: nightly_ci_ascend
+
+ on:
+   # Run the nightly Ascend CI on a daily schedule (17:00 UTC).
+   schedule:
+     - cron: "0 17 * * *"
+
+ # Declare permissions: just read contents.
+ permissions:
+   contents: read
+
+ jobs:
+   # Test ppo qwen3-8b fsdp+vllm
+   nightlyCI_ppo-qwen3-8b-fsdp-vllm_ascend:
+     if: github.repository_owner == 'verl-project'
+     runs-on: linux-aarch64-a2b3-8
+     timeout-minutes: 180 # Increase this timeout value as needed
+     container:
+       image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+       options: >-
+         --shm-size 16g
+     env:
+       HF_ENDPOINT: "https://hf-mirror.com"
+       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+     steps:
+       - name: Check npu and CANN info
+         run: |
+           cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+           npu-smi info
+       - name: Check initial pip list from image
+         run: |
+           pip list
+       - name: Checkout verl-project/verl repo
+         uses: actions/checkout@v4
+         with:
+           fetch-depth: 0
+           clean: true
+       - name: Install the current repository
+         run: |
+           pip install -r requirements-npu.txt
+           pip install --no-deps -e .
+       - name: Check final pip list
+         run: |
+           pip list
+       - name: Prepare weights
+         run: |
+           ln -s /root/.cache/models ~/models
+       - name: Prepare GSM8K dataset
+         run: |
+           python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
+       - name: Running nightlyCI_ppo-qwen3-8b-fsdp-vllm_ascend
+         run: |
+           ray stop --force
+           bash tests/special_npu/nightly_ci_ascend/run_ppo_qwen3-8b_fsdp_npu.sh
+
+   # Test grpo qwen25-7b-Instruct fsdp+vllm
+   nightlyCI_grpo-qwen25-7b-Instruct-fsdp-vllm_ascend:
+     if: github.repository_owner == 'verl-project'
+     runs-on: linux-aarch64-a2b3-8
+     timeout-minutes: 180 # Increase this timeout value as needed
+     container:
+       image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+       options: >-
+         --shm-size 16g
+     env:
+       HF_ENDPOINT: "https://hf-mirror.com"
+       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+     steps:
+       - name: Check npu and CANN info
+         run: |
+           cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+           npu-smi info
+       - name: Check initial pip list from image
+         run: |
+           pip list
+       - name: Checkout verl-project/verl repo
+         uses: actions/checkout@v4
+         with:
+           fetch-depth: 0
+           clean: true
+       - name: Install the current repository
+         run: |
+           pip install -r requirements-npu.txt
+           pip install --no-deps -e .
+       - name: Check final pip list
+         run: |
+           pip list
+       - name: Prepare weights
+         run: |
+           ln -s /root/.cache/models ~/models
+       - name: Prepare GSM8K dataset
+         run: |
+           python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
+       - name: Running nightlyCI_grpo-qwen25-7b-Instruct-fsdp-vllm_ascend
+         run: |
+           ray stop --force
+           bash tests/special_npu/nightly_ci_ascend/run_grpo_qwen25-7b-instruct_fsdp_npu.sh
+
+   # Test grpo qwen25-vl-3b-Instruct fsdp+vllm
+   nightlyCI_grpo-qwen25-vl-3b-Instruct-fsdp-vllm_ascend:
+     if: github.repository_owner == 'verl-project'
+     runs-on: linux-aarch64-a2b3-8
+     timeout-minutes: 180 # Increase this timeout value as needed
+     container:
+       image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+       options: >-
+         --shm-size 16g
+     env:
+       HF_ENDPOINT: "https://hf-mirror.com"
+       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+     steps:
+       - name: Check npu and CANN info
+         run: |
+           cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+           npu-smi info
+       - name: Check initial pip list from image
+         run: |
+           pip list
+       - name: Checkout verl-project/verl repo
+         uses: actions/checkout@v4
+         with:
+           fetch-depth: 0
+           clean: true
+       - name: Install the current repository
+         run: |
+           pip install -r requirements-npu.txt
+           pip install --no-deps -e .
+       - name: Check final pip list
+         run: |
+           pip list
+       - name: Prepare weights
+         run: |
+           ln -s /root/.cache/models ~/models
+       - name: Preprocess geo3k dataset
+         run: |
+           python examples/data_preprocess/geo3k.py --local_dataset_path ${HOME}/.cache/datasets/hiyouga/geometry3k
+       - name: Running nightlyCI_grpo-qwen25-vl-3b-Instruct-fsdp-vllm_ascend
+         run: |
+           ray stop --force
+           bash tests/special_npu/nightly_ci_ascend/run_grpo_qwen25-vl-3b-instruct_fsdp_npu.sh
.github/workflows/npu_unit_tests.yml ADDED
@@ -0,0 +1,126 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of test that are designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
28
+ # - `npu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix on ascend device.
29
+ # - Since cpu/gpu/npu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
30
+ # - new workflow yaml is added to `.github/workflows`
31
+ # - new tests are added to workflow mentioned in 2.
32
+
33
+ name: NPU unit tests
34
+
35
+ on:
36
+ # Trigger the workflow on push or pull request,
37
+ # but only for the main branch
38
+ push:
39
+ branches:
40
+ - main
41
+ - v0.*
42
+ paths:
43
+ - "**/*.py"
44
+ - .github/workflows/npu_unit_tests.yml
45
+ pull_request:
46
+ branches:
47
+ - main
48
+ paths:
49
+ # The order that you define paths patterns matters:
50
+ # A matching negative pattern (prefixed with !) after a positive match will exclude the path.
51
+ # A matching positive pattern after a negative match will include the path again.
52
+ - "**/*.py"
53
+ # Other entrypoints
54
+ - "!examples/**"
55
+ - "!verl/trainer/main_*.py"
56
+ - "!verl/trainer/fsdp_sft_trainer.py"
57
+ - "!recipe/**"
58
+ # Entrypoints
59
+ - .github/workflows/npu_unit_tests.yml
60
+ - "tests/**test_*.py"
61
+ # Ignore CPU tests
62
+ - "!tests/*_on_cpu.py"
63
+
64
+ # Cancel jobs on the same ref if a new one is triggered
65
+ concurrency:
66
+ group: ${{ github.workflow }}-${{ github.ref }}
67
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare read-only permissions.
+ permissions:
+ contents: read
+
+ jobs:
+ npu_unit_tests:
+ if: github.repository_owner == 'verl-project'
+ runs-on: linux-aarch64-a2b3-8
+ timeout-minutes: 60 # Increase this timeout value as needed
+ container:
+ image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+ options: >-
+ --shm-size 16g
+ env:
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ steps:
+ - name: Check NPU and CANN info
+ run: |
+ cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+ npu-smi info
+ - name: Check initial pip list from image
+ run: |
+ pip list
+ - name: Checkout volcengine/verl repo
+ uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+ clean: true
+ - name: Install the current repository
+ run: |
+ pip install -r requirements-npu.txt
+ pip install --no-deps -e .[test]
+ pip install mlflow pytest-asyncio
+ - name: Check final pip list
+ run: |
+ pip list
+ - name: Prepare weights
+ run: |
+ ln -s /root/.cache/models ~/models
+ - name: Run all NPU unit tests
+ run: |
+ pytest -s -x --ignore-glob="*test_special_*.py" --ignore-glob="*on_cpu.py" --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob="tests/special*" --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" --ignore-glob="*test_rvdz*" --ignore-glob="*test_ray_collectives*" --ignore-glob="*test_nvtx_profile*" --ignore-glob="tests/checkpoint_engine" --ignore-glob="*test_shared_memory*" --ignore-glob="tests/workers/rollout/rollout_trtllm" --ignore-glob="*test_fsdp_lora_merge*" --ignore-glob="*test_activation_offload*" --ignore-glob="*test_normalize_peft_param_name.py*" tests/
+ - name: Testing activation offload
+ run: |
+ pytest -s -x tests/utils/test_activation_offload.py
+ - name: Testing normalize peft param name
+ run: |
+ pytest -s -x tests/utils/test_normalize_peft_param_name.py
+ - name: Testing FSDP2 actor functionality
+ run: |
+ torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/actor/test_special_dp_actor.py
+ - name: Testing FSDP2 critic functionality
+ run: |
+ torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/critic/test_special_dp_critic.py
+ - name: Running NPU profiling unit tests
+ run: |
+ pytest -s -x tests/utils/test_special_mstx_profile.py
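The long `--ignore-glob` list above works by pattern-matching collected file paths. A minimal sketch of the matching semantics (file names here are illustrative, and pytest's real collection logic is more involved):

```python
# Illustrative sketch of ignore-glob matching, not pytest's actual
# implementation: a path is skipped if any pattern matches it.
from fnmatch import fnmatch

patterns = ["*test_special_*.py", "*on_cpu.py", "*test_vllm*"]

def is_ignored(path: str) -> bool:
    # fnmatch's "*" also matches "/", so "*test_vllm*" matches the
    # substring anywhere in the path, similar to pytest's --ignore-glob.
    return any(fnmatch(path, p) for p in patterns)

print(is_ignored("tests/rollout/test_vllm_spmd.py"))        # → True
print(is_ignored("tests/utils/test_activation_offload.py")) # → False
```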
.github/workflows/pre-commit.yml ADDED
@@ -0,0 +1,41 @@
+ # c.f. https://github.com/pre-commit/action?tab=readme-ov-file#using-this-action
+ name: pre-commit
+
+ # No need to avoid / cancel lightweight pre-commit jobs
+ on:
+ schedule:
+ - cron: "0 0 * * 0"
+ pull_request:
+ push:
+ branches:
+ - main
+ - v0.*
+ # Allow manual triggering
+ workflow_dispatch:
+
+ # Declare read-only permissions.
+ permissions:
+ contents: read
+
+ jobs:
+ pre-commit:
+ runs-on: ubuntu-latest
+ strategy:
+ matrix:
+ python-version: ["3.12"]
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ - name: Set up Python ${{ matrix.python-version }}
+ uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
+ with:
+ python-version: ${{ matrix.python-version }}
+ - name: Install the current repository
+ run: |
+ pip install pre-commit hydra-core
+ pip install --no-deps -e .
+ - name: Set ruff --output-format=github
+ run: |
+ sed -i 's/--output-format=full/--output-format=github/' .pre-commit-config.yaml
+ git add .pre-commit-config.yaml
+ # Check "--all-files" by default
+ - uses: pre-commit/action@v3.0.1
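The `sed` step above swaps ruff's output format so lint findings render as GitHub annotations. A minimal local sketch of that substitution, run against a throwaway file rather than the real `.pre-commit-config.yaml`:

```shell
# Demonstrate the substitution on a temporary copy (path is illustrative).
printf 'args: [--output-format=full]\n' > /tmp/precommit-demo.yaml
sed -i 's/--output-format=full/--output-format=github/' /tmp/precommit-demo.yaml
cat /tmp/precommit-demo.yaml   # → args: [--output-format=github]
```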
.github/workflows/precommit-autofix.yml ADDED
@@ -0,0 +1,52 @@
+ name: scheduled pre-commit autofix
+
+ on:
+ schedule:
+ # Every hour
+ - cron: "0 * * * *"
+ workflow_dispatch:
+
+ permissions:
+ contents: write
+ pull-requests: write
+
+ jobs:
+ precommit:
+ if: github.repository_owner == 'verl-project'
+ runs-on: ubuntu-latest
+
+ steps:
+ - name: Checkout repository
+ uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+
+ - name: Set up Python
+ uses: actions/setup-python@v5
+ with:
+ python-version: "3.10"
+
+ - name: Install pre-commit
+ run: |
+ python -m pip install --upgrade pip
+ pip install pre-commit hydra-core
+
+ - name: Run pre-commit
+ run: |
+ pre-commit run --all-files || true
+
+ - name: Create or update PR
+ uses: peter-evans/create-pull-request@v6
+ with:
+ branch: bot/precommit-autofix
+ delete-branch: true
+ title: "[ci] chore: scheduled pre-commit autofix"
+ commit-message: "chore: auto-fix pre-commit issues"
+ body: |
+ This PR was created automatically by a scheduled GitHub Action.
+
+ - Runs `pre-commit run --all-files`
+ - Triggered hourly
+ labels: |
+ automated
+ pre-commit
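The `pre-commit run --all-files || true` step above deliberately swallows the hook exit code: pre-commit exits non-zero when hooks modify files, but here the modifications themselves are the desired output for the follow-up PR step. The `|| true` idiom in isolation:

```shell
# "cmd || true" forces a zero exit status even when cmd fails,
# so the workflow step is not marked as failed.
false || true
echo "exit code: $?"   # → exit code: 0
```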
.github/workflows/reward_model_sglang.yml ADDED
@@ -0,0 +1,134 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with GPUs available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` suffix run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ # - `cpu_unit_tests.yml` runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+ # - `gpu_unit_tests.yml` runs pytest on all test scripts without the `on_cpu.py` suffix.
+ # - Since CPU/GPU unit tests run all tests under `tests` by default, please make sure tests are manually excluded from them when
+ # - a new workflow yaml is added to `.github/workflows`
+ # - new tests are added to the workflows mentioned in 2.
+
+ name: reward_model_sglang
+
+ on:
+ # Trigger the workflow on push or pull request,
+ # but only for the main and release branches
+ push:
+ branches:
+ - main
+ - v0.*
+ pull_request:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "verl/**/*.py"
+ # Entrypoints
+ - ".github/workflows/reward_model_sglang.yml"
+ - "tests/experimental/reward_loop/**"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare read-only permissions.
+ permissions:
+ contents: read
+
+ env:
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
+
+ jobs:
+ setup:
+ if: github.repository_owner == 'verl-project'
+ runs-on: ubuntu-latest
+ outputs:
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
+ steps:
+ - uses: actions/checkout@v4
+ - id: create-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "create"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-image: "${{ env.IMAGE }}"
+
+ reward_model_sglang:
+ needs: setup
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+ timeout-minutes: 30 # Increase this timeout value as needed
+ env:
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
+ NCCL_SHM_DISABLE: "1"
+ NCCL_P2P_DISABLE: "1"
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ with:
+ fetch-depth: 0
+ - name: Install the current repository
+ run: |
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps -e .
+ pip3 install sglang-router==0.2.2
+ - name: Prepare gsm8k dataset
+ run: |
+ ray stop --force
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_dir ${HOME}/data/gsm8k
+ - name: Running sglang generative reward model tests on 8 L20 GPUs
+ run: |
+ unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
+ ROLLOUT_NAME=sglang pytest -s -x tests/experimental/reward_loop/test_reward_model_genrm.py
+ - name: Running sglang discriminative reward model tests on 8 L20 GPUs
+ run: |
+ unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
+ ROLLOUT_NAME=sglang pytest -s -x tests/experimental/reward_loop/test_reward_model_disrm.py
+ - name: Running sglang agent loop with reward manager tests on 8 L20 GPUs
+ run: |
+ unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
+ ROLLOUT_NAME=sglang pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_standalone.py
+ - name: Running sglang agent loop with reward model colocate tests on 8 L20 GPUs
+ run: |
+ unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
+ ROLLOUT_NAME=sglang pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_colocate.py
+
+ cleanup:
+ runs-on: ubuntu-latest
+ needs: [setup, reward_model_sglang]
+ if: always()
+ steps:
+ - id: destroy-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "destroy"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/reward_model_vllm.yml ADDED
@@ -0,0 +1,134 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with GPUs available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` suffix run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ # - `cpu_unit_tests.yml` runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+ # - `gpu_unit_tests.yml` runs pytest on all test scripts without the `on_cpu.py` suffix.
+ # - Since CPU/GPU unit tests run all tests under `tests` by default, please make sure tests are manually excluded from them when
+ # - a new workflow yaml is added to `.github/workflows`
+ # - new tests are added to the workflows mentioned in 2.
+
+ name: reward_model_vllm
+
+ on:
+ # Trigger the workflow on push or pull request,
+ # but only for the main and release branches
+ push:
+ branches:
+ - main
+ - v0.*
+ pull_request:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "verl/**/*.py"
+ # Entrypoints
+ - ".github/workflows/reward_model_vllm.yml"
+ - "tests/experimental/reward_loop/**"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare read-only permissions.
+ permissions:
+ contents: read
+
+ env:
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
+
+ jobs:
+ setup:
+ if: github.repository_owner == 'verl-project'
+ runs-on: ubuntu-latest
+ outputs:
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
+ steps:
+ - uses: actions/checkout@v4
+ - id: create-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "create"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-image: "${{ env.IMAGE }}"
+
+ reward_model_vllm:
+ needs: setup
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
+ timeout-minutes: 30 # Increase this timeout value as needed
+ env:
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
+ NCCL_SHM_DISABLE: "1"
+ NCCL_P2P_DISABLE: "1"
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ with:
+ fetch-depth: 0
+ - name: Install the current repository
+ run: |
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps -e .
+ - name: Prepare gsm8k dataset
+ run: |
+ ray stop --force
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_dir ${HOME}/data/gsm8k
+ - name: Running vllm generative reward model tests on 8 L20 GPUs
+ run: |
+ unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
+ ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_reward_model_genrm.py
+ - name: Running vllm discriminative reward model tests on 8 L20 GPUs
+ run: |
+ unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
+ ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_reward_model_disrm.py
+
+ - name: Running vllm agent loop with reward manager tests on 8 L20 GPUs
+ run: |
+ unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
+ ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_standalone.py
+ - name: Running vllm agent loop with reward model colocate tests on 8 L20 GPUs
+ run: |
+ unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
+ ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_colocate.py
+
+ cleanup:
+ runs-on: ubuntu-latest
+ needs: [setup, reward_model_vllm]
+ if: always()
+ steps:
+ - id: destroy-runner
+ uses: volcengine/vemlp-github-runner@v1
+ with:
+ mode: "destroy"
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/reward_model_vllm_ascend.yml ADDED
@@ -0,0 +1,113 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with GPUs available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` suffix run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ # - `cpu_unit_tests.yml` runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+ # - `gpu_unit_tests.yml` runs pytest on all test scripts without the `on_cpu.py` suffix.
+ # - Since CPU/GPU unit tests run all tests under `tests` by default, please make sure tests are manually excluded from them when
+ # - a new workflow yaml is added to `.github/workflows`
+ # - new tests are added to the workflows mentioned in 2.
+
+ name: reward_model_vllm_ascend
+
+ on:
+ # Trigger the workflow on push or pull request,
+ # but only for the main and release branches
+ push:
+ branches:
+ - main
+ - v0.*
+ pull_request:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "verl/**/*.py"
+ # Entrypoints
+ - ".github/workflows/reward_model_vllm_ascend.yml"
+ - "tests/experimental/reward_loop/**"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare read-only permissions.
+ permissions:
+ contents: read
+
+ jobs:
+ reward_model_vllm_ascend:
+ if: github.repository_owner == 'verl-project'
+ runs-on: linux-aarch64-a2b3-8
+ timeout-minutes: 60 # Increase this timeout value as needed
+ container:
+ image: swr.cn-southwest-2.myhuaweicloud.com/modelfoundry/ascend-ci/verl/verl:verl-8.5.0-910b-ubuntu22.04-py3.11-latest
+ options: >-
+ --shm-size 16g
+ env:
+ HF_ENDPOINT: "https://hf-mirror.com"
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
+ steps:
+ - name: Check NPU and CANN info
+ run: |
+ cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+ npu-smi info
+ - name: Check initial pip list from image
+ run: |
+ pip list
+ - name: Checkout verl-project/verl repo
+ uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+ clean: true
+ - name: Install the current repository
+ run: |
+ pip install -r requirements-npu.txt
+ pip install --no-deps -e .[test]
+ - name: Check final pip list
+ run: |
+ pip list
+ - name: Prepare weights
+ run: |
+ ln -s /root/.cache/models ~/models
+ - name: Prepare gsm8k dataset
+ run: |
+ ray stop --force
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k --local_dir ${HOME}/data/gsm8k
+ - name: Running vllm generative reward model tests on 8 NPUs
+ run: |
+ ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_reward_model_genrm.py
+ - name: Running vllm discriminative reward model tests on 8 NPUs
+ run: |
+ ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_reward_model_disrm.py
+ - name: Running vllm agent loop with reward manager tests on 8 NPUs
+ run: |
+ ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_standalone.py
+ - name: Running vllm agent loop with reward model colocate tests on 8 NPUs
+ run: |
+ export HCCL_HOST_SOCKET_PORT_RANGE=auto
+ export HCCL_NPU_SOCKET_PORT_RANGE=auto
+ ROLLOUT_NAME=vllm pytest -s -x tests/experimental/reward_loop/test_agent_reward_loop_colocate.py
.github/workflows/sanity.yml ADDED
@@ -0,0 +1,108 @@
+ # # Tests layout
+
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
+ # - `tests/models` for testing functionality related to `verl/models`
+ # - ...
+
+ # There are a few folders with `special_` prefix, created for special purposes:
+ # - `special_distributed`: unit tests that must run with multiple GPUs
+ # - `special_e2e`: end-to-end tests with training/generation scripts
+ # - `special_npu`: tests for NPUs
+ # - `special_sanity`: a suite of quick sanity tests
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
+
+ # Accelerators for tests
+ # - By default, tests run with GPUs available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
+ # - Test scripts with the `on_cpu.py` suffix run on CPU resources in a Linux environment.
+
+ # # Workflow layout
+
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+ # 3. End-to-end tests: `e2e_*.yml`
+ # 4. Unit tests
+ # - `cpu_unit_tests.yml` runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+ # - `gpu_unit_tests.yml` runs pytest on all test scripts without the `on_cpu.py` suffix.
+ # - Since CPU/GPU unit tests run all tests under `tests` by default, please make sure tests are manually excluded from them when
+ # - a new workflow yaml is added to `.github/workflows`
+ # - new tests are added to the workflows mentioned in 2.
+
+ name: sanity
+
+ on:
+ # Trigger the workflow on push or pull request,
+ # but only for the main and release branches
+ push:
+ branches:
+ - main
+ - v0.*
+ pull_request:
+ branches:
+ - main
+ - v0.*
+ paths:
+ - "**/*.py"
+ - .github/workflows/sanity.yml
+ - "tests/special_sanity/**"
+
+ # Cancel jobs on the same ref if a new one is triggered
+ concurrency:
+ group: ${{ github.workflow }}-${{ github.ref }}
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+ # Declare read-only permissions.
+ permissions:
+ contents: read
+
+ jobs:
+ sanity:
+ runs-on: ubuntu-latest
+ timeout-minutes: 5 # Increase this timeout value as needed
+ strategy:
+ matrix:
+ python-version: ["3.10"]
+ steps:
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+ - name: Set up Python ${{ matrix.python-version }}
+ uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
+ with:
+ python-version: ${{ matrix.python-version }}
+ - name: Install the current repository
+ run: |
+ pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cpu
+ pip3 install -r requirements.txt
+ pip3 install -r requirements-test.txt
+ pip3 install --no-deps -e .
+ - name: Run sanity test
+ run: |
+ pytest -s -x tests/special_sanity
+ - name: Run license test
+ run: |
+ python3 tests/special_sanity/check_license.py --directories .
+ - name: Assert naming convention
+ run: |
+ if grep -rIn --exclude-dir=.git --exclude-dir=.github --exclude-dir=venv --exclude-dir=__pycache__ 'veRL' .; then
+ echo "Please use verl instead of veRL in the codebase"
+ exit 1
+ fi
+ - name: Assert SGLang naming convention
+ run: |
+ if grep -rIn --exclude-dir=.git --exclude-dir=.github --exclude-dir=venv --exclude-dir=__pycache__ --exclude=ascend_sglang_best_practices.rst -E 'Sglang|sgLang|sglAng|sglaNg|sglanG' .; then
+ echo "Please use SGLang or sglang as the formal name of SGLang rollout engine"
+ exit 1
+ fi
+ - name: Validate test folder structure
+ run: python3 tests/special_sanity/validate_structure.py
+ - name: Assert documentation requirement for functions
+ run: python3 tests/special_sanity/validate_imported_docs.py
+ - name: Assert device api usage in verl/verl
+ run: python3 tests/special_sanity/check_device_api_usage.py --directory ./verl
+ - name: Assert documentation time info
+ run: python3 tests/special_sanity/check_docs_time_info.py
+ - name: Check docstrings for specified files
+ run: python3 tests/special_sanity/check_docstrings.py
+ - name: Check DataProto for specified folders
+ run: python3 tests/special_sanity/check_dataproto_usage.py -d ./verl/workers/engine
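The naming-convention steps above hinge on `grep -rIn` exiting zero only when a forbidden spelling is found. A small self-contained sketch of that pattern (the directory and file are throwaway examples):

```shell
# Build a tiny tree and flag a violation only if the forbidden
# spelling appears; grep exits non-zero when there is no match.
mkdir -p /tmp/naming-demo
echo 'use verl here' > /tmp/naming-demo/ok.txt
if grep -rIn 'veRL' /tmp/naming-demo; then
  echo "violation found"
else
  echo "clean"
fi
```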
.github/workflows/scorecard.yml ADDED
@@ -0,0 +1,66 @@
+ # This workflow uses actions that are not certified by GitHub. They are provided
+ # by a third-party and are governed by separate terms of service, privacy
+ # policy, and support documentation.
+
+ name: Scorecard supply-chain security
+ on:
+ # For Branch-Protection check. Only the default branch is supported. See
+ # https://github.com/ossf/scorecard/blob/main/docs/checks.md#branch-protection
+ branch_protection_rule:
+ # To guarantee Maintained check is occasionally updated. See
+ # https://github.com/ossf/scorecard/blob/main/docs/checks.md#maintained
+ schedule:
+ - cron: "27 7 * * 1"
+ push:
+ branches:
+ - main
+ - v0.*
+
+ # Declare default permissions as read only.
+ permissions: read-all
+
+ jobs:
+ analysis:
+ name: Scorecard analysis
+ runs-on: ubuntu-latest
+ permissions:
+ # Needed to upload the results to code-scanning dashboard.
+ security-events: write
+ # Needed to publish results and get a badge (see publish_results below).
+ id-token: write
+ # Uncomment the permissions below if installing in a private repository.
+ # contents: read
+ # actions: read
+
+ steps:
+ - name: "Checkout code"
+ uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
+ with:
+ persist-credentials: false
+
+ - name: "Run analysis"
+ uses: ossf/scorecard-action@0864cf19026789058feabb7e87baa5f140aac736 # v2.3.1
+ with:
+ results_file: results.sarif
+ results_format: sarif
+ # (Optional) "write" PAT token. Uncomment the `repo_token` line below if:
+ # - you want to enable the Branch-Protection check on a *public* repository, or
+ # - you are installing Scorecard on a *private* repository
+ # To create the PAT, follow the steps in https://github.com/ossf/scorecard-action?tab=readme-ov-file#authentication-with-fine-grained-pat-optional.
+ # repo_token: ${{ secrets.SCORECARD_TOKEN }}
+
+ # Public repositories:
+ # - Publish results to OpenSSF REST API for easy access by consumers
+ # - Allows the repository to include the Scorecard badge.
+ # - See https://github.com/ossf/scorecard-action#publishing-results.
+ # For private repositories:
+ # - `publish_results` will always be set to `false`, regardless
+ # of the value entered here.
+ publish_results: true
+
+ # Upload the results to GitHub's code scanning dashboard (optional).
+ # Commenting out will disable upload of results to your repo's Code Scanning dashboard
+ - name: "Upload to code-scanning"
+ uses: github/codeql-action/upload-sarif@9e8d0789d4a0fa9ceb6b1738f7e269594bdd67f0 #v3.28.9
+ with:
+ sarif_file: results.sarif
.github/workflows/secrets_scan.yml ADDED
@@ -0,0 +1,22 @@
+ on:
+ push:
+ branches:
+ - main
+ - v0.*
+ pull_request:
+
+ permissions:
+ contents: read
+
+ jobs:
+ test:
+ runs-on: ubuntu-latest
+ steps:
+ - name: Checkout code
+ uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
+ with:
+ fetch-depth: 0
+ - name: Secret Scanning
+ uses: trufflesecurity/trufflehog@7dc056a193116ba8d82154bf0549381c8fb8545c # v3.88.14
+ with:
+ extra_args: --results=verified,unknown
.github/workflows/sgl.yml ADDED
@@ -0,0 +1,165 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of test that are designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
17
+ # - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `check-pr-title,yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml`, run pytest on all scripts with file without the `on_cpu.py` suffix.
28
+ # - Since cpu/gpu unit tests by default runs all tests under `tests`, please make sure tests are manually excluded in them when
29
+ # - new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to workflow mentioned in 2.
31
+
32
+ name: sgl
33
+
34
+ on:
35
+ # workflow_dispatch: # Manual
36
+ # Trigger the workflow on push or pull request,
37
+ # but only for the main branch
38
+ push:
39
+ branches:
40
+ - main
41
+ - v0.*
42
+ paths:
43
+ - "**/*.py"
44
+ - .github/workflows/sgl.yml
45
+ pull_request:
46
+ branches:
47
+ - main
48
+ - v0.*
49
+ paths:
50
+ - "**/*.py"
51
+ # Other entrypoints
52
+ - "!examples/**"
53
+ - "!tests/**"
54
+ - "!verl/trainer/main_*.py"
55
+ - "!verl/trainer/fsdp_sft_trainer.py" # FSDP
56
+ - "!verl/workers/**/*dp_*.py"
57
+ # Megatron
58
+ - "!verl/workers/**/megatron_*.py"
59
+ # vLLM
60
+ - "!**/*vllm*"
61
+
62
+ # Entrypoints
63
+ - ".github/workflows/sgl.yml"
64
+ - "tests/rollout/*sglang*"
65
+ - "tests/rollout/async_rollout_utils.py"
66
+ - "tests/workers/rollout/*interaction*"
67
+
68
+ # Cancel jobs on the same ref if a new one is triggered
69
+ concurrency:
70
+ group: ${{ github.workflow }}-${{ github.ref }}
71
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
72
+
73
+ # Declare read-only permissions for repository contents.
74
+ permissions:
75
+ contents: read
76
+
77
+ env:
78
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:sgl059.dev2"
79
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
80
+
81
+ jobs:
82
+ setup:
83
+ if: github.repository_owner == 'verl-project'
84
+ runs-on: ubuntu-latest
85
+ outputs:
86
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
87
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
88
+ steps:
89
+ - uses: actions/checkout@v4
90
+ - id: create-runner
91
+ uses: volcengine/vemlp-github-runner@v1
92
+ with:
93
+ mode: "create"
94
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
95
+ mlp-image: "${{ env.IMAGE }}"
96
+
97
+ sgl:
98
+ needs: setup
99
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
100
+ timeout-minutes: 35 # Increase this timeout value as needed
101
+ env:
102
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
103
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
104
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
105
+ HF_ENDPOINT: "https://hf-mirror.com"
106
+ HF_HUB_ENABLE_HF_TRANSFER: 1
107
+ SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
108
+ NCCL_SHM_DISABLE: "1"
109
+ NCCL_P2P_DISABLE: "1"
110
+ steps:
111
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
112
+ with:
113
+ fetch-depth: 0
114
+ - name: Install the current repository
115
+ run: |
116
+ pip3 install cupy-cuda12x==13.6.0 pytest-asyncio
117
+ pip3 install hf_transfer fastmcp pytest-asyncio
118
+ pip3 install -r requirements-test.txt
119
+ pip3 install --no-deps -e .
120
+ - name: Prepare gsm8k dataset
121
+ run: |
122
+ ray stop --force
123
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
124
+ - name: Test the latest SGLang Rollout async with agent loop
125
+ run: |
126
+ ROLLOUT_NAME=sglang pytest -svvv tests/experimental/agent_loop
127
+
128
+ sgl_checkpoint_engine:
129
+ needs: setup
130
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
131
+ timeout-minutes: 35 # Increase this timeout value as needed
132
+ env:
133
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
134
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
135
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
136
+ HF_ENDPOINT: "https://hf-mirror.com"
137
+ HF_HUB_ENABLE_HF_TRANSFER: 1
138
+ SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
139
+ NCCL_SHM_DISABLE: "1"
140
+ NCCL_P2P_DISABLE: "1"
141
+ steps:
142
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
143
+ with:
144
+ fetch-depth: 0
145
+ - name: Install the current repository
146
+ run: |
147
+ pip3 install cupy-cuda12x==13.6.0 pytest-asyncio
148
+ pip3 install hf_transfer fastmcp pytest-asyncio
149
+ pip3 install -r requirements-test.txt
150
+ pip3 install --no-deps -e .
151
+ - name: Test SGLang ServerAdapter with Checkpoint Engine (NCCL)
152
+ run: |
153
+ ROLLOUT_NAME=sglang pytest -svvv tests/checkpoint_engine/test_special_server_adapter.py
154
+
155
+ cleanup:
156
+ runs-on: ubuntu-latest
157
+ needs: [setup, sgl, sgl_checkpoint_engine]
158
+ if: always()
159
+ steps:
160
+ - id: destroy-runner
161
+ uses: volcengine/vemlp-github-runner@v1
162
+ with:
163
+ mode: "destroy"
164
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
165
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.github/workflows/type-coverage-check.yml ADDED
@@ -0,0 +1,31 @@
1
+ name: Type Annotation and Docstring Coverage
2
+
3
+ on:
4
+ pull_request:
5
+ paths:
6
+ - '**/*.py'
7
+ - '.github/workflows/type-coverage-check.yml'
8
+
9
+ jobs:
10
+ type-coverage-check:
11
+ runs-on: ubuntu-latest
12
+ steps:
13
+ - uses: actions/checkout@v4
14
+ with:
15
+ fetch-depth: 0 # 🚨 Important: fetch full history so `origin/main` is available
16
+ - name: Set up Python
17
+ uses: actions/setup-python@v5
18
+ with:
19
+ python-version: '3.10'
20
+
21
+ - name: Install dependencies
22
+ run: |
23
+ pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cpu
24
+ pip3 install -r requirements.txt
25
+ pip3 install --no-deps -e .
26
+ - name: Run type annotation coverage check
27
+ run: |
28
+ python3 tests/special_sanity/type_coverage_check.py
29
+ - name: Run docstring coverage check
30
+ run: |
31
+ python3 tests/special_sanity/check_api_docs.py verl
.github/workflows/vllm.yml ADDED
@@ -0,0 +1,169 @@
1
+ # # Tests layout
2
+
3
+ # Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
4
+ # - `tests/trainer` for testing functionality related to `verl/trainer`
5
+ # - `tests/models` for testing functionality related to `verl/models`
6
+ # - ...
7
+
8
+ # There are a few folders with `special_` prefix, created for special purposes:
9
+ # - `special_distributed`: unit tests that must run with multiple GPUs
10
+ # - `special_e2e`: end-to-end tests with training/generation scripts
11
+ # - `special_npu`: tests for NPUs
12
+ # - `special_sanity`: a suite of quick sanity tests
13
+ # - `special_standalone`: a set of tests designed to run in dedicated environments
14
+
15
+ # Accelerators for tests
16
+ # - By default, tests run with GPUs available, except those under `special_npu` and any test script whose name ends with `on_cpu.py`.
17
+ # - Test scripts with the `on_cpu.py` suffix run on CPU resources in a Linux environment.
18
+
19
+ # # Workflow layout
20
+
21
+ # All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
22
+ # 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
23
+ # 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
24
+ # 3. End-to-end tests: `e2e_*.yml`
25
+ # 4. Unit tests
26
+ # - `cpu_unit_tests.yml` runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
27
+ # - `gpu_unit_tests.yml` runs pytest on all test scripts without the `on_cpu.py` suffix.
28
+ # - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
29
+ # - a new workflow yaml is added to `.github/workflows`
30
+ # - new tests are added to the workflows mentioned in 2.
31
+
32
+ name: vllm
33
+
34
+ on:
35
+ # Trigger the workflow on push or pull request,
36
+ # but only for the main branch
37
+ push:
38
+ branches:
39
+ - main
40
+ - v0.*
41
+ pull_request:
42
+ branches:
43
+ - main
44
+ - v0.*
45
+ paths:
46
+ - "**/*.py"
47
+ # Other entrypoints
48
+ - "!examples/**"
49
+ - "!tests/**"
50
+ - "!verl/trainer/main_*.py"
51
+ - "!verl/trainer/fsdp_sft_trainer.py"
52
+ # FSDP
53
+ - "!verl/workers/**/*dp_*.py"
54
+ # Megatron
55
+ - "!verl/workers/**/megatron_*.py"
56
+ # SGLang
57
+ - "!**/*sglang*"
58
+ # Entrypoints
59
+ - ".github/workflows/vllm.yml"
60
+ - "tests/special_e2e/generation"
61
+ - "tests/workers/rollout"
62
+ - "verl/trainer/main_generation.py"
63
+ - "verl/trainer/config/generation.yaml"
64
+
65
+ # Cancel jobs on the same ref if a new one is triggered
66
+ concurrency:
67
+ group: ${{ github.workflow }}-${{ github.ref }}
68
+ cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
69
+
70
+ # Declare read-only permissions for repository contents.
71
+ permissions:
72
+ contents: read
73
+
74
+ env:
75
+ IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:vllm017.dev2"
76
+ DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
77
+
78
+ jobs:
79
+ setup:
80
+ if: github.repository_owner == 'verl-project'
81
+ runs-on: ubuntu-latest
82
+ outputs:
83
+ runner-label: ${{ steps.create-runner.outputs.runner-label }}
84
+ mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
85
+ steps:
86
+ - uses: actions/checkout@v4
87
+ - id: create-runner
88
+ uses: volcengine/vemlp-github-runner@v1
89
+ with:
90
+ mode: "create"
91
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
92
+ mlp-image: "${{ env.IMAGE }}"
93
+
94
+ vllm:
95
+ needs: setup
96
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
97
+ timeout-minutes: 35 # Increase this timeout value as needed
98
+ env:
99
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
100
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
101
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
102
+ HF_ENDPOINT: "https://hf-mirror.com"
103
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
104
+ steps:
105
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
106
+ with:
107
+ fetch-depth: 0
108
+ - name: Install the current repository
109
+ run: |
110
+ pip3 install -r requirements-test.txt
111
+ pip3 install --no-deps -e .
112
+ pip3 install --upgrade "transformers<5.0"
113
+ # - name: Download Model to Use
114
+ # run: |
115
+ # hf download Qwen/Qwen2.5-0.5B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct
116
+ # hf download Qwen/Qwen2.5-1.5B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-1.5B-Instruct
117
+ # hf download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ${HOME}/models/Qwen/Qwen2.5-VL-3B-Instruct
118
+ # hf download OldKingMeister/Qwen2.5-1.5B-Instruct-YaRN --local-dir ${HOME}/models/OldKingMeister/Qwen2.5-1.5B-Instruct-YaRN
119
+ # export HF_HUB_OFFLINE=1
120
+ - name: Prepare gsm8k dataset
121
+ run: |
122
+ ray stop --force
123
+ python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
124
+ - name: Test the latest vLLM Rollout async with agent loop
125
+ run: |
126
+ ROLLOUT_NAME=vllm pytest -svvv tests/experimental/agent_loop
127
+ - name: Test vllm server abort functionality
128
+ run: |
129
+ pytest tests/workers/rollout/rollout_vllm/test_vllm_abort.py -v -s
130
+
131
+ vllm_checkpoint_engine:
132
+ needs: setup
133
+ runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
134
+ timeout-minutes: 35 # Increase this timeout value as needed
135
+ env:
136
+ HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
137
+ HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
138
+ NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
139
+ HF_ENDPOINT: "https://hf-mirror.com"
140
+ HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
141
+ steps:
142
+ - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
143
+ with:
144
+ fetch-depth: 0
145
+ - name: Install the current repository
146
+ run: |
147
+ pip3 install pytest-asyncio
148
+ pip3 install -r requirements-test.txt
149
+ pip3 install --no-deps -e .
150
+ pip3 install --upgrade "transformers<5.0"
151
+ pip3 install cupy-cuda12x==13.6.0
152
+ - name: Test vLLM ServerAdapter with Checkpoint Engine (NCCL)
153
+ run: |
154
+ ROLLOUT_NAME=vllm pytest -svvv tests/checkpoint_engine/test_special_server_adapter.py
155
+ - name: Test bucketed weight transfer
156
+ run: |
157
+ pytest -svvv tests/utils/test_bucketed_weight_transfer.py
158
+
159
+ cleanup:
160
+ runs-on: ubuntu-latest
161
+ needs: [setup, vllm, vllm_checkpoint_engine]
162
+ if: always()
163
+ steps:
164
+ - id: destroy-runner
165
+ uses: volcengine/vemlp-github-runner@v1
166
+ with:
167
+ mode: "destroy"
168
+ faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
169
+ mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
.gitignore ADDED
@@ -0,0 +1,139 @@
1
+ **/*.pt
2
+ **/checkpoints
3
+ **/wget-log
4
+ **/_build/
5
+ **/*.ckpt
6
+ **/outputs
7
+ **/*.tar.gz
8
+ **/playground
9
+ **/wandb
10
+
11
+ /pyrightconfig.json
12
+
13
+ # Byte-compiled / optimized / DLL files
14
+ __pycache__/
15
+ *.py[cod]
16
+ *$py.class
17
+ dataset/*
18
+ tensorflow/my_graph/*
19
+ .idea/
20
+ # C extensions
21
+ *.so
22
+
23
+ # Distribution / packaging
24
+ .Python
25
+ # env/
26
+ build/
27
+ develop-eggs/
28
+ dist/
29
+ downloads/
30
+ eggs/
31
+ .eggs/
32
+ lib/
33
+ lib64/
34
+ parts/
35
+ sdist/
36
+ var/
37
+ tmp/
38
+ *.egg-info/
39
+ .installed.cfg
40
+ *.egg
41
+
42
+ # PyInstaller
43
+ # Usually these files are written by a python script from a template
44
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
45
+ *.manifest
46
+ *.spec
47
+
48
+ # Installer logs
49
+ pip-log.txt
50
+ pip-delete-this-directory.txt
51
+
52
+ # Unit test / coverage reports
53
+ htmlcov/
54
+ .tox/
55
+ .coverage
56
+ .coverage.*
57
+ .cache
58
+ nosetests.xml
59
+ coverage.xml
60
+ *,cover
61
+ .hypothesis/
62
+ pytest.ini
63
+ output.txt
64
+
65
+ # Translations
66
+ *.mo
67
+ *.pot
68
+
69
+ # Django stuff:
70
+ *.log
71
+ local_settings.py
72
+
73
+ # Flask stuff:
74
+ instance/
75
+ .webassets-cache
76
+
77
+ # Scrapy stuff:
78
+ .scrapy
79
+
80
+ # Sphinx documentation
81
+ docs/_build/
82
+
83
+ # PyBuilder
84
+ target/
85
+
86
+ # IPython Notebook
87
+ .ipynb_checkpoints
88
+
89
+ # pyenv
90
+ .python-version
91
+
92
+ # celery beat schedule file
93
+ celerybeat-schedule
94
+
95
+ # dotenv
96
+ .env
97
+
98
+ # virtualenv
99
+ venv/
100
+ .venv/
101
+ ENV/
102
+
103
+ # Spyder project settings
104
+ .spyderproject
105
+
106
+ # Rope project settings
107
+ .ropeproject
108
+
109
+ # vscode
110
+ .vscode
111
+
112
+ # Mac
113
+ .DS_Store
114
+
115
+ # vim
116
+ *.swp
117
+
118
+ # emacs
119
+ *~
120
+
121
+ # ckpt
122
+ *.lock
123
+
124
+ # data
125
+ *.parquet
126
+ /eval/data/
127
+
128
+
129
+ # local logs
130
+ logs
131
+ log
132
+ outputs
133
+ .history
134
+ /checkpoints/
135
+ /outputs/
136
+
137
+ eval/data/
.gitmodules ADDED
@@ -0,0 +1,3 @@
1
+ [submodule "recipe"]
2
+ path = recipe
3
+ url = https://github.com/verl-project/verl-recipe.git
.pre-commit-config.yaml ADDED
@@ -0,0 +1,45 @@
1
+ repos:
2
+ - repo: https://github.com/astral-sh/ruff-pre-commit
3
+ rev: "v0.12.2"
4
+ hooks:
5
+ - id: ruff
6
+ args: ["--fix", "--show-fixes", "--output-format=full"]
7
+ exclude: ^.*\.(ipynb)$
8
+ - id: ruff-format
9
+
10
+ - repo: https://github.com/pre-commit/mirrors-mypy
11
+ rev: "v1.17.0"
12
+ hooks:
13
+ - id: mypy
14
+
15
+ - repo: local
16
+ hooks:
17
+ - id: autogen-trainer-cfg
18
+ name: Generate and verify verl/trainer/config/_generated_*.yaml
19
+ entry: scripts/generate_trainer_config.sh
20
+ language: script
21
+ pass_filenames: false
22
+
23
+ - repo: local
24
+ hooks:
25
+ - id: check-docstrings
26
+ name: Check doc string coverage
27
+ entry: python3 tests/special_sanity/check_docstrings.py
28
+ language: python
29
+ pass_filenames: false
30
+
31
+ - repo: local
32
+ hooks:
33
+ - id: check-license
34
+ name: Check license
35
+ entry: python3 tests/special_sanity/check_license.py --directories examples scripts tests verl setup.py
36
+ language: python
37
+ pass_filenames: false
38
+
39
+ - repo: local
40
+ hooks:
41
+ - id: compileall
42
+ name: Compile all python files
43
+ entry: sh -c 'PYTHONWARNINGS=error python3 -m compileall -q . -x "(^|[\\/])(\.venv|venv|\.git)([\\/]|$)"'
44
+ language: python
45
+ pass_filenames: false
.readthedocs.yaml ADDED
@@ -0,0 +1,19 @@
1
+ # Read the Docs configuration file
2
+ # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
3
+
4
+ version: 2
5
+
6
+ build:
7
+ os: ubuntu-22.04
8
+ tools:
9
+ python: "3.11"
10
+ rust: "1.70"
11
+
12
+ sphinx:
13
+ configuration: docs/conf.py
14
+
15
+ python:
16
+ install:
17
+ - requirements: docs/requirements-docs.txt
18
+ - method: pip
19
+ path: .
CONTRIBUTING.md ADDED
@@ -0,0 +1,90 @@
1
+ # Contributing to verl
2
+
3
+ Thank you for considering a contribution to verl! We welcome contributions of any kind - bug fixes, enhancements, documentation improvements, or even just feedback. Whether you're an experienced developer or this is your first open-source project, your help is invaluable.
4
+
5
+ Your support can take many forms:
6
+ - Report issues or unexpected behaviors.
7
+ - Suggest or implement new features.
8
+ - Improve or expand documentation.
9
+ - Review pull requests and assist other contributors.
10
+ - Spread the word: share verl in blog posts, social media, or give the repo a ⭐.
11
+
12
+ ## Finding Issues to Contribute
13
+
14
+ Looking for ways to dive in? Check out these issues:
15
+ - [Good first issues](https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22)
16
+ - [Call for contribution](https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22call%20for%20contribution%22)
17
+ Furthermore, you can learn about the development plan and roadmap via [RFC](https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20label%3ARFC) and [Roadmap](https://github.com/volcengine/verl/issues?q=state%3Aopen%20label%3A%22roadmap%22).
18
+
19
+
20
+ ## Developing
21
+
22
+ - **Python-only**: install verl via `pip install -e .[test,vllm]` or `pip install -e .[test,sglang]` and iterate quickly. For full dependency setup, check out the verl [installation doc](https://verl.readthedocs.io/en/latest/start/install.html).
23
+
24
+ ## Code Linting and Formatting
25
+
26
+ We rely on pre-commit to keep our code consistent. To set it up:
27
+
28
+ ```bash
29
+ pip install pre-commit
30
+ pre-commit install
31
+ # for staged changes
32
+ pre-commit run
33
+ # for all files in the repo
34
+ pre-commit run --all-files
35
+ # run a specific hook with pre-commit
36
+ # pre-commit run --all-files --show-diff-on-failure --color=always <hook-id>
37
+ pre-commit run --all-files --show-diff-on-failure --color=always ruff
38
+ pre-commit run --all-files --show-diff-on-failure --color=always autogen-trainer-cfg
39
+ ```
40
+
41
+ ## Testing
42
+
43
+ Our test suites run on GitHub Actions. Check these workflows for details:
44
+ - [GPU unit tests](https://github.com/volcengine/verl/blob/main/.github/workflows/gpu_unit_tests.yml)
45
+ - [CPU unit tests](https://github.com/volcengine/verl/blob/main/.github/workflows/cpu_unit_tests.yml)
46
+ - [vLLM tests](https://github.com/volcengine/verl/blob/main/.github/workflows/vllm.yml)
47
+ - [SGLang tests](https://github.com/volcengine/verl/blob/main/.github/workflows/sgl.yml)
48
+
49
+ ### Adding CI tests
50
+
51
+ If possible, please add CI test(s) for your new feature:
52
+
53
+ 1. Find the most relevant workflow yml file, which usually corresponds to a `hydra` default config (e.g. `ppo_trainer`, `ppo_megatron_trainer`, `sft_trainer`, etc).
54
+ 2. Add related path patterns to the `paths` section if not already included.
55
+ 3. Minimize the workload of the test script(s) (see existing scripts for examples).
56
+
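For step 2, a hedged sketch of what such a `paths` addition might look like — the workflow file and test name below are illustrative, not taken from the actual workflows:

```yaml
# Hypothetical excerpt of an existing workflow, e.g. .github/workflows/sgl.yml
on:
  pull_request:
    paths:
      - "tests/rollout/*sglang*"
      - "tests/rollout/test_my_new_feature.py"  # register your new test file here
```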
57
+ ## Building the Docs
58
+ ```
59
+ # Ensure verl is on your PYTHONPATH, e.g.:
60
+ pip install -e .[test]
61
+
62
+ # Install documentation dependencies
63
+ cd docs
64
+ pip install -r requirements-docs.txt
65
+
66
+ # Generate HTML docs
67
+ make clean
68
+ make html
69
+
70
+ # Preview locally
71
+ python -m http.server -d _build/html/
72
+ ```
73
+ Open your browser at http://localhost:8000 to explore the docs.
74
+
75
+ ## Pull Requests & Code Reviews
76
+
77
+ Thanks for submitting a PR! To streamline reviews:
78
+ - Follow our Pull Request Template for title format and checklist.
79
+ - Adhere to our pre-commit lint rules and ensure all checks pass.
80
+ - Update docs for any user-facing changes.
81
+ - Add or update tests in the CI workflows, or explain why tests aren't applicable.
82
+
83
+ ## License
84
+
85
+ See the [LICENSE](https://github.com/volcengine/verl/blob/main/LICENSE) file for full details.
86
+
87
+ ## Thank You
88
+
89
+ We appreciate your contributions to verl. Your efforts help make the project stronger and more user-friendly. Happy coding!
90
+