Lekr0 committed
Commit 40d87dd · verified · Parent(s): a4f7c99

Add files using upload-large-folder tool

Files changed (50)
  1. IDEA_REPORT.md +187 -0
  2. datasets/_workspace_hanrui_datasets_HuggingFaceH4___aime_2024_default_0.0.0_2fe88a2f1091d5048c0f36abc874fb997b3dd99a.lock +0 -0
  3. datasets/_workspace_hanrui_datasets_MathArena___aime_2025_default_0.0.0_beca2d7875cf92cdac07acefbccad3c4d16e2916.lock +0 -0
  4. datasets/_workspace_hanrui_datasets_google-research-datasets___mbpp_sanitized_0.0.0_4bb6404fdc6cacfda99d4ac4205087b89d32030c.lock +0 -0
  5. datasets/_workspace_hanrui_datasets_json_default-3ab01998402731b9_0.0.0_c181ad2be84b86e0b75142bbe88bda3f4906d051ee75b5ff536a5dba0ffbe8f2.lock +0 -0
  6. datasets/_workspace_hanrui_datasets_princeton-nlp___swe-bench_lite_default_0.0.0_6ec7bb89b9342f664a54a6e0a6ea6501d3437cc2.lock +0 -0
  7. datasets/_workspace_hanrui_datasets_tatsu-lab___alpaca_default_0.0.0_dce01c9b08f87459cf36a430d809084718273017.lock +0 -0
  8. datasets/download_nemotron_codealpha.sh +10 -0
  9. manage_subgits.sh +87 -0
  10. nohup.out +48 -0
  11. progress/dflash_lora_changelog.md +232 -0
  12. progress/list.md +12 -0
  13. progress/oom_fix_progress.md +42 -0
  14. progress/requirements.txt +20 -0
  15. progress/step1.md +139 -0
  16. sglang/.codespellrc +3 -0
  17. sglang/.editorconfig +25 -0
  18. sglang/.isort.cfg +3 -0
  19. sglang/.pre-commit-config.yaml +83 -0
  20. sglang/CODE_OF_CONDUCT.md +128 -0
  21. sglang/LICENSE +201 -0
  22. sglang/README.md +90 -0
  23. syxin_old/DFLASH_LORA_INJECT_FIXES.md +142 -0
  24. syxin_old/backup.log +0 -0
  25. syxin_old/dflash_8gpu_03-31-13:40.log +552 -0
  26. syxin_old/diagnostic_compare.py +301 -0
  27. syxin_old/eval_alignment_diff.md +132 -0
  28. syxin_old/eval_dflash_b16_baseline.py +354 -0
  29. syxin_old/eval_dflash_b16_baseline_changelog.md +143 -0
  30. syxin_old/eval_dflash_lora_inject.py +660 -0
  31. syxin_old/eval_gsm8k_humaneval_mtbench.log +81 -0
  32. syxin_old/eval_run.log +0 -0
  33. syxin_old/launch_train.sh +37 -0
  34. syxin_old/launch_train_dflash_wrapper.py +17 -0
  35. syxin_old/launch_train_random_anchor.py +15 -0
  36. syxin_old/launch_train_wrapper.py +21 -0
  37. syxin_old/list.md +12 -0
  38. syxin_old/merge_lora.py +66 -0
  39. syxin_old/oom_fix_progress.md +42 -0
  40. syxin_old/random_anchor_plan.md +82 -0
  41. syxin_old/requirements.txt +0 -0
  42. syxin_old/run_bench_dflash.sh +71 -0
  43. syxin_old/run_bench_dflash_b16_baseline.sh +60 -0
  44. syxin_old/run_bench_dflash_lora_inject.sh +60 -0
  45. syxin_old/run_qwen3_8b_sft_64gpu.sh +31 -0
  46. syxin_old/run_train_dflash_lora_inject.sh +73 -0
  47. syxin_old/run_train_multinode.sh +67 -0
  48. syxin_old/run_train_multinode_random_anchor.sh +72 -0
  49. syxin_old/start_server.sh +42 -0
  50. syxin_old/start_server_dflash.sh +54 -0
IDEA_REPORT.md ADDED
@@ -0,0 +1,187 @@
# DFlash Improvement Ideas: Higher Acceptance Length Without Training

**Goal:** Improve DFlash's acceptance length (tau) and acceleration ratio using only inference-time modifications — no additional training.

**Baseline:** Qwen3-4B + z-lab/Qwen3-4B-DFlash-b16, block_size=16, math500 (10 samples, 512 tokens)
- **Baseline avg tau = 8.63**, median = 8.0

---
## Idea 1: Iterative Block Refinement (Multi-Step Denoising)⭐⭐⭐⭐⭐

**Core Idea:** Run the DFlash draft model multiple times on the same block. After each pass, use the sampled tokens as updated noise embeddings for the next pass, mimicking multi-step diffusion denoising.

**Why it might work:** DFlash currently uses a single forward pass to predict all block tokens from mask tokens. The initial mask embeddings carry no information about what the draft should generate. By iterating, each pass conditions on an increasingly informed noise context — the first pass gives a rough draft, the second pass refines it with better token embeddings as context.

**Implementation complexity:** Low. Just loop the draft forward pass 2-3 times, feeding output back as input. No KV cache across steps.

**Expected improvement:** +0.5 to +2.0 tau (denoising is the core mechanism of diffusion models — more steps should help).

**Risk:** Extra draft compute may negate speedup gains. Must keep step count low (2-3) to maintain wall-clock advantage.

**Pilot result:** `[PENDING]`
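A minimal sketch of the refinement loop, assuming a hypothetical `draft_forward(context, noise_ids)` wrapper that returns per-position logits of shape `[block_size, vocab]` for one draft pass (numpy stands in for the real tensors):

```python
import numpy as np

def iterative_refine(draft_forward, context, block_size, num_steps=2, mask_id=0):
    """Multi-step denoising sketch: each pass's greedy tokens become the
    next pass's noise tokens. draft_forward is a hypothetical stand-in
    for one DFlash draft forward."""
    noise = np.full(block_size, mask_id, dtype=np.int64)  # pass 1: all-mask block
    for _ in range(num_steps):
        logits = draft_forward(context, noise)   # [block_size, vocab]
        noise = logits.argmax(axis=-1)           # feed greedy tokens back as new "noise"
    return noise
```

With `num_steps=1` this reduces to the current single-pass behavior, so the baseline is recovered exactly.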
## Idea 1 plus: Confidence-Gated Selective Redrafting

**Core Idea:** After the first draft pass, compute per-position entropy of the draft logits. If any position (especially early ones) has high entropy (>threshold), run a second draft pass with the partially-filled block as context. Only replace the high-entropy positions with the second pass's predictions.

**Why it might work:** High entropy at a position signals that the draft model is uncertain — these are the positions most likely to cause rejection. A second pass, now conditioned on a partially-correct draft, can refine exactly these problematic positions.

**Implementation complexity:** Medium. Two draft passes + entropy computation + selective replacement.

**Expected improvement:** +0.5 to +2.0 tau (targeted improvement where it matters most).

**Risk:** Extra compute for the second pass. Entropy threshold needs tuning per dataset/model.

**Pilot result:** `[PENDING]`
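The gate itself is cheap to prototype. A numpy sketch, where the entropy threshold is an assumed starting value that would need tuning:

```python
import numpy as np

def entropy_gate(logits, threshold=1.5):
    """Per-position Shannon entropy of draft logits (nats); positions above
    the threshold are flagged for redrafting."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    entropy = -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)
    return entropy > threshold

def selective_replace(first_pass_tokens, second_pass_tokens, redo_mask):
    """Keep first-pass tokens except where the gate fired."""
    return np.where(redo_mask, second_pass_tokens, first_pass_tokens)
```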
---

## Idea 2: N-Best Draft Proposals (Multi-Candidate Selection)⭐

**Core Idea:** Generate K candidate draft blocks (K=2-4) using different sampling strategies (greedy + temperature-based), then select the candidate with the highest aggregate log-probability under the draft model's own distribution.

**Why it might work:** Exact-match acceptance is binary — a single wrong token kills the entire suffix. By generating multiple candidates and picking the most confident one, we increase the probability that at least one candidate matches the target's greedy output. The confidence score acts as a proxy for "likely to match target."

**Implementation complexity:** Low-Medium. K forward passes per block, simple confidence scoring.

**Expected improvement:** +0.5 to +2.5 tau (especially for "unlucky" blocks where the default greedy choice is wrong).

**Risk:** K times the draft compute cost. Must keep K small. Confidence score may not perfectly correlate with acceptance.

**Pilot result:** `[PENDING]`
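Candidate selection can be sketched as scoring each block under the draft's own log-probabilities (function names here are illustrative, not the real API):

```python
import numpy as np

def log_softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def pick_most_confident(candidates, candidate_logits):
    """candidates: K token arrays; candidate_logits: K [block, vocab] arrays.
    Returns the candidate with the highest aggregate log-probability."""
    scores = [
        log_softmax(lg)[np.arange(len(tok)), tok].sum()
        for tok, lg in zip(candidates, candidate_logits)
    ]
    return candidates[int(np.argmax(scores))]
```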
---

## Idea 6: Token Recycling / Warm-Start Drafting⭐⭐⭐

**Core Idea:** When rejection occurs at position j in a block of B tokens, the rejected tokens at positions j+1..B are discarded. Instead, save these tokens and use them to warm-start the noise embeddings of the next draft block. This gives the draft model a better starting point than random mask tokens.

**Why it might work:** Even though the prefix was wrong, later tokens in the rejected draft may still carry useful distributional information about the continuation. Using them as initial noise (instead of mask tokens) gives the draft model more context for its single-pass prediction.

**Implementation complexity:** Low. Save rejected suffix, inject into next block's initial embeddings.

**Expected improvement:** +0.3 to +1.0 tau (modest, since the recycled tokens are conditioned on a wrong prefix).

**Risk:** Recycled tokens may actually mislead the draft model if they were generated from a very different prefix. Net effect could be negative.

**Pilot result:** `[PENDING]`
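A sketch of the warm start at the token-id level; `mask_id` and the `rejected_suffix` handoff are assumed details of the surrounding decode loop:

```python
import numpy as np

def warm_start_noise(rejected_suffix, block_size, mask_id=0):
    """Seed the next block's noise ids with the suffix tokens that were
    just rejected, instead of an all-mask block; leftover slots stay masked."""
    noise = np.full(block_size, mask_id, dtype=np.int64)
    n = min(len(rejected_suffix), block_size)
    noise[:n] = rejected_suffix[:n]
    return noise
```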
---

## Idea 9: Dynamic Target Layer Selection

**Core Idea:** Instead of always extracting features from the same 5 fixed target layers, try alternative layer selections (e.g., shifted by +2 or -2) and pick the one that produces the highest-confidence draft. Different parts of the sequence may benefit from different layers.

**Why it might work:** The paper's ablation (Table 5) shows that layer selection affects acceptance length. The optimal layers may vary by position in the sequence or by the type of content being generated. Late layers have more "final answer" information; early layers have more syntactic/structural information.

**Implementation complexity:** Medium. Multiple draft passes with different layer configs + scoring.

**Expected improvement:** +0.3 to +1.5 tau (if the fixed layers are suboptimal for certain content types).

**Risk:** The draft model's fc projection was trained on specific layer combinations. Using different layers degrades the learned alignment. Needs the fc layer to generalize.

**Pilot result:** `[PENDING]`
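A sketch of the selection loop, assuming a hypothetical `draft_forward(context, cfg)` that reruns the draft with a given layer set and returns block logits; mean max-probability serves as the confidence score:

```python
import numpy as np

def pick_layer_config(draft_forward, layer_configs, context):
    """Try each candidate layer set (e.g. the default and its ±2 shifts)
    and keep the one whose draft block is most confident."""
    best_cfg, best_conf = None, -np.inf
    for cfg in layer_configs:
        logits = draft_forward(context, cfg)              # [block, vocab]
        z = logits - logits.max(axis=-1, keepdims=True)
        p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
        conf = p.max(axis=-1).mean()                      # mean max-prob
        if conf > best_conf:
            best_cfg, best_conf = cfg, conf
    return best_cfg
```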
---

## Idea 11: Top-K Constrained Draft Sampling⭐
**Core Idea:** Apply top-k filtering to draft logits before sampling, zeroing out all but the top-k tokens at each position. This forces the draft to choose among only the most probable tokens.

**Why it might work:** For exact-match acceptance under greedy target decoding, only the target's argmax token matters. By restricting the draft's vocabulary to its own top-k, we reduce the chance of sampling a low-probability token that definitely won't match the target.

**Implementation complexity:** Very low. Single top-k operation on logits.

**Expected improvement:** +0.1 to +0.5 tau (minor, since greedy draft already picks argmax; mainly helps with stochastic target).

**Risk:** Under greedy draft + greedy target, this is a no-op. Only helps when draft uses non-zero temperature.

**Pilot result:** `[PENDING]`
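The filtering step is a one-liner per position. A numpy sketch, masking everything outside the top-k to `-inf` so it can never be sampled:

```python
import numpy as np

def topk_filter(logits, k=10):
    """Keep only the top-k logits per position; everything else -> -inf."""
    out = np.full_like(logits, -np.inf)
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]   # indices of top-k
    np.put_along_axis(out, idx, np.take_along_axis(logits, idx, axis=-1), axis=-1)
    return out
```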
---

## Idea 12: Position-Weighted Logit Scaling⭐⭐

**Core Idea:** Scale draft logits by a position-dependent factor: early positions get more aggressive scaling (sharper distribution = higher confidence), later positions get gentler scaling. Rationale: early positions matter most for prefix-based acceptance.

**Why it might work:** By sharpening early positions, we increase the probability that positions 1-3 are correct (the most critical for tau). Later positions can afford to be less sharp since they only matter if all earlier positions are accepted.

**Implementation complexity:** Very low. Multiply logits by a position-dependent vector.

**Expected improvement:** +0.2 to +1.0 tau.

**Risk:** Over-sharpening may concentrate probability on a wrong token. Needs careful calibration of the scaling schedule.

**Pilot result:** `[PENDING]`
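A sketch with a linear temperature ramp; the 0.5→1.0 schedule is an arbitrary starting point, not a calibrated choice:

```python
import numpy as np

def position_scaled_logits(logits, t_start=0.5, t_end=1.0):
    """Per-position temperature ramp over a [block_size, vocab] logit block:
    sharp (low T) at early positions, unchanged (T=1) at the end."""
    block_size = logits.shape[0]
    temps = np.linspace(t_start, t_end, block_size)
    return logits / temps[:, None]
```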
---

## Bonus Ideas (Not Yet Implemented)

### Idea 13: Tree-Structured Verification
Verify multiple candidate continuations in a single batched target forward pass using packed attention with tree causal masks. This doesn't improve tau per-candidate but amortizes the verification cost across candidates, enabling higher effective throughput. Very promising for combining with N-best or beam approaches.

### Idea 16: Draft-Target KL Alignment via Inference-Time Calibration⭐⭐⭐
Compute a lightweight calibration mapping (affine transform on draft logits) by running a small calibration set and measuring draft vs target token agreement. Apply this calibration at inference time without retraining.

### Idea 17: Multi-Block Pipelining
Overlap the draft and verification phases across blocks. While the target model verifies block k, the draft model starts working on block k+1 using a speculative target_hidden extrapolation. If the speculation was right, the pipeline stays full.
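For Idea 16, a minimal sketch fitting one global scale/bias pair by least squares on paired draft/target logits; the real mapping could be richer (e.g. per-position or per-logit), so this only illustrates the affine-calibration step:

```python
import numpy as np

def fit_affine_calibration(draft_logits, target_logits):
    """Least-squares fit of a, b with a * draft + b ~= target over a small
    calibration set; apply the (a, b) pair at inference time."""
    x = draft_logits.ravel()
    y = target_logits.ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)   # design matrix [x | 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b
```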
---

## Experiment Configuration

| Parameter | Value |
|-----------|-------|
| Target model | Qwen/Qwen3-4B |
| Draft model | z-lab/Qwen3-4B-DFlash-b16 |
| Block size | 16 |
| Dataset | math500 |
| Max samples | 10 |
| Max new tokens | 512 |
| Temperature | 0.0 (greedy) |
| GPU | NVIDIA H200 (single GPU) |
| Attention | SDPA |

## Results Summary

| # | Method | Avg tau | Delta | Pilot Signal |
|---|--------|---------|-------|--------------|
| 0 | **Baseline** | **8.63** | - | - |
| 1 | Iterative Refinement (2 steps) | `[PENDING]` | | |
| 2 | Iterative Refinement (3 steps) | `[PENDING]` | | |
| 3 | N-Best Draft (K=2) | `[PENDING]` | | |
| 4 | N-Best Draft (K=3) | `[PENDING]` | | |
| 5 | Adaptive Block Size (4-16) | `[PENDING]` | | |
| 6 | Early-Position Beam (width=3) | `[PENDING]` | | |
| 7 | Draft Temp t=0.3 | `[PENDING]` | | |
| 8 | Draft Temp t=0.1 | `[PENDING]` | | |
| 9 | Token Recycling | `[PENDING]` | | |
| 10 | Selective Redraft (ent>1.5) | `[PENDING]` | | |
| 11 | Selective Redraft (ent>1.0) | `[PENDING]` | | |
| 12 | Majority Vote (K=3) | `[PENDING]` | | |
| 13 | Majority Vote (K=5) | `[PENDING]` | | |
| 14 | Shifted Target Layers (+2) | `[PENDING]` | | |
| 15 | Logit Averaging (2 pass) | `[PENDING]` | | |
| 16 | Logit Averaging (3 pass) | `[PENDING]` | | |
| 17 | Top-K Constrained (k=10) | `[PENDING]` | | |
| 18 | Position-Weighted Temp | `[PENDING]` | | |

---

*Generated 2026-04-01. Experiments running on NVIDIA H200, dflash conda env.*
datasets/_workspace_hanrui_datasets_HuggingFaceH4___aime_2024_default_0.0.0_2fe88a2f1091d5048c0f36abc874fb997b3dd99a.lock ADDED
File without changes
datasets/_workspace_hanrui_datasets_MathArena___aime_2025_default_0.0.0_beca2d7875cf92cdac07acefbccad3c4d16e2916.lock ADDED
File without changes
datasets/_workspace_hanrui_datasets_google-research-datasets___mbpp_sanitized_0.0.0_4bb6404fdc6cacfda99d4ac4205087b89d32030c.lock ADDED
File without changes
datasets/_workspace_hanrui_datasets_json_default-3ab01998402731b9_0.0.0_c181ad2be84b86e0b75142bbe88bda3f4906d051ee75b5ff536a5dba0ffbe8f2.lock ADDED
File without changes
datasets/_workspace_hanrui_datasets_princeton-nlp___swe-bench_lite_default_0.0.0_6ec7bb89b9342f664a54a6e0a6ea6501d3437cc2.lock ADDED
File without changes
datasets/_workspace_hanrui_datasets_tatsu-lab___alpaca_default_0.0.0_dce01c9b08f87459cf36a430d809084718273017.lock ADDED
File without changes
datasets/download_nemotron_codealpha.sh ADDED
@@ -0,0 +1,10 @@
#!/bin/bash

export HF_TOKEN="YOUR_HF_TOKEN_HERE"
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_HUB_VERBOSITY=debug

hf download \
    --repo-type dataset \
    --local-dir /workspace/hanrui/datasets/Nemotron-CodeAlpaca-qwen3-8b-800K \
    eigen-ai-labs/Nemotron-CodeAlpaca-qwen3-8b-800K
manage_subgits.sh ADDED
@@ -0,0 +1,87 @@
#!/bin/bash
# Manage the .git folders in subdirectories: back up, delete, restore
# Usage:
#   ./manage_subgits.sh backup  - back up and delete .git in subdirectories
#   ./manage_subgits.sh restore - restore .git from the backups

set -euo pipefail

cd "$(dirname "$0")"

BACKUP_DIR=".git_backups"
MANIFEST="$BACKUP_DIR/manifest.txt"

backup() {
    if [ -d "$BACKUP_DIR" ]; then
        echo "❌ Backup directory $BACKUP_DIR already exists; restore or delete it first"
        exit 1
    fi

    mkdir -p "$BACKUP_DIR"
    > "$MANIFEST"

    count=0
    while IFS= read -r gitdir; do
        count=$((count + 1))
        echo "$count|$gitdir" >> "$MANIFEST"

        echo "📦 Backing up: $gitdir"
        cp -a "$gitdir" "$BACKUP_DIR/$count"

        echo "🗑️ Deleting: $gitdir"
        rm -rf "$gitdir"
    done < <(find . -mindepth 2 -name ".git" -not -path "./$BACKUP_DIR/*" | sort)

    if [ "$count" -eq 0 ]; then
        rm -rf "$BACKUP_DIR"
        echo "ℹ️ No .git found in subdirectories; nothing to do"
    else
        echo ""
        echo "✅ Done! Backed up and deleted $count .git folder(s)"
        echo "📁 Backups stored in: $BACKUP_DIR/"
        echo "👉 After the upload finishes, run: $0 restore"
    fi
}

restore() {
    if [ ! -f "$MANIFEST" ]; then
        echo "❌ Manifest $MANIFEST not found; nothing to restore"
        exit 1
    fi

    count=0
    while IFS='|' read -r id gitdir; do
        if [ ! -d "$BACKUP_DIR/$id" ]; then
            echo "⚠️ Skipping: backup #$id does not exist ($gitdir)"
            continue
        fi

        mkdir -p "$(dirname "$gitdir")"

        echo "♻️ Restoring: $gitdir"
        cp -a "$BACKUP_DIR/$id" "$gitdir"
        count=$((count + 1))
    done < "$MANIFEST"

    rm -rf "$BACKUP_DIR"

    echo ""
    echo "✅ Done! Restored $count .git folder(s)"
    echo "🧹 Backup directory cleaned up"
}

case "${1:-}" in
    backup)
        backup
        ;;
    restore)
        restore
        ;;
    *)
        echo "Usage: $0 {backup|restore}"
        echo ""
        echo "  backup  - back up all .git in subdirectories, then delete them"
        echo "  restore - restore all .git from the backups"
        exit 1
        ;;
esac
nohup.out ADDED
@@ -0,0 +1,48 @@
/workspace/miniconda3/envs/dflash/bin/python3: can't open file '/workspace/hanrui/ ': [Errno 2] No such file or directory
E0317 16:57:14.100000 140364991186752 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 14058) of binary: /workspace/miniconda3/envs/dflash/bin/python3
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-03-17_16:57:14
  host      : job-006ce80a7c47-20260302193512-5dcd4c9bbd-gfjsn
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 14058)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
usage: run.py [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
              [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT]
              [--rdzv-id RDZV_ID] [--rdzv-conf RDZV_CONF] [--standalone]
              [--max-restarts MAX_RESTARTS]
              [--monitor-interval MONITOR_INTERVAL]
              [--start-method {spawn,fork,forkserver}] [--role ROLE] [-m]
              [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
              [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER]
              [--node-rank NODE_RANK] [--master-addr MASTER_ADDR]
              [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
              [--logs-specs LOGS_SPECS]
              training_script ...
run.py: error: the following arguments are required: training_script, training_script_args
progress/dflash_lora_changelog.md ADDED
@@ -0,0 +1,232 @@
# DFlash LoRA Full Change Log

## Overview

To get Qwen3-8B DFlash LoRA training running on 2×H100 (resolving OOM), **5 files were added or modified, 1084 lines of code** in total. The changes fall into two stages: initial scaffolding + OOM fixes.

---

## New Files

| File | Lines | Purpose |
|------|------|------|
| `specforge/core/dflash_lora.py` | 453 | Training wrapper (OnlineDFlashLoRAModel) |
| `specforge/modeling/draft/dflash_lora.py` | 141 | LoRA draft model (DFlashLoRADraftModel) |
| `scripts/train_dflash_lora.py` | 449 | Training entry script |
| `scripts/run_train_dflash_lora.sh` | 31 | Launch shell script |
| `configs/qwen3-8b-dflash-lora.json` | 10 | LoRA config file |

---

## Step 1 Walkthrough

### 1.1 Analyzing the existing code

First, the full pipeline of the non-LoRA `train_dflash.py` was analyzed:

```
input_ids → target_model.generate_dflash_data() → hidden_states
         → OnlineDFlashModel.forward():
             1. truncate to a block boundary
             2. prepare_noise_input(): keep the anchor, replace the rest with MASK
             3. embed_tokens(noise_input_ids) → noise_embedding
             4. build the DFlash attention mask
             5. draft_model(noise_embedding, target_hidden, mask)
             6. lm_head(hidden) → logits → CE loss
```

The non-LoRA version uses a separate small draft model plus a frozen target model to extract hidden states.

### 1.2 Design differences for the LoRA version

| Aspect | Non-LoRA (`train_dflash.py`) | LoRA (`train_dflash_lora.py`) |
|------|------|------|
| Draft model | Custom small model (1-10 layers) | Qwen3-8B + PEFT LoRA |
| Target model | Frozen large model extracts hidden states | Not needed — the model uses its own representations |
| Attention | Custom Qwen3DFlashAttention, KV = [ctx, noise] concat | Standard HF attention + DFlash mask |
| KV layout | Q_LEN = noise_len, KV_LEN = 2×noise_len | Q_LEN = KV_LEN = seq_len |
| Trainable params | All draft model parameters | LoRA only (q/k/v/o_proj) |

### 1.3 Three new core files for the LoRA version

#### `specforge/modeling/draft/dflash_lora.py` — DFlashLoRADraftModel

- `from_pretrained()`: loads Qwen3-8B, injects PEFT LoRA, supports an `attn_implementation` argument
- `forward()`: standard HF forward, supports an `output_hidden_states` argument (needed for chunked loss)
- `get_lm_head()`: reaches through the PEFT hierarchy to get the lm_head reference
- `gradient_checkpointing_enable()`: proxies to the underlying model
- `save_pretrained()`: saves only the LoRA adapter weights

#### `specforge/core/dflash_lora.py` — OnlineDFlashLoRAModel

- `prepare_noise_input()`: context part unchanged; inside each block only the anchor is kept, the rest are replaced with MASK
- `build_dflash_full_attn_mask_fast()`: vectorized construction of the 4D additive mask `[bsz, 1, seq, seq]`
- `_compute_loss_weights()`: weight 0 for context + anchor, weight 1 (or decayed) for non-anchor positions
- `_full_lm_loss()`: standard CE loss path
- `_compute_accuracy()`: block-wise acceptance rate (cumulative correctly-predicted length / non-anchor block length)
- `forward()`: full training forward pass

Mask rules for the LoRA version:
- context token i → causal attention (j ≤ i)
- block token i (in block b) → all context + bidirectional attention within the same block

#### `scripts/train_dflash_lora.py` — training script

- Argument parsing: 7 groups — model/lora/dataset/training/output/distributed/tracker
- `build_model()`: loads the model + injects LoRA + wraps OnlineDFlashLoRAModel
- `build_dataloader()`: reuses `build_eagle3_dataset` and `prepare_dp_dataloaders`
- FSDP wrapping + BF16Optimizer
- Training loop: forward → backward → accumulation → optimizer step
- checkpoint save/resume

---

## OOM Fixes (4 changes)

### Change 1: FSDP FULL_SHARD (ZeRO-3)

**Problem**: `SHARD_GRAD_OP` (ZeRO-2) keeps the full Qwen3-8B parameters (~16GB bf16) on every GPU

**Fix**: `train_dflash_lora.py:362`
```python
# before
sharding_strategy=ShardingStrategy.SHARD_GRAD_OP
# after
sharding_strategy=ShardingStrategy.FULL_SHARD
```

**Effect**: parameters sharded across GPUs, saving ~8-12GB per GPU

### Change 2: batch_size=1 + accumulation_steps=8

**Problem**: peak GPU memory too high with `batch_size=2`

**Fix**: `run_train_dflash_lora.sh`
```bash
--batch-size 1 \
--accumulation-steps 8 \
```

**Effect**: effective global batch size unchanged, peak memory halved

### Change 3: flex_attention + BlockMask instead of the 4D additive mask

**Problem**: SDPA does not support 4D additive masks → falls back to the math backend → every layer materializes the full `[bsz, 32heads, 2048, 2048]` attention scores

**Fix**: ported the `_get_or_create_block_mask()` method from the non-LoRA `dflash.py`, adapted for the LoRA setting

Files involved:

1. **`specforge/core/dflash_lora.py`**
   - `__init__()`: added an `attention_backend` argument (default `"flex_attention"`) and a BlockMask cache field
   - new `_get_or_create_block_mask()`: builds a zero-memory BlockMask with `create_block_mask()`
   - `forward()`: chooses BlockMask or additive mask depending on `attention_backend`

2. **`specforge/modeling/draft/dflash_lora.py`**
   - `from_pretrained()`: when the backend is flex_attention, passes `attn_implementation="flex_attention"` to HuggingFace

3. **`scripts/train_dflash_lora.py`**
   - `parse_args()`: `--attention-backend` argument (`flex_attention` | `additive`)
   - `build_model()`: picks `attn_implementation` based on the backend

BlockMask mask function (LoRA version):
```python
def dflash_lora_mask_fn(b, h, q_idx, kv_idx):
    # Context query: standard causal
    is_q_ctx = q_idx < context_len
    ctx_visible = is_q_ctx & (kv_idx <= q_idx)

    # Block query: all context + bidirectional within the same block
    is_q_block = q_idx >= context_len
    is_k_ctx = kv_idx < context_len
    q_block_id = (q_idx - context_len) // block_size
    k_block_id = (kv_idx - context_len) // block_size
    block_attend_ctx = is_q_block & is_k_ctx
    block_attend_same = is_q_block & (~is_k_ctx) & (q_block_id == k_block_id)

    return ctx_visible | (block_attend_ctx | block_attend_same)
```

**Verification**: compared BlockMask against the additive mask element by element; the patterns match exactly across three test settings (context_len=4/0, seq=12/16/64).

**Effect**: no more fallback to the SDPA math backend, eliminating the `[bsz, heads, seq, seq]` attention-score memory

### Change 4: chunked cross-entropy loss

**Problem**: `[bsz, 2048, 151936]` bf16 logits ≈ 1.18GB, plus gradients ~2.4GB+

**Fix**: ported the chunked loss from the non-LoRA `dflash.py:419-478`

Files involved:

1. **`specforge/core/dflash_lora.py`**
   - `__init__()`: added an `lm_head_chunk_size` argument (default 0 = disabled)
   - new `_chunked_lm_loss()`: runs lm_head + CE loss chunk by chunk with gradient checkpointing
   - extracted `_full_lm_loss()`: the original non-chunked path
   - `forward()`: takes the chunked path when `lm_head_chunk_size > 0`

2. **`specforge/modeling/draft/dflash_lora.py`**
   - `forward()`: new `output_hidden_states` argument; when True, returns the last hidden state instead of logits
   - `get_lm_head()`: reaches through the PEFT hierarchy to return the `base_model.lm_head` reference

3. **`scripts/train_dflash_lora.py`**
   - `parse_args()`: `--lm-head-chunk-size` argument (default 0, recommended 256)
   - `build_model()`: passed through to OnlineDFlashLoRAModel

Core chunked-loss logic:
```python
# Compute chunk by chunk; each chunk uses gradient checkpointing
# (logits are recomputed during backward instead of being stored)
for start in range(0, effective_len, chunk_size):
    end = min(start + chunk_size, effective_len)
    chunk_loss, chunk_weight = grad_checkpoint(
        _chunk_ce,                    # lm_head + CE
        hidden[:, start:end, :],      # only the current chunk
        input_ids[:, start:end],
        combined_mask[:, start:end],
        use_reentrant=False,
    )
    total_loss += chunk_loss
    total_weight += chunk_weight
loss = total_loss / total_weight
```

**Effect**: peak logits memory drops from `O(seq_len × vocab_size)` to `O(chunk_size × vocab_size)`; with chunk size 256 → ~150MB vs 1.18GB

---

## Current Training Command

```bash
bash run_train_dflash_lora.sh 2   # 2 = number of GPUs
```

Equivalent full command:
```bash
torchrun --nproc_per_node 2 scripts/train_dflash_lora.py \
    --model-path /workspace/Qwen3-8B \
    --train-data-path /workspace/hanrui/datasets/Nemotron-CodeAlpaca-qwen3-8b-800K \
    --output-dir outputs/qwen3-8b-dflash-lora \
    --lora-config configs/qwen3-8b-dflash-lora.json \
    --block-size 16 \
    --max-length 2048 \
    --batch-size 1 \
    --num-epochs 3 \
    --learning-rate 2e-4 \
    --accumulation-steps 8 \
    --loss-decay-gamma 7 \
    --attention-backend flex_attention \
    --lm-head-chunk-size 256 \
    --gradient-checkpointing \
    --chat-template qwen \
    --log-interval 50 \
    --save-interval 500
```

---

## To Verify

- [ ] Run `bash run_train_dflash_lora.sh 2` and confirm no more OOM
- [ ] Confirm there is no SDPA math fallback warning
- [ ] Watch peak GPU memory
- [ ] Confirm the loss decreases and accuracy rises as expected
progress/list.md ADDED
@@ -0,0 +1,12 @@
### 1. `train_dflash_lora.py`
* Added LoRA; previously a small draft model was called, now prediction uses hidden states + LoRA.
* The `dflash_lora_mask_fn` function lets every token in the draft block being predicted see all other tokens in that block.

### 2. OOM optimizations
* Sharding strategy ZeRO-3: FSDP sharding upgraded from `SHARD_GRAD_OP` to `FULL_SHARD`.
* `batch-size=1`, `accumulation-steps=8`.
* FlexAttention (`dflash_lora_mask_fn`), following the earlier code.
* `_chunked_lm_loss()`: the loss is computed in chunks of 256 + gradient checkpointing.

### Run
* bash /workspace/hanrui/junquan/SpecForge/scripts/run_train_dflash_lora.sh 2
progress/oom_fix_progress.md ADDED
@@ -0,0 +1,42 @@
+ # DFlash LoRA OOM Fix Log
+
+ ## OOM root-cause analysis
+
+ 1. **SHARD_GRAD_OP (ZeRO-2)** — each GPU holds the full Qwen3-8B parameters (~16GB bf16); parameters are not sharded
+ 2. **SDPA + 4D additive mask** — FlashAttention does not support 4D additive masks, so attention falls back to the math backend, which materializes the full attention scores per layer (`bsz × 32heads × 2048 × 2048`)
+ 3. **Large vocab logits** — `[bsz, 2048, 151936]` bf16 ≈ 1.18GB; with gradients and the boolean-indexing copy, peak usage is ~3-4GB
+ 4. **The machine has only 2 H100s**, but the script defaults to `NUM_GPUS=4`
+
+ ## Completed changes
+
+ ### 1. FSDP sharding changed to FULL_SHARD (ZeRO-3)
+ - File: `SpecForge/scripts/train_dflash_lora.py:347`
+ - `ShardingStrategy.SHARD_GRAD_OP` → `ShardingStrategy.FULL_SHARD`
+ - Effect: parameters sharded across GPUs, saving ~8-12GB per GPU
+
+ ### 2. Lower batch-size, raise accumulation-steps
+ - File: `SpecForge/scripts/run_train_dflash_lora.sh`
+ - `--batch-size 2` → `1`, `--accumulation-steps 4` → `8`
+ - Effect: effective global batch size unchanged, peak memory halved
+
+ ## To verify / follow-up optimizations
+
+ - [ ] Launch with `bash run_train_dflash_lora.sh 2` to ensure 2 GPUs are used
+ - [x] If OOM persists, use a chunked cross-entropy loss to avoid materializing the full vocab logits
+ - [x] Longer term, explore a custom attention kernel with block-sparse mask support to bypass the SDPA math fallback
+
+ ### 3. flex_attention + BlockMask replaces the 4D additive mask
+ - Files: `SpecForge/specforge/core/dflash_lora.py`, `specforge/modeling/draft/dflash_lora.py`, `scripts/train_dflash_lora.py`
+ - Ported the `_get_or_create_block_mask()` method from the non-LoRA `dflash.py`, adapted to the LoRA setting (Q_LEN == KV_LEN == seq_len)
+ - LoRA-version mask: causal over the context + bidirectional within each block (the non-LoRA version uses [context, noise] concatenated as KV)
+ - Enabled via `--attention-backend flex_attention` (the default); `--attention-backend additive` falls back to the original 4D mask
+ - The HuggingFace model is loaded with `attn_implementation="flex_attention"`
+ - Effect: no more fallback to the SDPA math backend, saving the `[bsz, heads, seq, seq]` attention-scores memory
+
+ ### 4. Chunked cross-entropy loss
+ - Files: `SpecForge/specforge/core/dflash_lora.py`, `specforge/modeling/draft/dflash_lora.py`, `scripts/train_dflash_lora.py`
+ - Ported the `_chunked_lm_loss()` method from the non-LoRA `dflash.py`
+ - Runs lm_head + CE loss chunk by chunk with gradient checkpointing, avoiding materialization of the full `[bsz, seq, vocab]` logits
+ - Enabled via `--lm-head-chunk-size 256` (default 0 = disabled)
+ - `DFlashLoRADraftModel.forward()` gains an `output_hidden_states` parameter and returns hidden states when chunking
+ - Effect: peak logits memory drops from O(seq_len × vocab_size) to O(chunk_size × vocab_size)
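The memory figures in the root-cause list follow from simple size arithmetic. A quick sketch (bsz=2 reflects the pre-fix `--batch-size 2`; the log's ~1.18GB estimate is in the same ballpark):

```python
def tensor_bytes(shape, bytes_per_elem=2):  # bf16 = 2 bytes per element
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem

# Full-vocab logits for the pre-fix --batch-size 2 run:
logits = tensor_bytes((2, 2048, 151936))       # [bsz, seq, vocab]
# Per-layer attention scores under the SDPA math-backend fallback:
scores = tensor_bytes((2, 32, 2048, 2048))     # [bsz, heads, seq, seq]
print(f"logits: {logits / 2**30:.2f} GiB")     # ~1.16 GiB, forward only
print(f"attention scores per layer: {scores / 2**30:.2f} GiB")
```

Gradients double the logits footprint, and the boolean-indexing copy adds another full-size allocation, which is how the peak reaches the ~3-4GB range.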
progress/requirements.txt ADDED
@@ -0,0 +1,20 @@
+ torch==2.4.0
+ torchaudio==2.4.0
+ torchvision==0.19.0
+ transformers==4.57.1
+ accelerate
+ datasets
+ peft
+ numpy
+ tqdm
+ wandb
+ psutil
+ pydantic
+ packaging
+ ninja
+ setuptools
+ yunchang
+ sglang<=0.4.10
+ openai-harmony
+ qwen-vl-utils==0.0.11
+ pre-commit
progress/step1.md ADDED
@@ -0,0 +1,139 @@
+ Now I have a complete picture. Here's the full analysis and proposed plan.
+
+   ---
+   Current Logic of train_dflash.py
+
+   This script trains a small, standalone DFlash draft model (1–10 transformer layers) that learns to predict token blocks in parallel,
+   guided by hidden states extracted from a frozen target model.
+
+   Architecture (two separate models)
+
+   1. Target model (frozen, e.g. Qwen3-8B/32B) — runs a full forward pass to produce hidden states at selected layers.
+   2. Draft model (DFlashDraftModel, ~1–10 layers) — a lightweight Qwen3-based decoder that takes noise embeddings + target hidden states
+   and predicts the block tokens.
+   3. Target embed_tokens + lm_head — loaded separately via TargetEmbeddingsAndHead to avoid duplicating the full target model in memory.
+
+   Key locations
+
+   ┌──────────────────────────┬────────────────────────────────────┬───────────────────────────────────────────────────────┐
+   │        Component         │                File                │                         Lines                         │
+   ├──────────────────────────┼────────────────────────────────────┼───────────────────────────────────────────────────────┤
+   │ Model init               │ scripts/train_dflash.py            │ build_models() L254–311                               │
+   ├──────────────────────────┼────────────────────────────────────┼───────────────────────────────────────────────────────┤
+   │ Target hidden extraction │ scripts/train_dflash.py            │ L644–647 (target_model.generate_dflash_data)          │
+   ├──────────────────────────┼────────────────────────────────────┼───────────────────────────────────────────────────────┤
+   │ Forward pass             │ specforge/core/dflash.py           │ OnlineDFlashModel.forward() L243–332                  │
+   ├──────────────────────────┼────────────────────────────────────┼───────────────────────────────────────────────────────┤
+   │ Loss calculation         │ specforge/core/dflash.py           │ _full_lm_loss() L382–417, _chunked_lm_loss() L419–478 │
+   ├──────────────────────────┼────────────────────────────────────┼───────────────────────────────────────────────────────┤
+   │ Loss mask                │ specforge/core/dflash.py           │ create_dflash_loss_mask() L481–509                    │
+   ├──────────────────────────┼────────────────────────────────────┼───────────────────────────────────────────────────────┤
+   │ Draft model architecture │ specforge/modeling/draft/dflash.py │ DFlashDraftModel L212–266                             │
+   ├──────────────────────────┼────────────────────────────────────┼───────────────────────────────────────────────────────┤
+   │ DFlash attention         │ specforge/modeling/draft/dflash.py │ Qwen3DFlashAttention L42–134                          │
+   └──────────────────────────┴────────────────────────────────────┴───────────────────────────────────────────────────────┘
+
+   Forward pass flow (per training step)
+
+   input_ids, attention_mask, loss_mask  →  target_model.generate_dflash_data()
+                                                 ↓
+                                        hidden_states (from target layers [1,9,17,25,33])
+                                                 ↓
+                                 OnlineDFlashModel.forward():
+                                   1. Truncate to block boundary
+                                   2. prepare_noise_input(): anchor tokens kept, rest → MASK
+                                   3. embed_tokens(noise_input_ids) → noise_embedding
+                                   4. Build DFlash attention mask (flex_attention or additive)
+                                   5. draft_model(noise_embedding, target_hidden, mask)
+                                   6. lm_head(hidden) → logits
+                                   7. CE loss on non-anchor positions (weighted by loss_mask × decay)
+
+   The draft model's custom Qwen3DFlashAttention concatenates [context_hidden, noise_hidden] as KV, with queries only from noise tokens. The
+    attention mask enforces: block tokens see all preceding blocks' context + bidirectional within their own block.
+
+   ---
+   What already exists: train_dflash_lora.py
+
+   Interestingly, the repo already has a LoRA variant at scripts/train_dflash_lora.py with its own model (DFlashLoRADraftModel) and wrapper
+   (OnlineDFlashLoRAModel). This is exactly the approach you described — Qwen3-8B + LoRA, no separate target model, 1-step diffusion
+   training. The key differences from train_dflash.py:
+
+   ┌─────────────────┬─────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
+   │     Aspect      │                       train_dflash.py                       │                train_dflash_lora.py                │
+   ├─────────────────┼─────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
+   │ Draft model     │ Small custom DFlashDraftModel (1–10 layers)                 │ Full Qwen3-8B + LoRA adapters                      │
+   ├─────────────────┼─────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
+   │ Target model    │ Separate frozen model for hidden state extraction           │ None — model uses its own representations          │
+   ├─────────────────┼─────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
+   │ Attention       │ Custom Qwen3DFlashAttention (Q from noise, KV from [ctx,    │ Standard HF attention with 4D additive DFlash mask │
+   │                 │ noise])                                                     │                                                    │
+   ├─────────────────┼─────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
+   │ Forward         │ draft_model(noise_emb, target_hidden, mask)                 │ model(noise_input_ids, 4d_mask, position_ids) →    │
+   │                 │                                                             │ logits                                             │
+   ├─────────────────┼─────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
+   │ Trainable       │ All draft model params                                      │ Only LoRA (q/k/v/o_proj), base frozen              │
+   │ params          │                                                             │                                                    │
+   ├─────────────────┼─────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
+   │ FSDP strategy   │ SHARD_GRAD_OP                                               │ FULL_SHARD                                         │
+   └─────────────────┴─────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
+
+   ---
+   Proposed Modification Plan
+
+   Since train_dflash_lora.py already implements the core idea, the plan focuses on what's missing or needs improvement to make it a proper
+   "1-step dLLM draft model" for your research:
+
+   Phase 1: Validate and extend the existing LoRA pipeline
+
+   1. Add MLP to LoRA targets — The current config only targets q_proj, k_proj, v_proj, o_proj. For stronger 1-step diffusion capability,
+   add gate_proj, up_proj, down_proj to lora_target_modules. This gives the model more capacity to learn the non-autoregressive distribution
+    shift.
+   2. Add multi-step noise schedule support — Currently the training is strictly 1-step (all non-anchors → MASK). For a proper diffusion/AR
+   fusion, add an option for a noise schedule where a fraction of block tokens are revealed (not just the anchor), controlled by a
+   noise_ratio parameter. This would modify prepare_noise_input() in OnlineDFlashLoRAModel:
+   # Instead of: all non-anchor → MASK
+   # Allow: randomly keep some non-anchor tokens with probability (1 - noise_ratio)
+   3. Add configurable context_len strategy — Currently context_len=0 treats the whole sequence as blocks. Add a --context-ratio arg that
+   dynamically sets context_len as a fraction of the sequence, so the model learns to condition on varying amounts of AR-decoded prefix.
+
+   Phase 2: Training logic improvements
+
+   4. Add KL divergence loss — In addition to CE loss against ground truth, add an optional KL loss against the base model's AR distribution
+    (teacher forcing). This regularizes the LoRA model to stay close to the original Qwen3-8B distribution. Modify
+   OnlineDFlashLoRAModel.forward():
+   # Compute base model logits (no_grad, no LoRA) as teacher
+   # KL(draft_logits || teacher_logits) on block positions
+   # total_loss = ce_loss + kl_weight * kl_loss
+   5. Add evaluation with speculative decoding metrics — The current accuracy metric is block-wise acceptance rate. Add an eval loop that
+   actually runs speculative decoding (draft → verify) to measure real speedup, using the LoRA model as the drafter and the base model (with
+    LoRA disabled) as the verifier.
+
+   Phase 3: Integration with train_dflash.py style features
+
+   6. Port random anchor sampling — train_dflash.py has --random-anchor (L147–156) which samples diverse anchor positions per sequence. This
+    is missing from the LoRA variant and would improve training diversity. Port _sample_anchor_positions and _build_blocks_from_anchors from
+    OnlineDFlashModel to OnlineDFlashLoRAModel.
+   7. Port chunked lm_head — The LoRA variant materializes full [bsz, seq_len, vocab_size] logits. For Qwen3-8B's 152K vocab, this is ~1.2GB
+    per sample at seq_len=2048. Port the _chunked_lm_loss from OnlineDFlashModel for memory efficiency.
+   8. Add tensor parallelism support — The LoRA script currently forces tp_size=1. For Qwen3-8B on multi-GPU setups, add TP support to shard
+    the base model across GPUs while keeping LoRA params on each rank.
+
+   Minimal starting point
+
+   If you want to just get training running immediately, the existing train_dflash_lora.py + configs/qwen3-8b-dflash-lora.json should work
+   out of the box:
+
+   torchrun --nproc_per_node=4 scripts/train_dflash_lora.py \
+     --model-path Qwen/Qwen3-8B \
+     --lora-config configs/qwen3-8b-dflash-lora.json \
+     --train-data-path /path/to/data.jsonl \
+     --output-dir ./outputs/qwen3-8b-dflash-lora \
+     --block-size 16 \
+     --gradient-checkpointing \
+     --learning-rate 2e-4 \
+     --batch-size 1 \
+     --max-length 2048 \
+     --num-epochs 3
+
+   Want me to start implementing any of these phases? I'd suggest starting with Phase 1 (noise schedule + MLP targets) since that's the most
+    impactful for the diffusion/AR fusion research angle.
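Item 7's chunked lm_head idea can be illustrated with a small NumPy sketch. This is not SpecForge's `_chunked_lm_loss` (which also applies gradient checkpointing and loss-mask weighting); it only shows that chunk-by-chunk log-softmax + cross-entropy matches the full computation while keeping just one `[chunk, vocab]` logits slab live at a time:

```python
import numpy as np

def chunked_ce_loss(hidden, lm_head, targets, chunk_size=256):
    """Mean cross-entropy over [seq] targets, computing [chunk, vocab]
    logits one chunk at a time instead of the full [seq, vocab] matrix."""
    losses = []
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]              # [chunk, d]
        logits = h @ lm_head                              # [chunk, vocab]
        logits = logits - logits.max(axis=-1, keepdims=True)  # stability
        logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        idx = np.arange(h.shape[0])
        losses.append(-logp[idx, targets[start:start + chunk_size]])
    return np.concatenate(losses).mean()

# Check against the full-matrix computation on toy shapes.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(512, 32))
lm_head = rng.normal(size=(32, 1000))
targets = rng.integers(0, 1000, size=512)

full = hidden @ lm_head
full = full - full.max(axis=-1, keepdims=True)
logp = full - np.log(np.exp(full).sum(axis=-1, keepdims=True))
reference = -logp[np.arange(512), targets].mean()
assert np.isclose(chunked_ce_loss(hidden, lm_head, targets, chunk_size=100), reference)
```

Because per-token losses are simply concatenated before the mean, the chunked result is exact (no approximation), including when the sequence length is not a multiple of the chunk size.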
sglang/.codespellrc ADDED
@@ -0,0 +1,3 @@
+ [codespell]
+ ignore-words-list = ans, als, hel, boostrap, childs, te, vas, hsa, ment, cann, thi, makro, wil, rouge, PRIS
+ skip = *.json,*.jsonl,*.patch,*.txt
sglang/.editorconfig ADDED
@@ -0,0 +1,25 @@
+ # https://editorconfig.org/
+
+ root = true
+
+ [*]
+ charset = utf-8
+ end_of_line = lf
+ indent_style = space
+ indent_size = 4
+ trim_trailing_whitespace = true
+ insert_final_newline = true
+
+ [*.{json,yaml,yml}]
+ indent_size = 2
+
+ [*.md]
+ indent_size = 2
+ x-soft-wrap-text = true
+
+ [*.rst]
+ indent_size = 4
+ x-soft-wrap-text = true
+
+ [Makefile]
+ indent_style = tab
sglang/.isort.cfg ADDED
@@ -0,0 +1,3 @@
+ [settings]
+ profile=black
+ known_first_party=sglang
sglang/.pre-commit-config.yaml ADDED
@@ -0,0 +1,83 @@
1
+ default_stages: [pre-commit, pre-push, manual]
2
+ exclude: ^(python/sglang/multimodal_gen/csrc|python/sglang/jit_kernel/flash_attention/cute)
3
+
4
+ repos:
5
+ - repo: https://github.com/pre-commit/pre-commit-hooks
6
+ rev: v6.0.0
7
+ hooks:
8
+ - id: check-symlinks
9
+ - id: destroyed-symlinks
10
+ - id: trailing-whitespace
11
+ - id: end-of-file-fixer
12
+ - id: check-yaml
13
+ args: [--allow-multiple-documents]
14
+ - id: check-toml
15
+ - id: check-ast
16
+ - id: check-added-large-files
17
+ - id: check-merge-conflict
18
+ - id: check-shebang-scripts-are-executable
19
+ - id: detect-private-key
20
+ exclude: ^sgl-model-gateway/tests/.*_test\.rs$
21
+ - id: debug-statements
22
+ - id: no-commit-to-branch
23
+ - repo: https://github.com/PyCQA/isort
24
+ rev: 7.0.0
25
+ hooks:
26
+ - id: isort
27
+ exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
28
+ - repo: https://github.com/astral-sh/ruff-pre-commit
29
+ rev: v0.15.1
30
+ hooks:
31
+ - id: ruff
32
+ args:
33
+ - --select=F401,F821
34
+ - --fix
35
+ files: ^(benchmark/|docs/|examples/|python/sglang/|sgl-model-gateway/py_*|test/)
36
+ exclude: |
37
+ (?x)^(
38
+ .*/__init__\.py$|
39
+ .*\.ipynb$|
40
+ python/sglang/srt/grpc/.*_pb2\.py$|
41
+ python/sglang/srt/grpc/.*_pb2_grpc\.py$|
42
+ python/sglang/srt/grpc/.*_pb2\.pyi$|
43
+ python/sglang/srt/grpc/.*_pb2_grpc\.pyi$|
44
+ )$
45
+ - repo: https://github.com/psf/black
46
+ rev: 26.1.0
47
+ hooks:
48
+ - id: black-jupyter
49
+ exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
50
+ - repo: https://github.com/codespell-project/codespell
51
+ rev: v2.4.1
52
+ hooks:
53
+ - id: codespell
54
+ args: ['--config', '.codespellrc']
55
+ - repo: https://github.com/pre-commit/mirrors-clang-format
56
+ rev: v20.1.7
57
+ hooks:
58
+ - id: clang-format
59
+ types_or: [c++, cuda]
60
+ args: [--style=file, --verbose]
61
+ - repo: https://github.com/kynan/nbstripout
62
+ rev: 0.9.0
63
+ hooks:
64
+ - id: nbstripout
65
+ args:
66
+ - '--keep-output'
67
+ - '--extra-keys=metadata.kernelspec metadata.language_info.version'
68
+ - repo: local
69
+ hooks:
70
+ - id: check-chinese-characters
71
+ name: check chinese characters in multimodal_gen
72
+ entry: >-
73
+ python3 -c 'import sys, re; p=re.compile(r"[\u4e00-\u9fff]"); ec=0; [ ([(print(f"{f}:{i+1}: {l.strip()}") or (ec:=1)) for i,l in enumerate(open(f, "r", encoding="utf-8", errors="ignore")) if p.search(l)]) for f in sys.argv[1:] ]; sys.exit(ec)'
74
+ language: system
75
+ files: ^python/sglang/multimodal_gen/.*
76
+ exclude: ^(python/sglang/multimodal_gen/configs/sample|python/sglang/multimodal_gen/apps/ComfyUI_SGLDiffusion/workflows|python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages)(/|$)
77
+ types_or: [python, markdown, json, text]
78
+ - id: sort-ci-permissions
79
+ name: sort CI_PERMISSIONS.json
80
+ entry: python3 .github/update_ci_permission.py --sort-only
81
+ language: system
82
+ files: ^\.github/CI_PERMISSIONS\.json$
83
+ pass_filenames: false
sglang/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,128 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ We as members, contributors, and leaders pledge to make participation in our
6
+ community a harassment-free experience for everyone, regardless of age, body
7
+ size, visible or invisible disability, ethnicity, sex characteristics, gender
8
+ identity and expression, level of experience, education, socio-economic status,
9
+ nationality, personal appearance, race, religion, or sexual identity
10
+ and orientation.
11
+
12
+ We pledge to act and interact in ways that contribute to an open, welcoming,
13
+ diverse, inclusive, and healthy community.
14
+
15
+ ## Our Standards
16
+
17
+ Examples of behavior that contributes to a positive environment for our
18
+ community include:
19
+
20
+ * Demonstrating empathy and kindness toward other people
21
+ * Being respectful of differing opinions, viewpoints, and experiences
22
+ * Giving and gracefully accepting constructive feedback
23
+ * Accepting responsibility and apologizing to those affected by our mistakes,
24
+ and learning from the experience
25
+ * Focusing on what is best not just for us as individuals, but for the
26
+ overall community
27
+
28
+ Examples of unacceptable behavior include:
29
+
30
+ * The use of sexualized language or imagery, and sexual attention or
31
+ advances of any kind
32
+ * Trolling, insulting or derogatory comments, and personal or political attacks
33
+ * Public or private harassment
34
+ * Publishing others' private information, such as a physical or email
35
+ address, without their explicit permission
36
+ * Other conduct which could reasonably be considered inappropriate in a
37
+ professional setting
38
+
39
+ ## Enforcement Responsibilities
40
+
41
+ Community leaders are responsible for clarifying and enforcing our standards of
42
+ acceptable behavior and will take appropriate and fair corrective action in
43
+ response to any behavior that they deem inappropriate, threatening, offensive,
44
+ or harmful.
45
+
46
+ Community leaders have the right and responsibility to remove, edit, or reject
47
+ comments, commits, code, wiki edits, issues, and other contributions that are
48
+ not aligned to this Code of Conduct, and will communicate reasons for moderation
49
+ decisions when appropriate.
50
+
51
+ ## Scope
52
+
53
+ This Code of Conduct applies within all community spaces, and also applies when
54
+ an individual is officially representing the community in public spaces.
55
+ Examples of representing our community include using an official e-mail address,
56
+ posting via an official social media account, or acting as an appointed
57
+ representative at an online or offline event.
58
+
59
+ ## Enforcement
60
+
61
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
62
+ reported to the community leaders responsible for enforcement at
63
+ .
64
+ All complaints will be reviewed and investigated promptly and fairly.
65
+
66
+ All community leaders are obligated to respect the privacy and security of the
67
+ reporter of any incident.
68
+
69
+ ## Enforcement Guidelines
70
+
71
+ Community leaders will follow these Community Impact Guidelines in determining
72
+ the consequences for any action they deem in violation of this Code of Conduct:
73
+
74
+ ### 1. Correction
75
+
76
+ **Community Impact**: Use of inappropriate language or other behavior deemed
77
+ unprofessional or unwelcome in the community.
78
+
79
+ **Consequence**: A private, written warning from community leaders, providing
80
+ clarity around the nature of the violation and an explanation of why the
81
+ behavior was inappropriate. A public apology may be requested.
82
+
83
+ ### 2. Warning
84
+
85
+ **Community Impact**: A violation through a single incident or series
86
+ of actions.
87
+
88
+ **Consequence**: A warning with consequences for continued behavior. No
89
+ interaction with the people involved, including unsolicited interaction with
90
+ those enforcing the Code of Conduct, for a specified period of time. This
91
+ includes avoiding interactions in community spaces as well as external channels
92
+ like social media. Violating these terms may lead to a temporary or
93
+ permanent ban.
94
+
95
+ ### 3. Temporary Ban
96
+
97
+ **Community Impact**: A serious violation of community standards, including
98
+ sustained inappropriate behavior.
99
+
100
+ **Consequence**: A temporary ban from any sort of interaction or public
101
+ communication with the community for a specified period of time. No public or
102
+ private interaction with the people involved, including unsolicited interaction
103
+ with those enforcing the Code of Conduct, is allowed during this period.
104
+ Violating these terms may lead to a permanent ban.
105
+
106
+ ### 4. Permanent Ban
107
+
108
+ **Community Impact**: Demonstrating a pattern of violation of community
109
+ standards, including sustained inappropriate behavior, harassment of an
110
+ individual, or aggression toward or disparagement of classes of individuals.
111
+
112
+ **Consequence**: A permanent ban from any sort of public interaction within
113
+ the community.
114
+
115
+ ## Attribution
116
+
117
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage],
118
+ version 2.0, available at
119
+ https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
120
+
121
+ Community Impact Guidelines were inspired by [Mozilla's code of conduct
122
+ enforcement ladder](https://github.com/mozilla/diversity).
123
+
124
+ [homepage]: https://www.contributor-covenant.org
125
+
126
+ For answers to common questions about this code of conduct, see the FAQ at
127
+ https://www.contributor-covenant.org/faq. Translations are available at
128
+ https://www.contributor-covenant.org/translations.
sglang/LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright 2023-2024 SGLang Team
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
sglang/README.md ADDED
@@ -0,0 +1,90 @@
+
+
+ <div align="center" id="sglangtop">
+ <img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400" margin="10px"></img>
+
+ [![PyPI](https://img.shields.io/pypi/v/sglang)](https://pypi.org/project/sglang)
+ ![PyPI - Downloads](https://static.pepy.tech/badge/sglang?period=month)
+ [![license](https://img.shields.io/github/license/sgl-project/sglang.svg)](https://github.com/sgl-project/sglang/tree/main/LICENSE)
+ [![issue resolution](https://img.shields.io/github/issues-closed-raw/sgl-project/sglang)](https://github.com/sgl-project/sglang/issues)
+ [![open issues](https://img.shields.io/github/issues-raw/sgl-project/sglang)](https://github.com/sgl-project/sglang/issues)
+ [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/sgl-project/sglang)
+
+ </div>
+
+ --------------------------------------------------------------------------------
+
+ <p align="center">
+ <a href="https://lmsys.org/blog/"><b>Blog</b></a> |
+ <a href="https://docs.sglang.io/"><b>Documentation</b></a> |
+ <a href="https://roadmap.sglang.io/"><b>Roadmap</b></a> |
+ <a href="https://slack.sglang.io/"><b>Join Slack</b></a> |
+ <a href="https://meet.sglang.io/"><b>Weekly Dev Meeting</b></a> |
+ <a href="https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides"><b>Slides</b></a>
+ </p>
+
+ ## News
+ - [2026/01] 🔥 SGLang Diffusion accelerates video and image generation ([blog](https://lmsys.org/blog/2026-01-16-sglang-diffusion/)).
+ - [2025/12] SGLang provides day-0 support for latest open models ([MiMo-V2-Flash](https://lmsys.org/blog/2025-12-16-mimo-v2-flash/), [Nemotron 3 Nano](https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/), [Mistral Large 3](https://github.com/sgl-project/sglang/pull/14213), [LLaDA 2.0 Diffusion LLM](https://lmsys.org/blog/2025-12-19-diffusion-llm/), [MiniMax M2](https://lmsys.org/blog/2025-11-04-miminmax-m2/)).
+ - [2025/10] 🔥 SGLang now runs natively on TPU with the SGLang-Jax backend ([blog](https://lmsys.org/blog/2025-10-29-sglang-jax/)).
+ - [2025/09] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput ([blog](https://lmsys.org/blog/2025-09-25-gb200-part-2/)).
+ - [2025/09] SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention ([blog](https://lmsys.org/blog/2025-09-29-deepseek-V32/)).
+ - [2025/08] SGLang x AMD SF Meetup on 8/22: Hands-on GPU workshop, tech talks by AMD/xAI/SGLang, and networking ([Roadmap](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_roadmap.pdf), [Large-scale EP](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_ep.pdf), [Highlights](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_highlights.pdf), [AITER/MoRI](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_aiter_mori.pdf), [Wave](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_wave.pdf)).
+
+ <details>
+ <summary>More</summary>
+
+ - [2025/11] SGLang Diffusion accelerates video and image generation ([blog](https://lmsys.org/blog/2025-11-07-sglang-diffusion/)).
+ - [2025/10] PyTorch Conference 2025 SGLang Talk ([slide](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_pytorch_2025.pdf)).
+ - [2025/10] SGLang x Nvidia SF Meetup on 10/2 ([recap](https://x.com/lmsysorg/status/1975339501934510231)).
+ - [2025/08] SGLang provides day-0 support for OpenAI gpt-oss model ([instructions](https://github.com/sgl-project/sglang/issues/8833))
+ - [2025/06] SGLang, the high-performance serving infrastructure powering trillions of tokens daily, has been awarded the third batch of the Open Source AI Grant by a16z ([a16z blog](https://a16z.com/advancing-open-source-ai-through-benchmarks-and-bold-experimentation/)).
+ - [2025/05] Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)).
+ - [2025/06] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput ([blog](https://lmsys.org/blog/2025-06-16-gb200-part-1/)).
+ - [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html))
+ - [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https://pytorch.org/blog/sglang-joins-pytorch/))
+ - [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
+ - [2025/01] SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html), [10+ other companies](https://x.com/lmsysorg/status/1887262321636221412))
+ - [2024/12] v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
+ - [2024/10] The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
+ - [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
+ - [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
+ - [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
+ - [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
+ - [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
+
+ </details>
+
+ ## About
+ SGLang is a high-performance serving framework for large language models and multimodal models.
+ It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.
+ Its core features include:
+
+ - **Fast Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
+ - **Broad Model Support**: Supports a wide range of language models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), reward models (Skywork), and diffusion models (WAN, Qwen-Image), with easy extensibility for adding new models. Compatible with most Hugging Face models and OpenAI APIs.
+ - **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.
+ - **Active Community**: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 400,000 GPUs worldwide.
+ - **RL & Post-Training Backbone**: SGLang is a proven rollout backend across the world, with native RL integrations and adoption by well-known post-training frameworks such as [**AReaL**](https://github.com/inclusionAI/AReaL), [**Miles**](https://github.com/radixark/miles), [**slime**](https://github.com/THUDM/slime), [**Tunix**](https://github.com/google/tunix), [**verl**](https://github.com/volcengine/verl) and more.
+
+ ## Getting Started
+ - [Install SGLang](https://docs.sglang.io/get_started/install.html)
+ - [Quick Start](https://docs.sglang.io/basic_usage/send_request.html)
+ - [Backend Tutorial](https://docs.sglang.io/basic_usage/openai_api_completions.html)
+ - [Frontend Tutorial](https://docs.sglang.io/references/frontend/frontend_tutorial.html)
+ - [Contribution Guide](https://docs.sglang.io/developer_guide/contribution_guide.html)
+
+ ## Benchmark and Performance
+ Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/), [Large-scale expert parallelism](https://lmsys.org/blog/2025-05-05-large-scale-ep/), [GB200 rack-scale parallelism](https://lmsys.org/blog/2025-09-25-gb200-part-2/).
+
+ ## Adoption and Sponsorship
+ SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations across North America and Asia.
+ As an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 400,000 GPUs worldwide.
+ SGLang is currently hosted under the non-profit open-source organization [LMSYS](https://lmsys.org/about/).
+
+ <img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/refs/heads/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>
+
+ ## Contact Us
+ For enterprises interested in adopting or deploying SGLang at scale, including technical consulting, sponsorship opportunities, or partnership inquiries, please contact us at sglang@lmsys.org
+
+ ## Acknowledgment
+ We learned the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
syxin_old/DFLASH_LORA_INJECT_FIXES.md ADDED
@@ -0,0 +1,142 @@
+ # DFlash LoRA Inject Fix Plan
+
+ ## Background
+
+ Evaluation results after DFlash LoRA inject training were extremely poor:
+ - GSM8K: 5.05× → **1.04×** (baseline → LoRA inject)
+ - HumanEval: 5.06× → **0.98×**
+ - MT-Bench: 2.70× → **0.85×**
+
+ Root cause: the LoRA weights were saved in an incorrect format, compounded by an inefficient draft-model forward pass.
+
+ ---
+
+ ## Fix 1: LoRA Weight Saving (most critical)
+
+ ### File
+ `Specforge/scripts/train_dflash_lora_inject.py`
+
+ ### Problem
+ The `save_checkpoint()` function (L292-306) manually extracts the state_dict keys containing `"lora_"` and saves them as `adapter_model.safetensors`, then calls `peft_config["default"].save_pretrained()`, which only writes out the LoraConfig JSON.
+
+ However, PEFT's `PeftModel.from_pretrained()` expects:
+ 1. **A standard `adapter_config.json`** (not the format produced by serializing LoraConfig directly)
+ 2. **Different key naming** (PEFT internally handles the `base_model.model.` prefix automatically)
+
+ As a result, loading during evaluation emits many **"Found missing adapter keys"** warnings, and the LoRA weights are never actually loaded.
+
+ ### Change
+
+ **Delete L295-306:**
+ ```python
+ lora_state_dict = {
+     k: v for k, v in module.draft_model.model.state_dict().items()
+     if "lora_" in k
+ }
+
+ try:
+     from safetensors.torch import save_file as safetensors_save
+     safetensors_save(lora_state_dict, os.path.join(save_dir, "adapter_model.safetensors"))
+ except (ImportError, Exception):
+     torch.save(lora_state_dict, os.path.join(save_dir, "adapter_model.bin"))
+
+ draft_model.model.peft_config["default"].save_pretrained(save_dir)
+ ```
+
+ **Replace with:**
+ ```python
+ # Use PEFT native save which handles key naming and adapter_config.json correctly
+ module.draft_model.model.save_pretrained(save_dir)
+ ```
+
+ `PeftModel.save_pretrained()` correctly:
+ - handles the key prefix mapping automatically
+ - generates a standard `adapter_config.json`
+ - saves `adapter_model.safetensors`
+
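The key mismatch can be illustrated offline. The sketch below is a minimal illustration; the exact key strings are assumptions based on typical PEFT naming conventions, not dumped from this checkpoint. It shows why keys taken straight from the module's state_dict never match what `PeftModel.from_pretrained()` looks up:

```python
def find_missing_adapter_keys(saved_keys, expected_keys):
    """Return the expected adapter keys that have no counterpart in the saved file."""
    saved = set(saved_keys)
    return sorted(k for k in expected_keys if k not in saved)

# Keys as the manual save wrote them: raw state_dict names, still carrying the
# ".default" adapter suffix and lacking the "base_model.model." prefix.
manually_saved = [
    "model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "model.layers.0.self_attn.q_proj.lora_B.default.weight",
]

# Keys PEFT expects to find inside adapter_model.safetensors (illustrative).
peft_expected = [
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight",
]

missing = find_missing_adapter_keys(manually_saved, peft_expected)
print(f"Found missing adapter keys: {len(missing)}")  # every expected key is missing
```

Since no manually saved key matches any expected key, PEFT falls back to the base weights silently, which is consistent with the near-1.0× speedups observed.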
+
+ ---
+
+ ## Fix 2: Draft Model Forward Optimization (performance + code clarity)
+
+ ### File
+ `Specforge/specforge/modeling/draft/dflash_lora_inject.py`
+
+ ### Problem
+ `_forward_with_injection()` currently recomputes the following **inside the per-layer for loop**:
+ - `rotary_emb(layer_input, extended_pos)`: once per layer, repeated across all 36 layers
+ - the extended attention mask: an O(seq_len²) mask rebuilt at every layer
+ - `extended_pos = torch.cat([target_pos, position_ids])`: re-concatenated at every layer
+
+ ### Change
+ Hoist the computation of position_embeddings, extended_mask, and extended_pos **out of the for loop** (precompute once):
+
+ ```python
+ def _forward_with_injection(self, input_ids, attention_mask, target_hidden_states,
+                             position_ids=None, output_hidden_states=False, context_len=0):
+     # ... (get base_model, embed_tokens, layers, norm, lm_head)
+
+     hidden_states = embed_tokens(input_ids)
+     bsz, seq_len, hidden_dim = hidden_states.shape
+     ctx_len = target_hidden_states[0].shape[1] if target_hidden_states else 0
+     full_seq_len = ctx_len + seq_len
+
+     # ── Pre-compute position embeddings ONCE ──
+     target_pos = torch.arange(ctx_len, device=hidden_states.device)
+     draft_pos_ids = position_ids if position_ids is not None else torch.arange(seq_len, device=hidden_states.device).unsqueeze(0).expand(bsz, -1)
+     extended_pos = torch.cat([
+         target_pos.unsqueeze(0).expand(bsz, -1),
+         draft_pos_ids
+     ], dim=1)
+
+     position_embeddings = None
+     if hasattr(base_model.model, 'rotary_emb'):
+         dummy = torch.empty(1, full_seq_len, hidden_dim, device=hidden_states.device, dtype=hidden_states.dtype)
+         position_embeddings = base_model.model.rotary_emb(dummy, extended_pos)
+
+     # ── Pre-compute extended attention mask ONCE ──
+     extended_mask = attention_mask  # fallback
+     if attention_mask is not None and attention_mask.dim() == 4:
+         # ... (build ctx_mask_full + draft_mask_full, same logic as before)
+         # Key: use block_start (NOT block_start + 1) to prevent leakage
+         extended_mask = torch.cat([ctx_mask_full, draft_mask_full], dim=2)
+
+     # ── Layer-by-layer forward ──
+     for layer_idx, layer in enumerate(layers):
+         if target_hidden_states and layer_idx < len(target_hidden_states):
+             target_ctx = target_hidden_states[layer_idx]
+             layer_input = torch.cat([target_ctx, hidden_states], dim=1)
+
+             layer_output = layer(
+                 layer_input,
+                 attention_mask=extended_mask,
+                 position_ids=extended_pos,
+                 position_embeddings=position_embeddings,
+             )
+
+             hidden_states = layer_output[0][:, ctx_len:, :] if isinstance(layer_output, tuple) else layer_output[:, ctx_len:, :]
+         else:
+             layer_output = layer(hidden_states, attention_mask=attention_mask, position_ids=position_ids)
+             hidden_states = layer_output[0] if isinstance(layer_output, tuple) else layer_output
+
+     hidden_states = norm(hidden_states)
+     if output_hidden_states:
+         return hidden_states
+     return lm_head(hidden_states)
+ ```
+
+ ---
+
+ ## Files That Need No Changes
+
+ | File | Reason |
+ |------|--------|
+ | `eval_dflash_lora_inject.py` | Inference logic is correct; positions are already aligned |
+ | `specforge/core/dflash_lora_inject.py` | The training wrapper's mask (`block_start` without the +1) is already correct |
+
+ ---
+
+ ## Verification Plan
+
+ 1. **LoRA roundtrip**: after saving, `PeftModel.from_pretrained()` loads with no warnings
+ 2. **Forward consistency**: precomputed and per-layer recomputed outputs are identical
+ 3. **End-to-end evaluation**: retrain and run `eval_dflash_lora_inject.py` to verify the acceptance length improves
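The forward-consistency check (item 2) can be prototyped without loading the model: RoPE tables depend only on positions and head dimension, so computing them once for the extended positions must match what a per-layer recomputation would produce. A minimal NumPy sketch of that check, using the standard RoPE formula; `head_dim`, the position counts, and the 36-layer loop are illustrative values, not read from the actual config:

```python
import numpy as np

def rope_tables(positions, head_dim, base=10000.0):
    """Standard RoPE cos/sin tables for a 1-D array of positions."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(positions, inv_freq)  # (seq_len, head_dim // 2)
    return np.cos(angles), np.sin(angles)

# Extended positions = context positions followed by draft-block positions.
extended_pos = np.concatenate([np.arange(128), np.arange(8)])

# Precompute ONCE (the optimized path).
cos_once, sin_once = rope_tables(extended_pos, head_dim=64)

# Recompute per layer (the old path) and verify every layer matches.
for _ in range(36):
    cos_layer, sin_layer = rope_tables(extended_pos, head_dim=64)
    assert np.allclose(cos_once, cos_layer) and np.allclose(sin_once, sin_layer)

print("precomputed tables match per-layer recomputation")
```

The real check should additionally compare the full layer outputs with and without the hoisted mask and `extended_pos`, since those are the other two quantities moved out of the loop.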
syxin_old/backup.log ADDED
The diff for this file is too large to render. See raw diff
 
syxin_old/dflash_8gpu_03-31-13:40.log ADDED
@@ -0,0 +1,552 @@
1
+ nohup: ignoring input
2
+
3
+ *****************************************
4
+ Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
5
+ *****************************************
6
+ Set TORCH_CUDA_ARCH_LIST to 9.0Set TORCH_CUDA_ARCH_LIST to 9.0Set TORCH_CUDA_ARCH_LIST to 9.0Set TORCH_CUDA_ARCH_LIST to 9.0Set TORCH_CUDA_ARCH_LIST to 9.0
7
+
8
+
9
+
10
+
11
+ Set TORCH_CUDA_ARCH_LIST to 9.0
12
+ Set TORCH_CUDA_ARCH_LIST to 9.0
13
+ Set TORCH_CUDA_ARCH_LIST to 9.0
14
+ /workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
15
+ warnings.warn(
16
+ /workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
17
+ warnings.warn(
18
+ /workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
19
+ warnings.warn(
20
+ /workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
21
+ warnings.warn(
22
+ /workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
23
+ warnings.warn(
24
+ /workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
25
+ warnings.warn(
26
+ /workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
27
+ warnings.warn(
28
+ /workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
29
+ warnings.warn(
30
+ INFO:specforge.utils:rank 0: bind to device 0
31
+ INFO:specforge.utils:rank 7: bind to device 7
32
+ INFO:specforge.utils:rank 0: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
33
+ INFO:specforge.utils:rank 2: bind to device 2
34
+ INFO:specforge.utils:rank 7: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
35
+ INFO:specforge.utils:rank 0: Initialized distributed
36
+ INFO:specforge.utils:Loading target model from /workspace/models/Qwen3-8B using hf backend
37
+ INFO:specforge.utils:rank 7: Initialized distributed
38
+ INFO:specforge.utils:rank 1: bind to device 1
39
+ INFO:specforge.utils:rank 2: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
40
+ INFO:specforge.utils:rank 2: Initialized distributed
41
+ `torch_dtype` is deprecated! Use `dtype` instead!
42
+ `torch_dtype` is deprecated! Use `dtype` instead!
43
+ `torch_dtype` is deprecated! Use `dtype` instead!
44
+ INFO:specforge.utils:rank 6: bind to device 6
45
+ INFO:specforge.utils:rank 5: bind to device 5
46
+ INFO:specforge.utils:rank 1: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
47
+ INFO:specforge.utils:rank 1: Initialized distributed
48
+ `torch_dtype` is deprecated! Use `dtype` instead!
49
+ INFO:specforge.utils:rank 6: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
50
+ INFO:specforge.utils:rank 4: bind to device 4
51
+ INFO:specforge.utils:rank 5: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
52
+ INFO:specforge.utils:rank 6: Initialized distributed
53
+ INFO:specforge.utils:rank 5: Initialized distributed
54
+ INFO:specforge.utils:rank 4: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
55
+ `torch_dtype` is deprecated! Use `dtype` instead!
56
+ `torch_dtype` is deprecated! Use `dtype` instead!
57
+ INFO:specforge.utils:rank 4: Initialized distributed
58
+ `torch_dtype` is deprecated! Use `dtype` instead!
59
+ INFO:specforge.utils:rank 3: bind to device 3
60
+ [rank2]: Traceback (most recent call last):
61
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/importlib/metadata/__init__.py", line 563, in from_name
62
+ [rank2]: return next(cls.discover(name=name))
63
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
64
+ [rank2]: StopIteration
65
+
66
+ [rank2]: During handling of the above exception, another exception occurred:
67
+
68
+ [rank2]: Traceback (most recent call last):
69
+ [rank2]: File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash.py", line 723, in <module>
70
+ [rank2]: main()
71
+ [rank2]: File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash.py", line 475, in main
72
+ [rank2]: target_model, draft_model = build_models(args)
73
+ [rank2]: ^^^^^^^^^^^^^^^^^^
74
+ [rank2]: File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash.py", line 265, in build_models
75
+ [rank2]: target_model = get_dflash_target_model(
76
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^
77
+ [rank2]: File "/workspace/hanrui/syxin_old/Specforge/specforge/modeling/target/dflash_target_model.py", line 341, in get_dflash_target_model
78
+ [rank2]: return HFDFlashTargetModel.from_pretrained(
79
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
80
+ [rank2]: File "/workspace/hanrui/syxin_old/Specforge/specforge/modeling/target/dflash_target_model.py", line 278, in from_pretrained
81
+ [rank2]: target_model = AutoModelForCausalLM.from_pretrained(
82
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
83
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
84
+ [rank2]: return model_class.from_pretrained(
85
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
86
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
87
+ [rank2]: return func(*args, **kwargs)
88
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^
89
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4971, in from_pretrained
90
+ [rank2]: model = cls(config, *model_args, **model_kwargs)
91
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
92
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 435, in __init__
93
+ [rank2]: super().__init__(config)
94
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2076, in __init__
95
+ [rank2]: self.config._attn_implementation_internal = self._check_and_adjust_attn_implementation(
96
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
97
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2686, in _check_and_adjust_attn_implementation
98
+ [rank2]: applicable_attn_implementation = self.get_correct_attn_implementation(
99
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
100
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2714, in get_correct_attn_implementation
101
+ [rank2]: self._flash_attn_2_can_dispatch(is_init_check)
102
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2425, in _flash_attn_2_can_dispatch
103
+ [rank2]: flash_attention_version = version.parse(importlib.metadata.version("flash_attn"))
104
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
105
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/importlib/metadata/__init__.py", line 1009, in version
106
+ [rank2]: return distribution(distribution_name).version
107
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
108
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/importlib/metadata/__init__.py", line 982, in distribution
109
+ [rank2]: return Distribution.from_name(distribution_name)
110
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
111
+ [rank2]: File "/workspace/miniconda3/envs/spec/lib/python3.11/importlib/metadata/__init__.py", line 565, in from_name
112
+ [rank2]: raise PackageNotFoundError(name)
113
+ [rank2]: importlib.metadata.PackageNotFoundError: No package metadata was found for flash_attn
+ [rank0]: Traceback (most recent call last):
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/importlib/metadata/__init__.py", line 563, in from_name
+ [rank0]: return next(cls.discover(name=name))
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: StopIteration
+
+ [rank0]: During handling of the above exception, another exception occurred:
+
+ [rank0]: Traceback (most recent call last):
+ [rank0]: File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash.py", line 723, in <module>
+ [rank0]: main()
+ [rank0]: File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash.py", line 475, in main
+ [rank0]: target_model, draft_model = build_models(args)
+ [rank0]: ^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash.py", line 265, in build_models
+ [rank0]: target_model = get_dflash_target_model(
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/hanrui/syxin_old/Specforge/specforge/modeling/target/dflash_target_model.py", line 341, in get_dflash_target_model
+ [rank0]: return HFDFlashTargetModel.from_pretrained(
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/hanrui/syxin_old/Specforge/specforge/modeling/target/dflash_target_model.py", line 278, in from_pretrained
+ [rank0]: target_model = AutoModelForCausalLM.from_pretrained(
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
+ [rank0]: return model_class.from_pretrained(
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
+ [rank0]: return func(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4971, in from_pretrained
+ [rank0]: model = cls(config, *model_args, **model_kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 435, in __init__
+ [rank0]: super().__init__(config)
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2076, in __init__
+ [rank0]: self.config._attn_implementation_internal = self._check_and_adjust_attn_implementation(
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2686, in _check_and_adjust_attn_implementation
+ [rank0]: applicable_attn_implementation = self.get_correct_attn_implementation(
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2714, in get_correct_attn_implementation
+ [rank0]: self._flash_attn_2_can_dispatch(is_init_check)
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2425, in _flash_attn_2_can_dispatch
+ [rank0]: flash_attention_version = version.parse(importlib.metadata.version("flash_attn"))
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/importlib/metadata/__init__.py", line 1009, in version
+ [rank0]: return distribution(distribution_name).version
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/importlib/metadata/__init__.py", line 982, in distribution
+ [rank0]: return Distribution.from_name(distribution_name)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/workspace/miniconda3/envs/spec/lib/python3.11/importlib/metadata/__init__.py", line 565, in from_name
+ [rank0]: raise PackageNotFoundError(name)
+ [rank0]: importlib.metadata.PackageNotFoundError: No package metadata was found for flash_attn
+ [rank7]: importlib.metadata.PackageNotFoundError: No package metadata was found for flash_attn
+ [rank1]: importlib.metadata.PackageNotFoundError: No package metadata was found for flash_attn
+ INFO:specforge.utils:rank 3: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
+ INFO:specforge.utils:rank 3: Initialized distributed
+ `torch_dtype` is deprecated! Use `dtype` instead!
+ [rank5]: importlib.metadata.PackageNotFoundError: No package metadata was found for flash_attn
+ [rank4]: importlib.metadata.PackageNotFoundError: No package metadata was found for flash_attn
+ [rank6]: importlib.metadata.PackageNotFoundError: No package metadata was found for flash_attn
+ [rank3]: importlib.metadata.PackageNotFoundError: No package metadata was found for flash_attn
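Editor's note: every rank dies in transformers' flash-attention dispatch, which probes the installed `flash_attn` package via `importlib.metadata.version` before constructing the Qwen3 model. A minimal stdlib sketch of that probe (the helper name `flash_attn_available` is ours, not from `train_dflash.py`):

```python
import importlib.metadata


def flash_attn_available() -> bool:
    """Replicates the metadata probe that raised above."""
    try:
        # Raises PackageNotFoundError when flash_attn is not installed.
        importlib.metadata.version("flash_attn")
        return True
    except importlib.metadata.PackageNotFoundError:
        return False


if not flash_attn_available():
    # Either install flash-attn in this env, or avoid the probe entirely
    # by requesting a different backend, e.g.
    # AutoModelForCausalLM.from_pretrained(..., attn_implementation="sdpa")
    print("flash_attn not installed; install it or use attn_implementation='sdpa'")
```

Passing `attn_implementation="sdpa"` is a standard transformers fallback that sidesteps the flash-attn version check, assuming the training script exposes a way to forward that kwarg.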
+ [rank0]:[W331 13:41:10.473818504 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+ [rank7]:[W331 13:41:10.548010235 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+ [rank7]:[W331 13:41:10.659783753 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+ [rank4]:[W331 13:41:11.950068591 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+ [rank2]:[W331 13:41:11.951701730 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+ [rank6]:[W331 13:41:11.974675832 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+ [rank5]:[W331 13:41:11.997313679 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+ [rank3]:[W331 13:41:11.024650758 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+ [rank1]:[W331 13:41:11.024685351 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+ [rank2]:[W331 13:41:11.101274402 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+ [rank4]:[W331 13:41:11.102711684 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+ [rank6]:[W331 13:41:11.121351120 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+ [rank0]:[W331 13:41:11.122852367 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+ [rank1]:[W331 13:41:11.167109415 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+ [rank3]:[W331 13:41:11.170910568 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+ [rank5]:[W331 13:41:11.173578451 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+ W0331 13:41:11.393000 540 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 641 closing signal SIGTERM
+ W0331 13:41:11.393000 540 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 642 closing signal SIGTERM
+ W0331 13:41:11.394000 540 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 643 closing signal SIGTERM
+ W0331 13:41:11.394000 540 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 644 closing signal SIGTERM
+ W0331 13:41:11.394000 540 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 645 closing signal SIGTERM
+ W0331 13:41:11.395000 540 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 646 closing signal SIGTERM
+ W0331 13:41:11.395000 540 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 647 closing signal SIGTERM
+ E0331 13:41:12.401000 540 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 7 (pid: 648) of binary: /workspace/miniconda3/envs/spec/bin/python3
+ Traceback (most recent call last):
+ File "<frozen runpy>", line 198, in _run_module_as_main
+ File "<frozen runpy>", line 88, in _run_code
+ File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/torch/distributed/run.py", line 940, in <module>
+ main()
+ File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
+ return f(*args, **kwargs)
526
+ ^^^^^^^^^^^^^^^^^^
527
+ File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main
528
+ run(args)
529
+ File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
530
+ elastic_launch(
531
+ File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
532
+ return launch_agent(self._config, self._entrypoint, list(args))
533
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
534
+ File "/workspace/miniconda3/envs/spec/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
535
+ raise ChildFailedError(
536
+ torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
537
+ ============================================================
538
+ /workspace/hanrui/syxin_old/Specforge/scripts/train_dflash.py FAILED
539
+ ------------------------------------------------------------
540
+ Failures:
541
+ <NO_OTHER_FAILURES>
542
+ ------------------------------------------------------------
543
+ Root Cause (first observed failure):
544
+ [0]:
545
+ time : 2026-03-31_13:41:11
546
+ host : job-006ce80a7c47-20260302193512-5cd88f7cfc-mlbh9
547
+ rank : 7 (local_rank: 7)
548
+ exitcode : 1 (pid: 648)
549
+ error_file: <N/A>
550
+ traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
551
+ ============================================================
552
+ [W331 13:41:12.379198750 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
syxin_old/diagnostic_compare.py ADDED
@@ -0,0 +1,301 @@
+ #!/usr/bin/env python3
+ """Diagnostic: compare training forward vs eval forward on the same block.
+ 
+ Goal: find where the train-eval mismatch is.
+ Runs on a single GPU (no distributed).
+ 
+ Usage:
+     /workspace/miniconda3/envs/dflash/bin/python3 /workspace/hanrui/syxin_old/diagnostic_compare.py
+ """
+ import sys, os, warnings
+ import torch
+ import torch.nn.functional as F
+ import numpy as np
+ 
+ sys.path.insert(0, "/workspace/hanrui/syxin_old")
+ sys.path.insert(0, "/workspace/hanrui/syxin_old/Specforge")
+ 
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+ 
+ BASE_MODEL = "/workspace/models/Qwen3-8B"
+ ADAPTER_PATH = "/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_4644"
+ BLOCK_SIZE = 16
+ MASK_TOKEN_ID = 151666
+ 
+ device = torch.device("cuda:0")
+ 
+ 
+ def main():
+     tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
+ 
+     print("Loading target model...")
+     target_model = AutoModelForCausalLM.from_pretrained(
+         BASE_MODEL, torch_dtype=torch.bfloat16,
+         attn_implementation="sdpa", device_map=device, trust_remote_code=True,
+     )
+     target_model.eval()
+ 
+     print("Loading draft model (base + LoRA adapter)...")
+     draft_model = AutoModelForCausalLM.from_pretrained(
+         BASE_MODEL, torch_dtype=torch.bfloat16,
+         attn_implementation="sdpa", device_map=device, trust_remote_code=True,
+     )
+     draft_model = PeftModel.from_pretrained(draft_model, ADAPTER_PATH)
+     draft_model = draft_model.merge_and_unload()
+     draft_model.eval()
+ 
+     num_layers = len(draft_model.model.layers)
+     draft_layers = draft_model.model.layers
+     draft_norm = draft_model.model.norm
+     draft_lm_head = draft_model.lm_head
+     rotary_emb = draft_model.model.rotary_emb
+ 
+     # Create a test sequence
+     text = "The quick brown fox jumps over the lazy dog. " * 10
+     messages = [{"role": "user", "content": text}]
+     input_text = tokenizer.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
+     )
+     full_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
+ 
+     # Extend to get a sequence long enough for multiple blocks
+     # Use the target model to generate some tokens
+     print(f"Input length: {full_ids.shape[1]}")
+ 
+     # We'll create a fixed sequence by repeating the prompt
+     # Actually, let's use the full_ids as is (it should be ~200 tokens)
+     seq_len = full_ids.shape[1]
+ 
+     # Make it align to block boundaries
+     n_blocks = seq_len // BLOCK_SIZE
+     effective_len = n_blocks * BLOCK_SIZE
+     input_ids = full_ids[:, :effective_len]
+     seq_len = effective_len
+ 
+     print(f"Using {n_blocks} blocks, seq_len = {seq_len}")
+ 
+     # ═══════════════════════════════════════════════════════
+     # TRAINING-STYLE FORWARD
+     # ═══════════════════════════════════════════════════════
+     print("\n" + "="*60)
+     print("TRAINING-STYLE FORWARD")
+     print("="*60)
+ 
+     with torch.no_grad():
+         # Step 1: Get target hidden states (full sequence)
+         target_output = target_model(
+             input_ids,
+             output_hidden_states=True,
+         )
+         # target hidden states: [hidden_states[0], ..., hidden_states[L-1]]
+         # hidden_states[k] = input to layer k
+         target_hidden_states = [target_output.hidden_states[i] for i in range(num_layers)]
+ 
+         # Step 2: Prepare noise input (mask non-anchors)
+         noise_input = input_ids.clone()
+         positions = torch.arange(seq_len, device=device)
+         is_anchor = (positions % BLOCK_SIZE) == 0
+         noise_input[:, ~is_anchor] = MASK_TOKEN_ID
+ 
+         # Step 3: Build DFlash mask (draft-to-draft)
+         NEG_INF = torch.finfo(torch.bfloat16).min
+         block_ids_mask = positions // BLOCK_SIZE
+         q_ids = block_ids_mask.unsqueeze(1)
+         k_ids = block_ids_mask.unsqueeze(0)
+         same_block = (q_ids == k_ids)
+         dflash_mask = torch.full((seq_len, seq_len), NEG_INF, device=device, dtype=torch.bfloat16)
+         dflash_mask[same_block] = 0.0
+         dflash_mask = dflash_mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]
+ 
+         # Step 4: Forward through draft model with injection
+         # (Mimicking _forward_with_injection with context_len=0)
+         ctx_len = seq_len  # target hidden states span full sequence
+         full_seq_len = ctx_len + seq_len
+ 
+         # Position IDs: [0..N-1, 0..N-1]
+         orig_pos = torch.arange(seq_len, device=device)
+         extended_pos = torch.cat([orig_pos, orig_pos], dim=0).unsqueeze(0)
+ 
+         # Extended mask
+         # Context-to-context: causal
+         ctx_ctx_mask = torch.full((ctx_len, ctx_len), NEG_INF, device=device, dtype=torch.bfloat16)
+         ctx_ctx_mask = torch.triu(ctx_ctx_mask, diagonal=1)
+         ctx_draft_mask = torch.full((ctx_len, seq_len), NEG_INF, device=device, dtype=torch.bfloat16)
+         ctx_mask_full = torch.cat([ctx_ctx_mask, ctx_draft_mask], dim=-1)
+         ctx_mask_full = ctx_mask_full.unsqueeze(0).unsqueeze(0)
+ 
+         draft_mask_full = torch.full((1, 1, seq_len, full_seq_len), NEG_INF, device=device, dtype=torch.bfloat16)
+ 
+         # Draft-to-target visibility
+         draft_pos = torch.arange(seq_len, device=device)
+         target_pos = torch.arange(ctx_len, device=device)
+         context_len = 0  # training uses context_len=0
+ 
+         is_ctx = draft_pos < context_len  # all False
+         block_id = (draft_pos - context_len).clamp(min=0) // BLOCK_SIZE
+         block_start = context_len + block_id * BLOCK_SIZE
+         max_visible = torch.where(is_ctx, draft_pos + 1, block_start + 1)
+         visible = target_pos.unsqueeze(0) < max_visible.unsqueeze(1)
+         draft_mask_full[:, :, :, :ctx_len].masked_fill_(
+             visible.unsqueeze(0).unsqueeze(0), 0
+         )
+ 
+         # Draft-to-draft
+         draft_mask_full[:, :, :, ctx_len:] = dflash_mask
+ 
+         extended_mask = torch.cat([ctx_mask_full, draft_mask_full], dim=2)
+ 
+         # Position embeddings
+         dummy = torch.empty(1, full_seq_len, target_hidden_states[0].shape[-1],
+                             device=device, dtype=torch.bfloat16)
+         position_embeddings = rotary_emb(dummy, extended_pos)
+ 
+         # Layer-by-layer forward
+         hidden_states = draft_model.model.embed_tokens(noise_input)
+ 
+         for layer_idx in range(num_layers):
+             target_ctx = target_hidden_states[layer_idx]
+             layer_input = torch.cat([target_ctx, hidden_states], dim=1)
+ 
+             layer_output = draft_layers[layer_idx](
+                 layer_input,
+                 attention_mask=extended_mask,
+                 position_ids=extended_pos,
+                 position_embeddings=position_embeddings,
+             )
+             if isinstance(layer_output, tuple):
+                 layer_output = layer_output[0]
+             hidden_states = layer_output[:, ctx_len:, :]
+ 
+         hidden_states = draft_norm(hidden_states)
+         train_logits = draft_lm_head(hidden_states)  # [1, seq_len, vocab_size]
+         train_preds = train_logits.argmax(dim=-1)  # [1, seq_len]
+ 
+         # Compute training accuracy per block
+         print("\nTraining-style per-block accuracy (consecutive correct from pos 1):")
+         for b in range(n_blocks):
+             start = b * BLOCK_SIZE
+             block_preds = train_preds[0, start:start + BLOCK_SIZE]
+             block_labels = input_ids[0, start:start + BLOCK_SIZE]
+             correct = (block_preds[1:] == block_labels[1:])  # skip anchor
+             cumprod = correct.cumprod(dim=0)
+             accept_len = cumprod.sum().item()
+             print(f" Block {b} (pos {start}-{start+15}): accept_len={accept_len}, "
+                   f"token_acc={correct.float().mean():.3f}")
+ 
+     # ═══════════════════════════════════════════════════════
+     # EVAL-STYLE FORWARD (block by block)
+     # ═══════════════════════════════════════════════════════
+     print("\n" + "="*60)
+     print("EVAL-STYLE FORWARD (block by block)")
+     print("="*60)
+ 
+     with torch.no_grad():
+         # Get target hidden states for the full sequence (to use as context)
+         # In real eval, these would come from incremental target forwards
+         # Here we use the same full-sequence target hidden states
+ 
+         for b in range(n_blocks):
+             start = b * BLOCK_SIZE
+             end = start + BLOCK_SIZE
+ 
+             # Block input: anchor + MASK
+             block_ids = input_ids[:, start:end].clone()
+             block_ids[:, 1:] = MASK_TOKEN_ID  # mask non-anchors
+ 
+             # Context: target hidden states for positions 0..start (inclusive)
+             # This matches the training visibility: block k sees target 0..k*16
+             ctx_end = start + 1  # include anchor position
+ 
+             if ctx_end == 0:
+                 # Block 0 with no context — skip (can't have empty context)
+                 # Actually block 0 has anchor at position 0, so ctx_end = 1
+                 ctx_end = 1
+ 
+             ctx_len_eval = ctx_end
+             actual_bs = BLOCK_SIZE
+ 
+             # Build eval mask
+             full_len_eval = ctx_len_eval + actual_bs
+             eval_mask = torch.full((1, 1, full_len_eval, full_len_eval), NEG_INF,
+                                    device=device, dtype=torch.bfloat16)
+ 
+             # Context-to-context: causal
+             if ctx_len_eval > 0:
+                 ctx_rows = torch.arange(ctx_len_eval, device=device)
+                 ctx_cols = torch.arange(ctx_len_eval, device=device)
+                 causal = ctx_cols.unsqueeze(0) <= ctx_rows.unsqueeze(1)
+                 eval_mask[0, 0, :ctx_len_eval, :ctx_len_eval].masked_fill_(causal, 0)
+ 
+             # Block-to-context: all visible
+             eval_mask[0, 0, ctx_len_eval:, :ctx_len_eval] = 0
+             # Block-to-block: bidirectional
+             eval_mask[0, 0, ctx_len_eval:, ctx_len_eval:] = 0
+ 
+             # Position IDs
+             ctx_positions = torch.arange(ctx_len_eval, device=device)
+             block_positions = torch.arange(start, start + actual_bs, device=device)
+             combined_pos = torch.cat([ctx_positions, block_positions], dim=0).unsqueeze(0)
+ 
+             # Position embeddings
+             hidden_dim = target_hidden_states[0].shape[-1]
+             dummy_eval = torch.empty(1, full_len_eval, hidden_dim, device=device, dtype=torch.bfloat16)
+             pos_emb_eval = rotary_emb(dummy_eval, combined_pos)
+ 
+             # Draft forward
+             draft_hidden = draft_model.model.embed_tokens(block_ids)
+ 
+             for layer_idx in range(num_layers):
+                 target_ctx = target_hidden_states[layer_idx][:, :ctx_end, :]
+                 combined = torch.cat([target_ctx, draft_hidden], dim=1)
+ 
+                 layer_output = draft_layers[layer_idx](
+                     combined,
+                     attention_mask=eval_mask,
+                     position_ids=combined_pos,
+                     position_embeddings=pos_emb_eval,
+                 )
+                 if isinstance(layer_output, tuple):
+                     layer_output = layer_output[0]
+                 draft_hidden = layer_output[:, ctx_len_eval:, :]
+ 
+             draft_hidden = draft_norm(draft_hidden)
+             eval_logits = draft_lm_head(draft_hidden)  # [1, 16, vocab_size]
+             eval_preds = eval_logits.argmax(dim=-1)  # [1, 16]
+ 
+             # Compare with training
+             train_block_preds = train_preds[0, start:end]
+             eval_block_preds = eval_preds[0]
+             block_labels = input_ids[0, start:end]
+ 
+             train_correct = (train_block_preds[1:] == block_labels[1:])
+             eval_correct = (eval_block_preds[1:] == block_labels[1:])
+             preds_match = (train_block_preds == eval_block_preds)
+ 
+             train_accept = train_correct.cumprod(dim=0).sum().item()
+             eval_accept = eval_correct.cumprod(dim=0).sum().item()
+ 
+             # Check if logits are close
+             train_block_logits = train_logits[0, start:end, :]
+             eval_block_logits = eval_logits[0, :, :]
+             logit_diff = (train_block_logits - eval_block_logits).abs().max().item()
+             logit_rmse = ((train_block_logits - eval_block_logits)**2).mean().sqrt().item()
+ 
+             print(f" Block {b} (pos {start}-{start+15}):")
+             print(f"   Train accept_len={train_accept}, Eval accept_len={eval_accept}")
+             print(f"   Predictions match: {preds_match.sum().item()}/{BLOCK_SIZE}")
+             print(f"   Logit max_diff={logit_diff:.4f}, rmse={logit_rmse:.6f}")
+ 
+             if not preds_match.all():
+                 mismatch_pos = (~preds_match).nonzero(as_tuple=True)[0]
+                 for pos in mismatch_pos[:5]:  # show first 5 mismatches
+                     p = pos.item()
+                     print(f"     Mismatch at block pos {p}: "
+                           f"train={train_block_preds[p].item()}, "
+                           f"eval={eval_block_preds[p].item()}, "
+                           f"label={block_labels[p].item()}")
+ 
+ 
+ if __name__ == "__main__":
+     main()
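The block-diagonal draft-to-draft visibility built in Step 3 of the script above can be sketched standalone. A minimal NumPy version (BLOCK_SIZE shrunk to 4 for readability; `float("-inf")` stands in for the script's `torch.finfo(torch.bfloat16).min` fill value):

```python
import numpy as np

NEG_INF = float("-inf")

def dflash_draft_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Additive attention mask: each query position attends only to key
    positions inside its own block (0.0 = visible, -inf = masked)."""
    block_ids = np.arange(seq_len) // block_size
    same_block = block_ids[:, None] == block_ids[None, :]
    mask = np.full((seq_len, seq_len), NEG_INF)
    mask[same_block] = 0.0
    return mask

mask = dflash_draft_mask(seq_len=8, block_size=4)
# Positions 0-3 (block 0) attend bidirectionally to each other but are
# fully masked from positions 4-7 (block 1), and vice versa.
assert (mask[:4, :4] == 0.0).all()
assert (mask[:4, 4:] == NEG_INF).all()
```

Within each block attention is bidirectional (no causal triangle), which is why the training and eval forwards above must agree on block boundaries to produce identical logits.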
syxin_old/eval_alignment_diff.md ADDED
@@ -0,0 +1,132 @@
+ # DFlash Eval Alignment Analysis: Your Scripts vs. the Official benchmark.py
+ 
+ ## Summary of Changes
+ 
+ | # | Issue | Impact | baseline | lora_inject |
+ |---|------|------|----------|-------------|
+ | 1 | Acceptance-length computation | 🔴 Different values | ✅ Fixed | ✅ Fixed |
+ | 2 | Multi-turn conversation support | 🔴 Different mt-bench results | ✅ Fixed | ✅ Fixed |
+ | 3 | Sample selection: sequential vs. shuffled | 🔴 Different subset | ✅ Fixed | ✅ Fixed |
+ | 4 | Only 3 datasets vs. the official 10 | 🔴 Incomplete coverage | ✅ Fixed | ✅ Fixed |
+ | 5 | stop_token_ids check range | 🟡 May stop too early/too late | ✅ Fixed | ✅ Fixed |
+ | 6 | Distributed aggregation | 🟡 Loses per-sample granularity | ✅ Fixed | ✅ Fixed |
+ | 7 | AR baseline includes draft forward | 🔴 Inflated speedup (inject only) | N/A | ✅ Fixed |
+ | 8 | max_new_tokens default | 🟡 | ✅ Fixed | ✅ Fixed |
+ 
+ ---
+ 
+ ## Change Details
+ 
+ ### 1. Acceptance-length computation
+ 
+ **Problem:** The official script takes the per-sample mean first and then averages across samples; yours pools all verify rounds globally.
+ 
+ ```python
+ # ❌ Yours (both scripts)
+ avg_accept_length = total_accept_sum / total_count
+ 
+ # ✅ Official
+ tau = np.mean([np.mean(r[block_size].acceptance_lengths) for r in responses])
+ ```
+ 
+ **Fix:** Collect each sample's accept_lengths list, compute the per-sample mean first, then average those means.
+ 
+ ---
+ 
+ ### 2. Multi-turn conversation support
+ 
+ **Problem:** For multi-turn datasets such as mt-bench, the official script generates turn by turn and appends each assistant reply to the context.
+ 
+ ```python
+ # ❌ Yours: only turns[0], single turn
+ messages = [{"role": "user", "content": prompt}]
+ 
+ # ✅ Official: generate turn by turn
+ for turn_index, user_content in enumerate(instance["turns"]):
+     messages.append({"role": "user", "content": user_content})
+     # generate ...
+     messages.append({"role": "assistant", "content": output_text})
+     responses.append(response)
+ ```
+ 
+ **Fix:** Data loading now returns the `{"turns": [...]}` format, and the generation loop iterates over turns.
+ 
+ ---
+ 
+ ### 3. Sample selection
+ 
+ **Problem:** The official script shuffles before selecting; yours takes the first N in order.
+ 
+ ```python
+ # ❌ Yours
+ prompts = prompts[:num_samples]
+ 
+ # ✅ Official
+ dataset = dataset.shuffle(seed=0).select(range(max_samples))
+ ```
+ 
+ **Fix:** Switch to the HF dataset shuffle + select.
+ 
+ ---
+ 
+ ### 4. Dataset coverage
+ 
+ **Problem:** Seven datasets are missing: math500, aime24, aime25, mbpp, livecodebench, swe-bench, alpaca.
+ 
+ **Fix:** Reuse the official `load_and_process_dataset()` function directly.
+ 
+ ---
+ 
+ ### 5. stop_token_ids check range
+ 
+ ```python
+ # ❌ Your baseline (checks only up to start)
+ output_ids[:, num_input_tokens:start]
+ 
+ # ✅ Official (checks everything generated so far)
+ output_ids[:, num_input_tokens:]
+ ```
+ 
+ **Fix:** Change to `output_ids[:, num_input_tokens:]`.
+ 
+ ---
+ 
+ ### 6. Distributed aggregation
+ 
+ **Problem:** You aggregate scalars with all_reduce, losing per-sample granularity.
+ 
+ ```python
+ # ❌ Yours: all_reduce sum/count
+ dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
+ 
+ # ✅ Official: gather the full response list
+ responses = dist.gather(responses, dst=0)
+ ```
+ 
+ **Fix:** Gather per-sample acceptance_lengths + time metrics to rank 0 and compute the statistics there.
+ 
+ ---
+ 
+ ### 7. AR baseline includes the draft forward (lora_inject only)
+ 
+ **Problem:** With block_size=1 the full inject pipeline (including the draft model) still runs, inflating the AR time and therefore the speedup.
+ 
+ ```python
+ # ❌ Your lora_inject AR baseline
+ spec_generate_inject(..., block_size=1)  # still runs the draft model layers
+ 
+ # ✅ Should be: pure target autoregressive
+ ar_generate(target_model, input_ids, ...)  # target model only
+ ```
+ 
+ **Fix:** Add a pure AR generation function; skip the draft model entirely when block_size=1.
+ 
+ ---
+ 
+ ### 8. max_new_tokens default
+ 
+ ```python
+ # The official shell scripts use 2048 (the Python default is 16384).
+ # Your default is 2048, consistent with the shell scripts; keep it,
+ # but add an explicit note about the difference.
+ ```
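To make fix #1 concrete, a toy example (numbers invented) showing that the globally pooled mean and the official per-sample mean-of-means disagree whenever samples contribute different numbers of verify rounds:

```python
# Hypothetical acceptance lengths from two samples
sample_a = [4, 4, 4, 4]   # 4 verify rounds, consistently high acceptance
sample_b = [1, 1]         # 2 verify rounds, low acceptance

# Global pooling (the old behavior): every verify round weighted equally,
# so samples with more rounds dominate the average
pooled = sum(sample_a + sample_b) / len(sample_a + sample_b)   # 18 / 6 = 3.0

# Official: per-sample mean first, then mean across samples,
# so each sample contributes equally regardless of its round count
per_sample_means = [sum(s) / len(s) for s in (sample_a, sample_b)]
tau = sum(per_sample_means) / len(per_sample_means)            # (4.0 + 1.0) / 2 = 2.5
```

The gap grows with the variance in rounds per sample, which is why the two scripts reported different acceptance lengths on the same runs.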
syxin_old/eval_dflash_b16_baseline.py ADDED
@@ -0,0 +1,354 @@
+ #!/usr/bin/env python3
+ """
+ Offline evaluation for DFlash-b16 baseline: measure accepted length.
+ 8 GPUs parallel, each GPU loads target + draft independently.
+ 
+ Usage:
+     # 8 GPUs
+     torchrun --nproc_per_node 8 eval_dflash_b16_baseline.py
+ 
+     # quick test
+     torchrun --nproc_per_node 8 eval_dflash_b16_baseline.py --num-samples 20
+ 
+     # single GPU
+     python3 eval_dflash_b16_baseline.py --benchmarks humaneval
+ """
+ import argparse
+ import json
+ import os
+ import sys
+ import time
+ from typing import List, Optional, Tuple
+ 
+ import torch
+ import torch.nn as nn
+ import torch.distributed as dist
+ from tqdm import tqdm
+ from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, DynamicCache
+ 
+ # Add DFlash model path so we can import utils
+ sys.path.insert(0, "/workspace/models/Qwen3-8B-DFlash-b16")
+ from utils import extract_context_feature, sample
+ 
+ # ──────────────────────────────────────────────────────────────────
+ BASE_MODEL = "/workspace/models/Qwen3-8B"
+ DRAFT_MODEL = "/workspace/models/Qwen3-8B-DFlash-b16"
+ RESULT_DIR = "/workspace/hanrui/syxin_old/Specforge/benchmarks/results"
+ 
+ 
+ # ──────────────────────────────────────────────────────────────────
+ # Distributed helpers
+ # ──────────────────────────────────────────────────────────────────
+ def is_distributed():
+     return dist.is_available() and dist.is_initialized()
+ 
+ def get_rank():
+     return dist.get_rank() if is_distributed() else 0
+ 
+ def get_world_size():
+     return dist.get_world_size() if is_distributed() else 1
+ 
+ def is_main():
+     return get_rank() == 0
+ 
+ def print_rank0(*args, **kwargs):
+     if is_main():
+         print(*args, **kwargs)
+ 
+ def split_list(lst, rank, world_size):
+     return [x for i, x in enumerate(lst) if i % world_size == rank]
+ 
+ 
+ # ──────────────────────────────────────────────────────────────────
+ # Prompts
+ # ──────────────────────────────────────────────────────────────────
+ def load_prompts(bench_name: str, num_samples: Optional[int] = None) -> List[str]:
+     local_paths = {
+         "humaneval": "/workspace/hanrui/datasets/humaneval/test.jsonl",
+         "mtbench": "/workspace/hanrui/datasets/mtbench/question.jsonl",
+         "gsm8k": "/workspace/hanrui/datasets/gsm8k/test.jsonl",
+     }
+     prompts = []
+     path = local_paths.get(bench_name)
+     if path and os.path.exists(path):
+         with open(path) as f:
+             for line in f:
+                 item = json.loads(line)
+                 if bench_name == "humaneval":
+                     p = f"Write a solution to the following problem and make sure that it passes the tests:\n```python\n{item['prompt']}\n```"
+                 elif bench_name == "mtbench":
+                     p = item.get("turns", [item.get("prompt", "")])[0]
+                 elif bench_name == "gsm8k":
+                     p = item["question"] + "\nPlease reason step by step, and put your final answer within \\boxed{}."
+                 else:
+                     p = str(item)
+                 prompts.append(p)
+     else:
+         from datasets import load_dataset
+         if bench_name == "humaneval":
+             ds = load_dataset("openai/openai_humaneval", split="test")
+             prompts = [f"Write a solution to the following problem and make sure that it passes the tests:\n```python\n{x['prompt']}\n```" for x in ds]
+         elif bench_name == "mtbench":
+             ds = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
+             prompts = [x["prompt"][0] for x in ds]
+         elif bench_name == "gsm8k":
+             ds = load_dataset("openai/gsm8k", "main", split="test")
+             prompts = [x["question"] + "\nPlease reason step by step, and put your final answer within \\boxed{}." for x in ds]
+     if num_samples is not None:
+         prompts = prompts[:num_samples]
+     return prompts
+ 
+ 
+ # ──────────────────────────────────────────────────────────────────
+ # spec_generate with acceptance_lengths returned
+ # (Same logic as DFlashDraftModel.spec_generate but returns accept lens)
+ # ──────────────────────────────────────────────────────────────────
+ @torch.inference_mode()
+ def spec_generate_b16(
+     draft_model,
+     target_model: nn.Module,
+     input_ids: torch.LongTensor,
+     max_new_tokens: int = 512,
+     temperature: float = 0.0,
+     stop_token_ids: Optional[List[int]] = None,
+ ) -> Tuple[torch.Tensor, List[int]]:
+     """Same as DFlashDraftModel.spec_generate but also returns acceptance_lengths."""
+     draft_model.eval()
+     device = target_model.device if hasattr(target_model, 'device') else input_ids.device
+     num_input_tokens = input_ids.shape[1]
+     max_length = num_input_tokens + max_new_tokens
+     block_size = draft_model.block_size
+     mask_token_id = draft_model.mask_token_id
+ 
+     output_ids = torch.full(
+         (1, max_length + block_size), mask_token_id,
+         dtype=torch.long, device=device,
+     )
+     position_ids = torch.arange(output_ids.shape[1], device=device).unsqueeze(0)
+ 
+     past_key_values_target = DynamicCache()
+     past_key_values_draft = DynamicCache()
+ 
+     # Prefill
+     output = target_model(
+         input_ids,
+         position_ids=position_ids[:, :num_input_tokens],
+         past_key_values=past_key_values_target,
+         use_cache=True,
+         logits_to_keep=1,
+         output_hidden_states=True,
+     )
+     output_ids[:, :num_input_tokens] = input_ids
+     output_ids[:, num_input_tokens:num_input_tokens + 1] = sample(output.logits, temperature)
+     target_hidden = extract_context_feature(output.hidden_states, draft_model.target_layer_ids)
+ 
+     # Decode
+     acceptance_lengths = []
+     start = num_input_tokens
+     while start < max_length:
+         block_output_ids = output_ids[:, start:start + block_size].clone()
+         block_position_ids = position_ids[:, start:start + block_size]
+         noise_embedding = target_model.model.embed_tokens(block_output_ids)
+ 
+         draft_logits = target_model.lm_head(
+             draft_model(
+                 target_hidden=target_hidden,
+                 noise_embedding=noise_embedding,
+                 position_ids=position_ids[:, past_key_values_draft.get_seq_length():start + block_size],
+                 past_key_values=past_key_values_draft,
+                 use_cache=True,
+                 is_causal=False,
+             )[:, -block_size + 1:, :]
+         )
+         past_key_values_draft.crop(start)
+         block_output_ids[:, 1:] = sample(draft_logits)
+ 
+         output = target_model(
+             block_output_ids,
+             position_ids=block_position_ids,
+             past_key_values=past_key_values_target,
+             use_cache=True,
+             output_hidden_states=True,
+         )
+ 
+         posterior = sample(output.logits, temperature)
+         acceptance_length = (
+             (block_output_ids[:, 1:] == posterior[:, :-1])
+             .cumprod(dim=1).sum(dim=1)[0].item()
+         )
+         output_ids[:, start:start + int(acceptance_length) + 1] = block_output_ids[:, :int(acceptance_length) + 1]
+         output_ids[:, start + int(acceptance_length) + 1] = posterior[:, int(acceptance_length)]
+         start += int(acceptance_length) + 1
+         past_key_values_target.crop(start)
+         target_hidden = extract_context_feature(
+             output.hidden_states, draft_model.target_layer_ids
+         )[:, :int(acceptance_length) + 1, :]
+         acceptance_lengths.append(int(acceptance_length) + 1)
+ 
+         if stop_token_ids is not None and any(
+             sid in output_ids[:, num_input_tokens:start] for sid in stop_token_ids
+         ):
+             break
+ 
+     output_ids = output_ids[:, :max_length]
+     output_ids = output_ids[:, output_ids[0] != mask_token_id]
+     if stop_token_ids is not None:
+         stop_t = torch.tensor(stop_token_ids, device=output_ids.device)
+         stop_idx = torch.isin(output_ids[0][num_input_tokens:], stop_t).nonzero(as_tuple=True)[0]
+         if stop_idx.numel() > 0:
+             output_ids = output_ids[:, :num_input_tokens + stop_idx[0] + 1]
+ 
+     return output_ids, acceptance_lengths
+ 
+ 
+ # ──────────────────────────────────────────────────────────────────
+ def parse_args():
+     p = argparse.ArgumentParser()
+     p.add_argument("--base-model", default=BASE_MODEL)
+     p.add_argument("--draft-model", default=DRAFT_MODEL)
+     p.add_argument("--max-new-tokens", type=int, default=512)
+     p.add_argument("--temperature", type=float, default=0.0)
+     p.add_argument("--benchmarks", nargs="+", default=["humaneval", "mtbench", "gsm8k"])
+     p.add_argument("--num-samples", type=int, default=None)
+     p.add_argument("--output-dir", default=RESULT_DIR)
+     return p.parse_args()
+ 
+ 
+ def main():
+     args = parse_args()
+ 
+     local_rank = int(os.environ.get("LOCAL_RANK", 0))
+     world_size = int(os.environ.get("WORLD_SIZE", 1))
+ 
+     if world_size > 1:
+         dist.init_process_group(backend="nccl")
+         torch.cuda.set_device(local_rank)
+ 
+     device = f"cuda:{local_rank}"
+     rank = get_rank()
+ 
+     print_rank0(f"Running DFlash-b16 baseline on {world_size} GPU(s)")
+ 
+     # ── Load models ──
+     print_rank0(f"Loading target: {args.base_model}")
+     target_model = AutoModelForCausalLM.from_pretrained(
+         args.base_model,
+         torch_dtype=torch.bfloat16,
+         device_map=device,
+         trust_remote_code=True,
+     )
+     target_model.eval()
+ 
+     print_rank0(f"Loading DFlash-b16 draft: {args.draft_model}")
+     draft_model = AutoModel.from_pretrained(
+         args.draft_model,
+         torch_dtype=torch.bfloat16,
+         trust_remote_code=True,
+     ).to(device)
+     draft_model.eval()
+ 
+     tokenizer = AutoTokenizer.from_pretrained(args.base_model, trust_remote_code=True)
+     stop_token_ids = [tokenizer.eos_token_id]
+ 
+     print_rank0(f"DFlash-b16: block_size={draft_model.block_size}, "
+                 f"target_layer_ids={draft_model.target_layer_ids}, "
+                 f"num_layers={len(draft_model.layers)}")
+ 
+     # ── Run benchmarks ──
+     results = {"model": "Qwen3-8B-DFlash-b16", "type": "baseline",
+                "block_size": draft_model.block_size}
+ 
+     for bench_name in args.benchmarks:
+         print_rank0(f"\n{'='*60}")
+         print_rank0(f"Benchmark: {bench_name} ({world_size} GPUs)")
+         print_rank0(f"{'='*60}")
+ 
+         all_prompts = load_prompts(bench_name, args.num_samples)
+         my_prompts = split_list(all_prompts, rank, world_size)
+         print_rank0(f"Total {len(all_prompts)} prompts, ~{len(my_prompts)} per GPU")
+ 
+         local_accept_lengths = []
+         local_tokens = 0
+         t0 = time.time()
+ 
+         iterator = tqdm(my_prompts, desc=f"[GPU{rank}] {bench_name}", unit="sample",
+                         disable=(rank != 0))
+         for prompt in iterator:
+             messages = [{"role": "user", "content": prompt}]
+             text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+             input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
+ 
+             output_ids, accept_lens = spec_generate_b16(
+                 draft_model=draft_model,
+                 target_model=target_model,
+                 input_ids=input_ids,
+                 max_new_tokens=args.max_new_tokens,
+                 temperature=args.temperature,
+                 stop_token_ids=stop_token_ids,
+             )
+ 
+             local_accept_lengths.extend(accept_lens)
+             num_gen = output_ids.shape[1] - input_ids.shape[1]
+             local_tokens += num_gen
+ 
+             if rank == 0 and len(local_accept_lengths) > 0:
+                 avg = sum(local_accept_lengths) / len(local_accept_lengths)
+                 iterator.set_postfix(accept_len=f"{avg:.2f}", tokens=local_tokens, gen=num_gen)
+ 
+         elapsed = time.time() - t0
+ 
+         # ── Gather ──
+         if world_size > 1:
+             local_sum = torch.tensor(sum(local_accept_lengths), dtype=torch.float64, device=device)
+             local_count = torch.tensor(len(local_accept_lengths), dtype=torch.long, device=device)
+             local_tok = torch.tensor(local_tokens, dtype=torch.long, device=device)
+             dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
+             dist.all_reduce(local_count, op=dist.ReduceOp.SUM)
+             dist.all_reduce(local_tok, op=dist.ReduceOp.SUM)
+             total_accept_sum = local_sum.item()
+             total_count = local_count.item()
+             total_tokens = local_tok.item()
+         else:
+             total_accept_sum = sum(local_accept_lengths)
+             total_count = len(local_accept_lengths)
+             total_tokens = local_tokens
+ 
+         avg_accept_length = total_accept_sum / max(total_count, 1)
+         throughput = total_tokens / elapsed if elapsed > 0 else 0
+ 
+         print_rank0(f"\n{bench_name} Results:")
+         print_rank0(f" Avg Accept Length: {avg_accept_length:.3f}")
+         print_rank0(f" Total tokens: {total_tokens}")
+         print_rank0(f" Latency: {elapsed:.1f}s")
+         print_rank0(f" Throughput: {throughput:.1f} tok/s (aggregate {world_size} GPUs)")
+         print_rank0(f" Num verify rounds: {total_count}")
+         print_rank0(f" Num samples: {len(all_prompts)}")
+ 
+         results[bench_name] = {
+             "avg_accept_length": avg_accept_length,
+             "total_tokens": total_tokens,
+ "total_tokens": total_tokens,
330
+ "latency": elapsed,
331
+ "throughput": throughput,
332
+ "num_samples": len(all_prompts),
333
+ "num_verify_rounds": total_count,
334
+ "num_gpus": world_size,
335
+ }
336
+
337
+ # ── Save ──
338
+ if is_main():
339
+ os.makedirs(args.output_dir, exist_ok=True)
340
+ timestamp = time.strftime("%Y%m%d_%H%M%S")
341
+ result_file = os.path.join(
342
+ args.output_dir,
343
+ f"dflash_b16_baseline_offline_{timestamp}.json",
344
+ )
345
+ with open(result_file, "w") as f:
346
+ json.dump(results, f, indent=2)
347
+ print(f"\nResults saved to: {result_file}")
348
+
349
+ if world_size > 1:
350
+ dist.destroy_process_group()
351
+
352
+
353
+ if __name__ == "__main__":
354
+ main()
syxin_old/eval_dflash_b16_baseline_changelog.md ADDED
@@ -0,0 +1,143 @@
1
+ # eval_dflash_b16_baseline.py Change Log
2
+
3
+ The following issues were fixed after comparing against the official repo's `/workspace/hanrui/dflash/benchmark.py` and `run_benchmark.sh`.
4
+
5
+ ---
6
+
7
+ ## 1. [Critical] Add the `attn_implementation` argument
8
+
9
+ **Problem**: No attention implementation was specified when loading the models, so they defaulted to eager attention, with performance far below the paper's.
10
+
11
+ **Fix**: Add `attn_implementation="flash_attention_2"` to both target_model and draft_model (automatically falling back to `"sdpa"` when flash_attn is not installed).
12
+
13
+ ```python
14
+ # Before
15
+ target_model = AutoModelForCausalLM.from_pretrained(args.base_model, torch_dtype=torch.bfloat16, ...)
16
+ draft_model = AutoModel.from_pretrained(args.draft_model, torch_dtype=torch.bfloat16, ...)
17
+
18
+ # After
19
+ attn_impl = "flash_attention_2" if installed_flash_attn else "sdpa"
20
+ target_model = AutoModelForCausalLM.from_pretrained(args.base_model, torch_dtype=torch.bfloat16, attn_implementation=attn_impl, ...)
21
+ draft_model = AutoModel.from_pretrained(args.draft_model, torch_dtype=torch.bfloat16, attn_implementation=attn_impl, ...)
22
+ ```
23
+
24
+ ---
25
+
26
+ ## 2. [Critical] Add `enable_thinking=False`
27
+
28
+ **Problem**: Qwen3-series models enable thinking mode by default, inserting large amounts of `<think>...</think>` content into the output. This makes the generated content and length completely different from the paper's test conditions, so the acceptance-length metric is not comparable.
29
+
30
+ **Fix**: Add `enable_thinking=False` to the `apply_chat_template` call.
31
+
32
+ ```python
33
+ # Before
34
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
35
+
36
+ # After
37
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
38
+ ```
39
+
40
+ ---
41
+
42
+ ## 3. [Critical] Add an autoregressive baseline (block_size=1) to compute speedup
43
+
44
+ **Problem**: The original script only ran speculative decoding (block_size=16) and never the autoregressive baseline (block_size=1), so the decoding-speedup metric reported in the paper could not be computed.
45
+
46
+ **Fix**: For each prompt, first run a baseline with `block_size=1`, then run speculative decoding with `block_size=block_size`, and compute `speedup = t1 / tb`. The `spec_generate_b16` function gains a `block_size` parameter to support the block_size=1 mode.
47
+
48
+ ```python
49
+ # For each prompt:
50
+ _, _, t1 = spec_generate_b16(..., block_size=1, ...) # autoregressive
51
+ output_ids, accept_lens, tb = spec_generate_b16(..., block_size=block_size, ...) # speculative
52
+ # speedup = mean(t1) / mean(tb)
53
+ ```
54
+
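The mean-of-means convention above matters: the official speedup is the ratio of the mean AR time-per-token to the mean speculative time-per-token over all samples, not the mean of per-sample ratios. A minimal sketch, where `t1_times` and `tb_times` are hypothetical lists of per-sample `time_per_output_token` values:

```python
import numpy as np

def decoding_speedup(t1_times, tb_times):
    """Official-style speedup: mean over samples first, then the ratio
    (not the mean of per-sample ratios)."""
    return float(np.mean(t1_times) / np.mean(tb_times))

print(decoding_speedup([2.0, 4.0], [1.0, 2.0]))  # 2.0
```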
55
+ ---
56
+
57
+ ## 4. [Critical] Fix timing: `time.time()` → `cuda_time()`
58
+
59
+ **Problem**: The original script measured wall-clock time with `time.time()` without any CUDA synchronization; since GPU execution is asynchronous, the timings were inaccurate.
60
+
61
+ **Fix**: Add a `cuda_time()` helper (identical to the official one) that calls `torch.cuda.synchronize()` + `time.perf_counter()` at each timing point. `spec_generate_b16` now uses `cuda_time()` internally to measure the prefill and decode phases precisely, and returns `time_per_output_token`.
62
+
63
+ ```python
64
+ def cuda_time() -> float:
65
+ torch.cuda.synchronize()
66
+ return time.perf_counter()
67
+ ```
68
+
69
+ ---
70
+
71
+ ## 5. [Critical] Correct the draft prefill timing
72
+
73
+ **Problem**: The original script counted the draft model's first prefill in the decode phase, inflating time_per_token and deflating speedup.
74
+
75
+ **Fix**: In spec_generate_b16's decode loop, reset `decode_start` after the first draft forward completes (matching the official `draft_prefill` flag logic).
76
+
77
+ ```python
78
+ if draft_prefill:
79
+ draft_prefill = False
80
+ decode_start = cuda_time()  # reset to exclude the draft's first prefill
81
+ ```
82
+
83
+ ---
84
+
85
+ ## 6. [Important] Change the `max_new_tokens` default from 512 to 2048
86
+
87
+ **Problem**: The original script defaulted to `max_new_tokens=512`, while the official `run_benchmark.sh` uses `2048`. Generations that are too short yield too few samples for the acceptance-length statistics, making the metric incomparable with the paper.
88
+
89
+ **Fix**: Change the default to `2048`.
90
+
91
+ ```python
92
+ # Before
93
+ p.add_argument("--max-new-tokens", type=int, default=512)
94
+
95
+ # After
96
+ p.add_argument("--max-new-tokens", type=int, default=2048)
97
+ ```
98
+
99
+ ---
100
+
101
+ ## 7. [Important] Add fixed random seeds
102
+
103
+ **Problem**: The original script set no random seed, so repeated runs were not reproducible.
104
+
105
+ **Fix**: Add the same seed setup as the official code at the start of `main()`.
106
+
107
+ ```python
108
+ random.seed(0)
109
+ np.random.seed(0)
110
+ torch.manual_seed(0)
111
+ torch.cuda.manual_seed_all(0)
112
+ torch.backends.cudnn.deterministic = True
113
+ torch.backends.cudnn.benchmark = False
114
+ ```
115
+
116
+ ---
117
+
118
+ ## 8. [Important] Support a block_size=1 branch in spec_generate_b16
119
+
120
+ **Problem**: The original function hard-coded `draft_model.block_size` and always passed `output_hidden_states=True`. When block_size=1 (autoregressive baseline), the draft model should not be called and hidden states should not be extracted.
121
+
122
+ **Fix**:
123
+ - Add a `block_size` parameter
124
+ - Set `output_hidden_states` to True only when `block_size > 1`
125
+ - Run the draft-model forward and hidden-state extraction only when `block_size > 1`
126
+
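The three bullets above reduce to a single predicate on `block_size`. A minimal sketch of the implied per-step behavior (`spec_step_config` is a hypothetical helper for illustration, not a function in the script):

```python
def spec_step_config(block_size: int) -> dict:
    """Per-step behavior implied by the block_size branch:
    block_size == 1 -> pure autoregressive (no draft call, no hidden states);
    block_size  > 1 -> speculative (draft forward + hidden-state extraction)."""
    use_draft = block_size > 1
    return {
        "output_hidden_states": use_draft,  # only the draft model consumes them
        "run_draft_forward": use_draft,
        "tokens_proposed_per_round": block_size - 1 if use_draft else 0,
    }

print(spec_step_config(1))   # autoregressive: everything off
print(spec_step_config(16))  # speculative: 15 draft tokens proposed per round
```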
127
+ ---
128
+
129
+ ## 9. [Minor] Add speedup and an acceptance histogram to the output
130
+
131
+ **Fix**: The results output now additionally includes:
132
+ - `Decoding speedup: X.XXx` (the t1/tb ratio)
133
+ - `Acceptance length histogram` (the fraction of rounds at each acceptance length)
134
+
135
+ This aligns the output format with the official benchmark.py.
136
+
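The histogram is a simple frequency count over the per-round acceptance lengths. A sketch, assuming inputs shaped like the script's `acceptance_lengths` list and `block_size` (the helper name is illustrative):

```python
from collections import Counter

def acceptance_histogram(acceptance_lengths, block_size):
    """Fraction of verify rounds whose acceptance length equals b,
    for b = 0..block_size."""
    counts = Counter(acceptance_lengths)
    total = max(len(acceptance_lengths), 1)
    return [counts[b] / total for b in range(block_size + 1)]

print(acceptance_histogram([1, 2, 2, 4], 4))  # [0.0, 0.25, 0.5, 0.0, 0.25]
```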
137
+ ---
138
+
139
+ ## Items left unchanged
140
+
141
+ - **Model choice**: Keep Qwen3-8B (rather than the official default Qwen3-4B), since the local model is the 8B variant
142
+ - **Draft model loading**: Keep `AutoModel.from_pretrained` (relying on `trust_remote_code=True`) instead of the official `DFlashDraftModel`, since it needs the remote-code support shipped in the model directory
143
+ - **Dataset scope**: Keep the original 3 benchmarks (humaneval/mtbench/gsm8k) rather than expanding to the official 10
syxin_old/eval_dflash_lora_inject.py ADDED
@@ -0,0 +1,660 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Offline evaluation for DFlash-LoRA-Inject: measure accepted length & speedup.
4
+ Aligned with official DFlash benchmark.py methodology.
5
+
6
+ Unlike DFlash-b16, which uses a small 5-layer draft model with fc/hidden_norm,
7
+ LoRA-Inject uses a full Qwen3-8B with LoRA adapters that receives target hidden
8
+ states via layer-by-layer injection.
9
+
10
+ Usage:
11
+ conda activate spec
12
+
13
+ # 8 GPU parallel (default, all 10 benchmarks)
14
+ torchrun --nproc_per_node 8 eval_dflash_lora_inject.py
15
+
16
+ # single GPU
17
+ python3 eval_dflash_lora_inject.py
18
+
19
+ # specific checkpoint / benchmark
20
+ torchrun --nproc_per_node 8 eval_dflash_lora_inject.py --ckpt epoch_0_step_1000 --datasets humaneval
21
+
22
+ # quick test
23
+ torchrun --nproc_per_node 8 eval_dflash_lora_inject.py --max-samples 20
24
+ """
25
+ import argparse
26
+ import json
27
+ import os
28
+ import random
29
+ import sys
30
+ import time
31
+ import warnings
32
+ from itertools import chain
33
+ from types import SimpleNamespace
34
+ from typing import List, Optional, Tuple
35
+
36
+ import numpy as np
37
+ import torch
38
+ import torch.nn as nn
39
+ import torch.distributed as dist
40
+ from peft import PeftModel
41
+ from tqdm import tqdm
42
+ from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
43
+
44
+ # Import official dataset loader
45
+ sys.path.insert(0, "/workspace/hanrui/dflash")
46
+ from model.utils import load_and_process_dataset
47
+
48
+ # ──────────────────────────────────────────────────────────────────
49
+ # Config defaults
50
+ # ──────────────────────────────────────────────────────────────────
51
+ BASE_MODEL = "/workspace/models/Qwen3-8B"
52
+ ADAPTER_ROOT = "/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject"
53
+ DEFAULT_CKPT = "epoch_3_step_4644"
54
+ MASK_TOKEN_ID = 151669 # Qwen3 <|mask|>
55
+ BLOCK_SIZE = 16
56
+ RESULT_DIR = "/workspace/hanrui/syxin_old/Specforge/benchmarks/results"
57
+
58
+ # Official benchmark tasks (from run_benchmark.sh)
59
+ OFFICIAL_TASKS = {
60
+ "gsm8k": 128,
61
+ "math500": 128,
62
+ "aime24": 30,
63
+ "aime25": 30,
64
+ "humaneval": 164,
65
+ "mbpp": 128,
66
+ "livecodebench": 128,
67
+ "swe-bench": 128,
68
+ "mt-bench": 80,
69
+ "alpaca": 128,
70
+ }
71
+
72
+
73
+ # ──────────────────────────────────────────────────────────────────
74
+ # CUDA-synchronised timer (matches official benchmark.py)
75
+ # ──────────────────────────────────────────────────────────────────
76
+ def cuda_time() -> float:
77
+ torch.cuda.synchronize()
78
+ return time.perf_counter()
79
+
80
+
81
+ def has_flash_attn() -> bool:
82
+ try:
83
+ import flash_attn # noqa: F401
84
+ return True
85
+ except ImportError:
86
+ print("[WARN] flash_attn not installed, falling back to sdpa.")
87
+ return False
88
+
89
+
90
+ # ──────────────────────────────────────────────────────────────────
91
+ # Distributed helpers (mirrors official distributed.py)
92
+ # ──────────────────────────────────────────────────────────────────
93
+ def dist_init():
94
+ if "RANK" not in os.environ:
95
+ warnings.warn("RANK not set. Skipping distributed init.")
96
+ return
97
+ dist.init_process_group(backend="nccl", init_method="env://")
98
+
99
+ def dist_rank():
100
+ return int(os.environ.get("RANK", 0))
101
+
102
+ def dist_size():
103
+ return int(os.environ.get("WORLD_SIZE", 1))
104
+
105
+ def dist_local_rank():
106
+ return int(os.environ.get("LOCAL_RANK", 0))
107
+
108
+ def dist_is_main():
109
+ return dist_rank() == 0
110
+
111
+ def dist_gather(obj, dst=0):
112
+ if not dist.is_initialized():
113
+ return [obj]
114
+ if dist_is_main():
115
+ objs = [None for _ in range(dist_size())]
116
+ dist.gather_object(obj, objs, dst=dst)
117
+ return objs
118
+ else:
119
+ dist.gather_object(obj, dst=dst)
120
+ return None
121
+
122
+ def print_rank0(*args, **kwargs):
123
+ if dist_is_main():
124
+ print(*args, **kwargs)
125
+
126
+
127
+ # ──────────────────────────────────────────────────────────────────
128
+ # Sampling (matches official model/utils.py::sample)
129
+ # ──────────────────────────────────────────────────────────────────
130
+ def sample(logits: torch.Tensor, temperature: float = 0.0) -> torch.Tensor:
131
+ if temperature < 1e-5:
132
+ return torch.argmax(logits, dim=-1)
133
+ bsz, seq_len, vocab_size = logits.shape
134
+ logits = logits.view(-1, vocab_size)
135
+ logits = logits / temperature
136
+ probs = torch.softmax(logits, dim=-1)
137
+ return torch.multinomial(probs, num_samples=1).view(bsz, seq_len)
138
+
139
+
140
+ # ──────────────────────────────────────────────────────────────────
141
+ # Build DFlash attention mask (vectorized, no Python loops)
142
+ # ──────────────────────────────────────────────────────────────────
143
+ def build_dflash_mask(ctx_len: int, block_size: int, device, dtype=torch.bfloat16):
144
+ """
145
+ Build DFlash attention mask for [context | block] sequence.
146
+ - Context part: standard causal
147
+ - Block part: each token sees all context + all tokens in same block (bidirectional)
148
+ """
149
+ full_len = ctx_len + block_size
150
+ neg_inf = torch.finfo(dtype).min
151
+
152
+ mask = torch.full((1, 1, full_len, full_len), neg_inf, device=device, dtype=dtype)
153
+
154
+ if ctx_len > 0:
155
+ ctx_rows = torch.arange(ctx_len, device=device)
156
+ ctx_cols = torch.arange(ctx_len, device=device)
157
+ causal = ctx_cols.unsqueeze(0) <= ctx_rows.unsqueeze(1)
158
+ mask[0, 0, :ctx_len, :ctx_len].masked_fill_(causal, 0)
159
+
160
+ if ctx_len > 0:
161
+ mask[0, 0, ctx_len:, :ctx_len] = 0
162
+ mask[0, 0, ctx_len:, ctx_len:] = 0
163
+
164
+ return mask
165
+
166
+
167
+ # ──────────────────────────────────────────────────────────────────
168
+ # Pure autoregressive generation (target model only, no draft)
169
+ # Used for AR baseline timing -- avoids inflating AR time with draft overhead.
170
+ # ──────────────────────────────────────────────────────────────────
171
+ @torch.inference_mode()
172
+ def ar_generate(
173
+ target_model: nn.Module,
174
+ input_ids: torch.LongTensor,
175
+ max_new_tokens: int = 2048,
176
+ mask_token_id: int = MASK_TOKEN_ID,
177
+ temperature: float = 0.0,
178
+ stop_token_ids: Optional[List[int]] = None,
179
+ ) -> SimpleNamespace:
180
+ """
181
+ Pure autoregressive generation using only the target model.
182
+ Mirrors official benchmark.py with block_size=1 (no draft model involved).
183
+ Returns SimpleNamespace matching official dflash_generate output format.
184
+ """
185
+ device = input_ids.device
186
+ num_input_tokens = input_ids.shape[1]
187
+ max_length = num_input_tokens + max_new_tokens
188
+
189
+ output_ids = torch.full(
190
+ (1, max_length + 1), mask_token_id,
191
+ dtype=torch.long, device=device,
192
+ )
193
+ output_ids[:, :num_input_tokens] = input_ids
194
+ position_ids = torch.arange(output_ids.shape[1], device=device).unsqueeze(0)
195
+ past_key_values = DynamicCache()
196
+
197
+ # Prefill
198
+ prefill_start = cuda_time()
199
+ output = target_model(
200
+ input_ids,
201
+ position_ids=position_ids[:, :num_input_tokens],
202
+ past_key_values=past_key_values,
203
+ use_cache=True,
204
+ logits_to_keep=1,
205
+ output_hidden_states=False,
206
+ )
207
+ first_token = sample(output.logits, temperature)
208
+ output_ids[:, num_input_tokens:num_input_tokens + 1] = first_token
209
+ time_to_first_token = cuda_time() - prefill_start
210
+
211
+ # Decode (autoregressive, one token at a time)
212
+ decode_start = cuda_time()
213
+ start = num_input_tokens
214
+
215
+ while start < max_length:
216
+ cur_token = output_ids[:, start:start + 1]
217
+ cur_pos = position_ids[:, start:start + 1]
218
+
219
+ output = target_model(
220
+ cur_token,
221
+ position_ids=cur_pos,
222
+ past_key_values=past_key_values,
223
+ use_cache=True,
224
+ output_hidden_states=False,
225
+ )
226
+
227
+ next_token = sample(output.logits, temperature)
228
+ start += 1
229
+ output_ids[:, start:start + 1] = next_token
230
+ past_key_values.crop(start)
231
+
232
+ # Check stop tokens (matches official: check all generated)
233
+ if stop_token_ids is not None and any(
234
+ sid in output_ids[:, num_input_tokens:] for sid in stop_token_ids
235
+ ):
236
+ break
237
+
238
+ output_ids = output_ids[:, :max_length]
239
+ output_ids = output_ids[:, output_ids[0] != mask_token_id]
240
+ if stop_token_ids is not None:
241
+ stop_t = torch.tensor(stop_token_ids, device=output_ids.device)
242
+ stop_idx = torch.isin(output_ids[0][num_input_tokens:], stop_t).nonzero(as_tuple=True)[0]
243
+ if stop_idx.numel() > 0:
244
+ output_ids = output_ids[:, :num_input_tokens + stop_idx[0] + 1]
245
+
246
+ num_output_tokens = output_ids.shape[1] - num_input_tokens
247
+ total_decode_time = cuda_time() - decode_start
248
+ time_per_output_token = total_decode_time / max(num_output_tokens, 1)
249
+
250
+ return SimpleNamespace(
251
+ output_ids=output_ids,
252
+ num_input_tokens=num_input_tokens,
253
+ num_output_tokens=num_output_tokens,
254
+ time_to_first_token=time_to_first_token,
255
+ time_per_output_token=time_per_output_token,
256
+ acceptance_lengths=[1] * max(num_output_tokens, 0), # AR: always 1
257
+ )
258
+
259
+
260
+ # ──────────────────────────────────────────────────────────────────
261
+ # Core: spec_generate with layer-by-layer injection (KV-cached)
262
+ # ──────────────────────────────────────────────────────────────────
263
+ @torch.inference_mode()
264
+ def spec_generate_inject(
265
+ target_model: nn.Module,
266
+ draft_model: nn.Module,
267
+ input_ids: torch.LongTensor,
268
+ max_new_tokens: int = 2048,
269
+ block_size: int = 16,
270
+ mask_token_id: int = MASK_TOKEN_ID,
271
+ temperature: float = 0.0,
272
+ stop_token_ids: Optional[List[int]] = None,
273
+ ) -> SimpleNamespace:
274
+ """
275
+ Speculative generation using DFlash-LoRA-Inject inference pattern.
276
+ Returns SimpleNamespace matching official dflash_generate output format.
277
+ """
278
+ device = input_ids.device
279
+ num_input_tokens = input_ids.shape[1]
280
+ max_length = num_input_tokens + max_new_tokens
281
+
282
+ draft_layers = draft_model.model.layers
283
+ draft_norm = draft_model.model.norm
284
+ draft_lm_head = draft_model.lm_head
285
+ rotary_emb = draft_model.model.rotary_emb
286
+ num_layers = len(draft_layers)
287
+
288
+ output_ids = torch.full(
289
+ (1, max_length + block_size), mask_token_id,
290
+ dtype=torch.long, device=device,
291
+ )
292
+ output_ids[:, :num_input_tokens] = input_ids
293
+
294
+ # ── Prefill: target with KV cache + hidden states ──
295
+ prefill_start = cuda_time()
296
+ target_kv = DynamicCache()
297
+ target_output = target_model(
298
+ input_ids,
299
+ past_key_values=target_kv,
300
+ use_cache=True,
301
+ output_hidden_states=True,
302
+ )
303
+ first_token = sample(target_output.logits[:, -1:, :], temperature)
304
+ output_ids[:, num_input_tokens] = first_token.squeeze()
305
+
306
+ ctx_hidden_per_layer = [
307
+ target_output.hidden_states[i]
308
+ for i in range(num_layers)
309
+ ]
310
+
311
+ time_to_first_token = cuda_time() - prefill_start
312
+
313
+ # Decode
314
+ decode_start = cuda_time()
315
+ acceptance_lengths = []
316
+ start = num_input_tokens
317
+ draft_prefill = True
318
+
319
+ while start < max_length:
320
+ end = min(start + block_size, max_length)
321
+ actual_block_size = end - start
322
+
323
+ block_ids = output_ids[:, start:end].clone()
324
+
325
+ # ── FIX: Get anchor's target hidden state before draft forward ──
326
+ # Training: block k's draft tokens see target hidden states at positions
327
+ # 0..k*block_size INCLUSIVE (the anchor position). But in eval, ctx_hidden
328
+ # only covers 0..start-1. We must process the anchor through the target
329
+ # model to get its hidden state, matching training's attention pattern.
330
+ anchor_token = output_ids[:, start:start + 1]
331
+ anchor_pos = torch.tensor([[start]], device=device)
332
+ anchor_output = target_model(
333
+ anchor_token,
334
+ position_ids=anchor_pos,
335
+ past_key_values=target_kv,
336
+ use_cache=True,
337
+ output_hidden_states=True,
338
+ )
339
+ # Save anchor hidden states (one per layer)
340
+ for i in range(num_layers):
341
+ anchor_hs = anchor_output.hidden_states[i] # [1, 1, hidden_dim]
342
+ ctx_hidden_per_layer[i] = torch.cat([ctx_hidden_per_layer[i], anchor_hs], dim=1)
343
+ # Roll back KV cache: verification will re-process from position start
344
+ target_kv.crop(start)
345
+
346
+ # ── Draft: forward with layer-by-layer injection ──
347
+ draft_hidden = draft_model.model.embed_tokens(block_ids)
348
+ ctx_len = ctx_hidden_per_layer[0].shape[1] # now includes anchor at position start
349
+
350
+ dflash_mask = build_dflash_mask(ctx_len, actual_block_size, device)
351
+
352
+ # Position IDs: context covers [0..start], block covers [start..start+bs-1]
353
+ # Position 'start' appears twice (in both context and block), matching
354
+ # training where target and draft share the same position IDs.
355
+ ctx_positions = torch.arange(ctx_len, device=device)
356
+ block_positions = torch.arange(start, start + actual_block_size, device=device)
357
+ combined_pos = torch.cat([ctx_positions, block_positions], dim=0).unsqueeze(0)
358
+
359
+ dummy_combined = torch.empty(1, ctx_len + actual_block_size, draft_hidden.shape[-1],
360
+ device=device, dtype=torch.bfloat16)
361
+ position_embeddings = rotary_emb(dummy_combined, combined_pos)
362
+
363
+ for layer_idx in range(num_layers):
364
+ target_ctx = ctx_hidden_per_layer[layer_idx]
365
+ combined = torch.cat([target_ctx, draft_hidden], dim=1)
366
+
367
+ layer_output = draft_layers[layer_idx](
368
+ combined,
369
+ attention_mask=dflash_mask,
370
+ position_ids=combined_pos,
371
+ position_embeddings=position_embeddings,
372
+ )
373
+ if isinstance(layer_output, tuple):
374
+ layer_output = layer_output[0]
375
+ draft_hidden = layer_output[:, ctx_len:, :]
376
+
377
+ draft_hidden = draft_norm(draft_hidden)
378
+ draft_logits = draft_lm_head(draft_hidden)
379
+
380
+ draft_predictions = sample(draft_logits[:, 1:, :], temperature)
381
+ block_ids[:, 1:actual_block_size] = draft_predictions[:, :actual_block_size - 1]
382
+
383
+ # Exclude draft's first prefill from decode timing (matches official pattern)
384
+ if draft_prefill:
385
+ draft_prefill = False
386
+ decode_start = cuda_time()
387
+
388
+ # ── Verify: target forward on block tokens (with KV cache) ──
389
+ position_ids_block = torch.arange(
390
+ start, start + actual_block_size, device=device
391
+ ).unsqueeze(0)
392
+
393
+ target_verify = target_model(
394
+ block_ids,
395
+ position_ids=position_ids_block,
396
+ past_key_values=target_kv,
397
+ use_cache=True,
398
+ output_hidden_states=True,
399
+ )
400
+ target_tokens = sample(target_verify.logits, temperature)
401
+
402
+ # Acceptance
403
+ matches = (block_ids[:, 1:actual_block_size] == target_tokens[:, :actual_block_size - 1])
404
+ acceptance_length = int(matches.cumprod(dim=1).sum(dim=1)[0].item())
405
+
406
+ output_ids[:, start:start + acceptance_length + 1] = block_ids[:, :acceptance_length + 1]
407
+ output_ids[:, start + acceptance_length + 1] = target_tokens[:, acceptance_length]
408
+
409
+ accepted_end = start + acceptance_length + 1
410
+ target_kv.crop(accepted_end)
411
+
412
+ # Remove the anchor hidden state we added above (it's position start);
413
+ # instead, save the verification's hidden states which include the anchor
414
+ # and accepted tokens computed with the correct full KV context.
415
+ for i in range(num_layers):
416
+ # Drop the anchor we appended earlier (last entry in ctx_hidden)
417
+ ctx_hidden_per_layer[i] = ctx_hidden_per_layer[i][:, :-1, :]
418
+ # Add verification hidden states for accepted positions
419
+ new_hidden = target_verify.hidden_states[i][:, :acceptance_length + 1, :]
420
+ ctx_hidden_per_layer[i] = torch.cat([ctx_hidden_per_layer[i], new_hidden], dim=1)
421
+
422
+ start += acceptance_length + 1
423
+ acceptance_lengths.append(acceptance_length + 1)
424
+
425
+ # Official: check ALL generated tokens
426
+ if stop_token_ids is not None and any(
427
+ sid in output_ids[:, num_input_tokens:] for sid in stop_token_ids
428
+ ):
429
+ break
430
+
431
+ output_ids = output_ids[:, :min(start, max_length)]
432
+ output_ids = output_ids[:, output_ids[0] != mask_token_id]
433
+ if stop_token_ids is not None:
434
+ stop_t = torch.tensor(stop_token_ids, device=output_ids.device)
435
+ stop_idx = torch.isin(output_ids[0][num_input_tokens:], stop_t).nonzero(as_tuple=True)[0]
436
+ if stop_idx.numel() > 0:
437
+ output_ids = output_ids[:, :num_input_tokens + stop_idx[0] + 1]
438
+
439
+ num_output_tokens = output_ids.shape[1] - num_input_tokens
440
+ total_decode_time = cuda_time() - decode_start
441
+ time_per_output_token = total_decode_time / max(num_output_tokens, 1)
442
+
443
+ return SimpleNamespace(
444
+ output_ids=output_ids,
445
+ num_input_tokens=num_input_tokens,
446
+ num_output_tokens=num_output_tokens,
447
+ time_to_first_token=time_to_first_token,
448
+ time_per_output_token=time_per_output_token,
449
+ acceptance_lengths=acceptance_lengths,
450
+ )
451
+
452
+
453
+ # ──────────────────────────────────────────────────────────────────
454
+ # Main
455
+ # ──────────────────────────────────────────────────────────────────
456
+ def parse_args():
457
+ p = argparse.ArgumentParser(description="Offline eval for DFlash-LoRA-Inject (aligned with official)")
458
+ p.add_argument("--base-model", default=BASE_MODEL)
459
+ p.add_argument("--adapter-root", default=ADAPTER_ROOT)
460
+ p.add_argument("--ckpt", default=DEFAULT_CKPT, help="Checkpoint folder name")
461
+ p.add_argument("--merged-path",
462
+ default="/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged",
463
+ help="Path to pre-merged model. If None, will merge on the fly.")
464
+ p.add_argument("--block-size", type=int, default=BLOCK_SIZE)
465
+ p.add_argument("--max-new-tokens", type=int, default=2048,
466
+ help="Max new tokens per turn (official shell uses 2048)")
467
+ p.add_argument("--temperature", type=float, default=0.0)
468
+ p.add_argument("--datasets", nargs="+", default=list(OFFICIAL_TASKS.keys()),
469
+ help="Benchmarks to run (default: all 10 official tasks)")
470
+ p.add_argument("--max-samples", type=int, default=None,
471
+ help="Override max samples per dataset (None = use official per-task counts)")
472
+ p.add_argument("--output-dir", default=RESULT_DIR)
473
+ return p.parse_args()
474
+
475
+
476
+ def main():
477
+ args = parse_args()
478
+
479
+ # Fix random seeds (matches official)
480
+ random.seed(0)
481
+ np.random.seed(0)
482
+ torch.manual_seed(0)
483
+ torch.cuda.manual_seed_all(0)
484
+ torch.backends.cudnn.deterministic = True
485
+ torch.backends.cudnn.benchmark = False
486
+
487
+ # ── Init distributed ──
488
+ dist_init()
489
+ torch.cuda.set_device(dist_local_rank())
490
+ device = torch.device(f"cuda:{dist_local_rank()}")
491
+
492
+ print_rank0(f"Running on {dist_size()} GPU(s)")
493
+
494
+ # Detect flash_attn (only for target model; draft needs sdpa for custom DFlash mask)
495
+ installed_flash_attn = has_flash_attn()
496
+ target_attn_impl = "flash_attention_2" if installed_flash_attn else "sdpa"
497
+ draft_attn_impl = "sdpa" # DFlash injection uses custom attention mask
498
+ print_rank0(f"Using attn_implementation: target={target_attn_impl}, draft={draft_attn_impl}")
499
+
500
+ # ── Load models ──
501
+ print_rank0(f"Loading target model: {args.base_model}")
502
+ target_model = AutoModelForCausalLM.from_pretrained(
503
+ args.base_model,
504
+ torch_dtype=torch.bfloat16,
505
+ attn_implementation=target_attn_impl,
506
+ device_map=device,
507
+ trust_remote_code=True,
508
+ )
509
+ target_model.eval()
510
+
511
+ if args.merged_path and os.path.isdir(args.merged_path):
512
+ print_rank0(f"Loading pre-merged draft model: {args.merged_path}")
513
+ draft_model = AutoModelForCausalLM.from_pretrained(
514
+ args.merged_path,
515
+ torch_dtype=torch.bfloat16,
516
+ attn_implementation=draft_attn_impl,
517
+ device_map=device,
518
+ trust_remote_code=True,
519
+ )
520
+ else:
521
+ adapter_path = os.path.join(args.adapter_root, args.ckpt)
522
+ print_rank0(f"Loading base + LoRA adapter: {adapter_path}")
523
+ draft_model = AutoModelForCausalLM.from_pretrained(
524
+ args.base_model,
525
+ torch_dtype=torch.bfloat16,
526
+ attn_implementation=draft_attn_impl,
527
+ device_map=device,
528
+ trust_remote_code=True,
529
+ )
530
+ draft_model = PeftModel.from_pretrained(draft_model, adapter_path)
531
+ draft_model = draft_model.merge_and_unload()
532
+ draft_model.eval()
533
+
534
+ tokenizer = AutoTokenizer.from_pretrained(args.base_model, trust_remote_code=True)
535
+ stop_token_ids = [tokenizer.eos_token_id]
536
+
537
+ block_size = args.block_size
538
+
539
+ # ── Run benchmarks ──
540
+ all_results = {"model": f"dflash-lora-inject/{args.ckpt}", "block_size": block_size}
541
+
542
+ for dataset_name in args.datasets:
543
+ print_rank0(f"\n{'=' * 60}")
544
+ print_rank0(f"Benchmark: {dataset_name} ({dist_size()} GPUs)")
545
+ print_rank0(f"{'=' * 60}")
546
+
547
+ # Load dataset using official loader
548
+ dataset = load_and_process_dataset(dataset_name)
549
+
550
+ # Sample selection: official uses shuffle(seed=0).select()
551
+ max_samples = args.max_samples if args.max_samples is not None else OFFICIAL_TASKS.get(dataset_name)
552
+ if max_samples is not None and len(dataset) > max_samples:
553
+ dataset = dataset.shuffle(seed=0).select(range(max_samples))
554
+
555
+ print_rank0(f"Total {len(dataset)} samples, distributed across {dist_size()} GPUs")
556
+
557
+ responses = []
558
+ indices = range(dist_rank(), len(dataset), dist_size())
559
+
560
+ iterator = tqdm(indices, desc=f"[GPU{dist_rank()}] {dataset_name}",
561
+ unit="sample", disable=not dist_is_main())
562
+
563
+ for idx in iterator:
564
+ instance = dataset[idx]
565
+
566
+ # Multi-turn support (matches official benchmark.py)
567
+ messages = []
568
+ for turn_index, user_content in enumerate(instance["turns"]):
569
+ messages.append({"role": "user", "content": user_content})
570
+ input_text = tokenizer.apply_chat_template(
571
+ messages, tokenize=False, add_generation_prompt=True,
572
+ enable_thinking=False,
573
+ )
574
+ input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
575
+
576
+ response = {}
577
+
578
+ # AR baseline: pure target-only autoregressive (no draft overhead)
579
+ response[1] = ar_generate(
580
+ target_model=target_model,
581
+ input_ids=input_ids,
582
+ max_new_tokens=args.max_new_tokens,
583
+ mask_token_id=MASK_TOKEN_ID,
584
+ temperature=args.temperature,
585
+ stop_token_ids=stop_token_ids,
586
+ )
587
+
588
+ # Speculative: DFlash-LoRA-Inject
589
+ response[block_size] = spec_generate_inject(
590
+ target_model=target_model,
591
+ draft_model=draft_model,
592
+ input_ids=input_ids,
593
+ max_new_tokens=args.max_new_tokens,
594
+ block_size=block_size,
595
+ mask_token_id=MASK_TOKEN_ID,
596
+ temperature=args.temperature,
597
+ stop_token_ids=stop_token_ids,
598
+ )
599
+
600
+ # Append assistant response for multi-turn context
601
+ spec_response = response[block_size]
602
+ generated_ids = spec_response.output_ids[0, spec_response.num_input_tokens:]
603
+ output_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
604
+ messages.append({"role": "assistant", "content": output_text})
605
+ responses.append(response)
606
+
607
+ if dist_is_main() and responses:
608
+ recent_tau = np.mean([np.mean(r[block_size].acceptance_lengths) for r in responses[-5:]])
609
+ iterator.set_postfix(accept_len=f"{recent_tau:.2f}")
610
+
611
+ # ── Gather to rank 0 (matches official) ──
612
+ if dist_size() > 1:
613
+ gathered = dist_gather(responses, dst=0)
614
+ if not dist_is_main():
615
+ continue
616
+ responses = list(chain(*gathered))
617
+ elif not dist_is_main():
618
+ continue
619
+
620
+ # ── Compute metrics (exact official formulas) ──
621
+ t1 = np.mean([r[1].time_per_output_token for r in responses])
622
+ tb = np.mean([r[block_size].time_per_output_token for r in responses])
623
+ speedup = t1 / tb if tb > 0 else 0
624
+
625
+ # Acceptance length: per-sample mean, then mean of means (official)
626
+ tau = np.mean([np.mean(r[block_size].acceptance_lengths) for r in responses])
627
+
628
+ # Histogram
629
+ acceptance_lengths = list(chain(*[r[block_size].acceptance_lengths for r in responses]))
630
+ histogram = [acceptance_lengths.count(b) / len(acceptance_lengths) for b in range(block_size + 1)]
631
+
632
+ print_rank0(f"\n{dataset_name} Results:")
633
+ print_rank0(f" Decoding speedup: {speedup:.2f}x")
634
+ print_rank0(f" Average Acceptance length: {tau:.2f}")
635
+ print_rank0(f" Acceptance length histogram: {[f'{x * 100:.1f}%' for x in histogram]}")
636
+ print_rank0(f" Num responses: {len(responses)}")
637
+
638
+ all_results[dataset_name] = {
639
+ "decoding_speedup": speedup,
640
+ "avg_accept_length": tau,
641
+ "acceptance_histogram": histogram,
642
+ "num_responses": len(responses),
643
+ "num_gpus": dist_size(),
644
+ }
645
+
646
+ # ── Save results ──
647
+ if dist_is_main():
648
+ os.makedirs(args.output_dir, exist_ok=True)
649
+ timestamp = time.strftime("%Y%m%d_%H%M%S")
650
+ result_file = os.path.join(
651
+ args.output_dir,
652
+ f"dflash_lora_inject_offline_{args.ckpt}_{timestamp}.json",
653
+ )
654
+ with open(result_file, "w") as f:
655
+ json.dump(all_results, f, indent=2)
656
+ print(f"\nResults saved to: {result_file}")
657
+
658
+
659
+ if __name__ == "__main__":
660
+ main()
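The metric formulas used at the end of `main()` can be restated standalone. This is a minimal sketch with hypothetical names (`summarize_acceptance`, `t_ar`, `t_spec`), mirroring the same three computations: per-sample-mean-of-means acceptance length, the acceptance-length histogram, and the AR/speculative speedup ratio:

```python
import numpy as np

def summarize_acceptance(acceptance_lengths_per_response, block_size, t_ar, t_spec):
    # Acceptance length: per-sample mean, then mean of means (as in the script)
    tau = float(np.mean([np.mean(a) for a in acceptance_lengths_per_response]))
    # Histogram over all acceptance lengths, one bucket per 0..block_size
    flat = [x for a in acceptance_lengths_per_response for x in a]
    histogram = [flat.count(b) / len(flat) for b in range(block_size + 1)]
    # Speedup = AR time-per-output-token / speculative time-per-output-token
    speedup = t_ar / t_spec if t_spec > 0 else 0.0
    return tau, histogram, speedup

tau, hist, speedup = summarize_acceptance([[2, 2, 1], [2, 3]], block_size=4,
                                          t_ar=10.0, t_spec=5.0)
```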
syxin_old/eval_gsm8k_humaneval_mtbench.log ADDED
@@ -0,0 +1,81 @@
1
+ nohup: ignoring input
2
+ WARNING:__main__:
3
+ *****************************************
4
+ Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
5
+ *****************************************
6
+ [W324 11:41:43.200488949 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
7
+ [W324 11:41:43.200586722 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
8
+ [W324 11:41:43.267031138 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
9
+ [W324 11:41:43.267675225 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
10
+ [W324 11:41:43.279640318 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
11
+ [W324 11:41:43.291758156 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
12
+ [W324 11:41:43.328250126 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
13
+ [W324 11:41:43.335890706 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
14
+ Running on 8 GPU(s)
15
+ Using attn_implementation: target=flash_attention_2, draft=sdpa
16
+ Loading target model: /workspace/models/Qwen3-8B
17
+ `torch_dtype` is deprecated! Use `dtype` instead!
18
+ `torch_dtype` is deprecated! Use `dtype` instead!
19
+ `torch_dtype` is deprecated! Use `dtype` instead!
20
+ `torch_dtype` is deprecated! Use `dtype` instead!
21
+ `torch_dtype` is deprecated! Use `dtype` instead!
22
+ `torch_dtype` is deprecated! Use `dtype` instead!
23
+ `torch_dtype` is deprecated! Use `dtype` instead!
24
+ `torch_dtype` is deprecated! Use `dtype` instead!
25
+
26
+
27
+
28
+
29
+
30
+
31
+
32
+
33
+ Loading base + LoRA adapter: /workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_4644
34
+
35
+
36
+
37
+
38
+
39
+
40
+
41
+
42
+
43
+ ============================================================
44
+ Benchmark: gsm8k (8 GPUs)
45
+ ============================================================
46
+ Total 128 samples, distributed across 8 GPUs
47
+
48
+ /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/storage.py:414: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
49
+ return torch.load(io.BytesIO(b))
50
+
51
+ gsm8k Results:
52
+ Decoding speedup: 1.01x
53
+ Average Acceptance length: 1.99
54
+ Acceptance length histogram: ['0.0%', '3.6%', '94.8%', '1.3%', '0.3%', '0.1%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%']
55
+ Num responses: 128
56
+
57
+ ============================================================
58
+ Benchmark: humaneval (8 GPUs)
59
+ ============================================================
60
+ Total 164 samples, distributed across 8 GPUs
61
+
62
+
63
+ humaneval Results:
64
+ Decoding speedup: 0.96x
65
+ Average Acceptance length: 1.97
66
+ Acceptance length histogram: ['0.0%', '4.6%', '94.6%', '0.6%', '0.1%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%']
67
+ Num responses: 164
68
+
69
+ ============================================================
70
+ Benchmark: mt-bench (8 GPUs)
71
+ ============================================================
72
+ Total 80 samples, distributed across 8 GPUs
73
+
74
+
75
+ mt-bench Results:
76
+ Decoding speedup: 0.84x
77
+ Average Acceptance length: 1.94
78
+ Acceptance length histogram: ['0.0%', '6.7%', '92.6%', '0.4%', '0.2%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%']
79
+ Num responses: 160
80
+
81
+ Results saved to: /workspace/hanrui/syxin_old/Specforge/benchmarks/results/dflash_lora_inject_offline_epoch_3_step_4644_20260324_121731.json
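The saved JSON can be consumed downstream. A hypothetical reader (the per-benchmark keys match what the eval script writes; `best_bench` is an illustrative helper, and the numbers below are taken from the log above):

```python
# Hypothetical consumer of the results JSON saved above; per-benchmark entries
# use the keys written by the eval script ("decoding_speedup", etc.).
def best_bench(results):
    benches = {k: v for k, v in results.items() if isinstance(v, dict)}
    return max(benches, key=lambda k: benches[k]["decoding_speedup"])

results = {
    "model": "dflash-lora-inject/epoch_3_step_4644",
    "block_size": 16,
    "gsm8k": {"decoding_speedup": 1.01, "avg_accept_length": 1.99},
    "humaneval": {"decoding_speedup": 0.96, "avg_accept_length": 1.97},
    "mt-bench": {"decoding_speedup": 0.84, "avg_accept_length": 1.94},
}
```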
syxin_old/eval_run.log ADDED
The diff for this file is too large to render. See raw diff
 
syxin_old/launch_train.sh ADDED
@@ -0,0 +1,37 @@
1
+ #!/bin/bash
2
+ set -euo pipefail
3
+
4
+ cd /workspace/hanrui/syxin_old/Specforge
5
+
6
+ export TORCHINDUCTOR_CACHE_DIR=/workspace/hanrui/cache/compiled_kernels
7
+ export SPECFORGE_DATA_NUM_PROC=16
8
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
9
+ export PYTORCH_ALLOC_CONF=expandable_segments:True
10
+ export HF_DATASETS_CACHE=/workspace/hanrui/cache/hf_datasets
11
+ export HF_HOME=/workspace/hanrui/cache/hf_home
12
+
13
+ torchrun --nproc_per_node=8 \
14
+ scripts/train_dflash_lora_inject.py \
15
+ --target-model-path /workspace/models/Qwen3-8B \
16
+ --target-model-backend hf \
17
+ --train-data-path /workspace/hanrui/datasets/Nemotron-CodeAlpaca-qwen3-8b-800K \
18
+ --output-dir outputs/qwen3-8b-sft-32gpu-v2 \
19
+ --block-size 16 \
20
+ --attention-backend additive \
21
+ --attn-implementation sdpa \
22
+ --max-length 2048 \
23
+ --batch-size 4 \
24
+ --accumulation-steps 8 \
25
+ --num-epochs 3 \
26
+ --learning-rate 5e-5 \
27
+ --loss-decay-gamma 7 \
28
+ --gradient-checkpointing \
29
+ --chat-template qwen \
30
+ --log-interval 50 \
31
+ --save-interval 500 \
32
+ --cache-dir /workspace/hanrui/cache \
33
+ --lora-rank 32 \
34
+ --lora-alpha 64 \
35
+ --lora-dropout 0.1 \
36
+ --trust-remote-code \
37
+ --dataloader-num-workers 0
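The batching flags above imply an effective global batch size. A quick sanity check of the arithmetic (assuming one optimizer step per accumulation window and 8 data-parallel ranks from `--nproc_per_node=8`):

```python
# Effective global batch implied by the launch flags above:
# per-GPU batch 4, 8 accumulation steps, 8 GPUs.
per_gpu_batch = 4
accumulation_steps = 8
num_gpus = 8
effective_batch = per_gpu_batch * accumulation_steps * num_gpus
```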
syxin_old/launch_train_dflash_wrapper.py ADDED
@@ -0,0 +1,17 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Python wrapper to launch dflash training script via northjob/torchrun
4
+ """
5
+ import subprocess
6
+ import sys
7
+ import os
8
+
9
+ if __name__ == "__main__":
10
+ bash_script = "/workspace/hanrui/syxin_old/run_train_multinode_dflash.sh"
11
+ args = sys.argv[1:]
12
+
13
+ cmd = ["bash", bash_script] + args
14
+
15
+ result = subprocess.run(cmd, env=os.environ.copy())
16
+
17
+ sys.exit(result.returncode)
syxin_old/launch_train_random_anchor.py ADDED
@@ -0,0 +1,15 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Python wrapper to launch random anchor training script via northjob/torchrun
4
+ """
5
+ import subprocess
6
+ import sys
7
+ import os
8
+
9
+ if __name__ == "__main__":
10
+ bash_script = "/workspace/hanrui/syxin_old/run_train_multinode_random_anchor.sh"
11
+ args = sys.argv[1:]
12
+
13
+ cmd = ["bash", bash_script] + args
14
+ result = subprocess.run(cmd, env=os.environ.copy())
15
+ sys.exit(result.returncode)
syxin_old/launch_train_wrapper.py ADDED
@@ -0,0 +1,21 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Python wrapper to launch bash training script via torchrun
4
+ """
5
+ import subprocess
6
+ import sys
7
+ import os
8
+
9
+ if __name__ == "__main__":
10
+ # Get the bash script path and arguments
11
+ bash_script = "/workspace/hanrui/syxin_old/run_train_multinode.sh"
12
+ args = sys.argv[1:] # Pass through all arguments
13
+
14
+ # Build the command
15
+ cmd = ["bash", bash_script] + args
16
+
17
+ # Execute the bash script
18
+ result = subprocess.run(cmd, env=os.environ.copy())
19
+
20
+ # Exit with the same code as the bash script
21
+ sys.exit(result.returncode)
syxin_old/list.md ADDED
@@ -0,0 +1,12 @@
+ ### 1. `train_dflash_lora.py`
+ * Adds LoRA: instead of calling a separate small draft model, prediction now uses the target's hidden states plus LoRA adapters.
+ * The `dflash_lora_mask_fn` function lets every token in the predicted draft block attend to all other tokens in that block.
+
+ ### 2. OOM fixes
+ * Sharding strategy ZeRO-3: FSDP upgraded from `SHARD_GRAD_OP` to `FULL_SHARD`.
+ * `batch-size=1`, `accumulation-steps=8`.
+ * Adopted FlexAttention (`dflash_lora_mask_fn`), following the earlier code.
+ * `_chunked_lm_loss()`: the loss computation is split into 256-element chunks, with gradient checkpointing.
+
+ ### Run
+ * bash /workspace/hanrui/junquan/SpecForge/scripts/run_train_dflash_lora.sh 2
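The `_chunked_lm_loss()` idea mentioned above can be sketched in miniature. This is an illustrative numpy version (not the repo's implementation; gradient checkpointing and the lm_head module are replaced by a plain matmul): the logits and cross-entropy are computed chunk by chunk, so the full `[seq, vocab]` logits tensor never exists at once.

```python
import numpy as np

def chunked_ce(hidden, W, labels, chunk_size):
    # Compute logits = hidden @ W and the CE loss one chunk at a time,
    # so only [chunk_size, vocab] logits are live at any moment.
    total = 0.0
    for s in range(0, len(hidden), chunk_size):
        logits = hidden[s:s + chunk_size] @ W            # [chunk, vocab]
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -logp[np.arange(len(logp)), labels[s:s + chunk_size]].sum()
    return total / len(hidden)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 4))
W = rng.standard_normal((4, 10))
labels = rng.integers(0, 10, size=8)
loss_chunked = chunked_ce(hidden, W, labels, chunk_size=3)
loss_full = chunked_ce(hidden, W, labels, chunk_size=8)
```

Chunking changes the memory profile, not the math: the chunked loss matches the full-materialization loss up to floating-point summation order.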
syxin_old/merge_lora.py ADDED
@@ -0,0 +1,66 @@
1
+ """
2
+ Step 1: Merge DFlash-LoRA adapter into base model.
3
+ Usage:
4
+ conda activate sglang
5
+ python3 merge_lora.py
6
+ python3 merge_lora.py --ckpt epoch_2_step_15000 # test another checkpoint
7
+ """
8
+ import argparse
9
+ import os
10
+
11
+ import torch
12
+ from peft import PeftModel
13
+ from transformers import AutoModelForCausalLM, AutoTokenizer
14
+
15
+ BASE_MODEL = "/workspace/models/Qwen3-8B"
16
+ OUTPUT_ROOT = "/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora"
17
+ MERGE_ROOT = "/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-merged"
18
+
19
+ def parse_args():
20
+ p = argparse.ArgumentParser()
21
+ p.add_argument("--ckpt", default="epoch_3_step_18576",
22
+ help="Checkpoint folder name under OUTPUT_ROOT")
23
+ p.add_argument("--merged-path", default=MERGE_ROOT,
24
+ help="Where to save the merged model")
25
+ return p.parse_args()
26
+
27
+
28
+ def main():
29
+ args = parse_args()
30
+ adapter_path = os.path.join(OUTPUT_ROOT, args.ckpt)
31
+ merged_path = args.merged_path
32
+
33
+ if os.path.exists(merged_path):
34
+ print(f"[skip] Merged model already exists: {merged_path}")
35
+ return
36
+
37
+ assert os.path.isdir(adapter_path), f"Adapter not found: {adapter_path}"
38
+
39
+ print(f"Base model : {BASE_MODEL}")
40
+ print(f"Adapter : {adapter_path}")
41
+ print(f"Output : {merged_path}")
42
+ print()
43
+
44
+ print("[1/4] Loading base model to CPU ...")
45
+ model = AutoModelForCausalLM.from_pretrained(
46
+ BASE_MODEL,
47
+ dtype=torch.bfloat16,  # torch_dtype is deprecated in recent transformers
48
+ device_map="cpu",
49
+ )
50
+
51
+ print("[2/4] Loading LoRA adapter ...")
52
+ model = PeftModel.from_pretrained(model, adapter_path)
53
+
54
+ print("[3/4] Merging weights ...")
55
+ model = model.merge_and_unload()
56
+
57
+ print("[4/4] Saving merged model ...")
58
+ os.makedirs(merged_path, exist_ok=True)
59
+ model.save_pretrained(merged_path, safe_serialization=True)
60
+ AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(merged_path)
61
+
62
+ print(f"\nDone. Merged model saved to: {merged_path}")
63
+
64
+
65
+ if __name__ == "__main__":
66
+ main()
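The `merge_and_unload()` call above folds the low-rank update back into the base weight, `W_merged = W + (alpha / r) * B @ A`, after which the adapter can be dropped. A toy check of that identity (made-up shapes, not Qwen3-8B):

```python
import numpy as np

rank, alpha = 2, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))        # base weight
A = rng.standard_normal((rank, 3))     # LoRA down-projection
B = rng.standard_normal((3, rank))     # LoRA up-projection
W_merged = W + (alpha / rank) * (B @ A)

x = rng.standard_normal(3)
# Forward through merged weights equals the base + adapter path.
base_plus_adapter = W @ x + (alpha / rank) * (B @ (A @ x))
```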
syxin_old/oom_fix_progress.md ADDED
@@ -0,0 +1,42 @@
+ # DFlash LoRA OOM Fix Notes
+
+ ## Root-cause analysis
+
+ 1. **SHARD_GRAD_OP (ZeRO-2)** — every GPU holds the full Qwen3-8B parameters (~16 GB bf16); parameters are not sharded
+ 2. **SDPA + 4D additive mask** — FlashAttention does not support 4D additive masks, so attention falls back to the math backend, which materializes the full attention scores per layer (`bsz × 32 heads × 2048 × 2048`)
+ 3. **Large-vocab logits** — `[bsz, 2048, 151936]` in bf16 ≈ 1.18 GB; with gradients and boolean-indexing copies, peak usage is ~3-4 GB
+ 4. **The machine only has 2 H100s**, while the script defaults to `NUM_GPUS=4`
+
+ ## Completed changes
+
+ ### 1. FSDP sharding switched to FULL_SHARD (ZeRO-3)
+ - File: `SpecForge/scripts/train_dflash_lora.py:347`
+ - `ShardingStrategy.SHARD_GRAD_OP` → `ShardingStrategy.FULL_SHARD`
+ - Effect: parameters are sharded across GPUs, saving ~8-12 GB per GPU
+
+ ### 2. Lower batch-size, raise accumulation-steps
+ - File: `SpecForge/scripts/run_train_dflash_lora.sh`
+ - `--batch-size 2` → `1`, `--accumulation-steps 4` → `8`
+ - Effect: effective global batch size unchanged, peak memory halved
+
+ ## To verify / follow-up optimizations
+
+ - [ ] Run with `bash run_train_dflash_lora.sh 2` to make sure only 2 GPUs are used
+ - [x] If still OOM, use a chunked cross-entropy loss to avoid materializing the full large-vocab logits
+ - [x] Longer term, explore a custom attention kernel with block-sparse mask support to bypass the SDPA math fallback
+
+ ### 3. flex_attention + BlockMask replacing the 4D additive mask
+ - Files: `SpecForge/specforge/core/dflash_lora.py`, `specforge/modeling/draft/dflash_lora.py`, `scripts/train_dflash_lora.py`
+ - Ported the `_get_or_create_block_mask()` method from the non-LoRA `dflash.py` and adapted it to the LoRA setup (Q_LEN == KV_LEN == seq_len)
+ - LoRA-variant mask: context causal + block bidirectional (the non-LoRA variant uses [context, noise] concatenated KV)
+ - Enabled with `--attention-backend flex_attention` (the default); fall back to the original 4D mask with `--attention-backend additive`
+ - The HuggingFace model is loaded with `attn_implementation="flex_attention"`
+ - Effect: no more fallback to the SDPA math backend, saving the `[bsz, heads, seq, seq]` attention-score memory
+
+ ### 4. Chunked cross-entropy loss
+ - Files: `SpecForge/specforge/core/dflash_lora.py`, `specforge/modeling/draft/dflash_lora.py`, `scripts/train_dflash_lora.py`
+ - Ported the `_chunked_lm_loss()` method from the non-LoRA `dflash.py`
+ - Runs lm_head + CE loss chunk by chunk with gradient checkpointing, avoiding materializing the full `[bsz, seq, vocab]` logits
+ - Enabled with `--lm-head-chunk-size 256` (default 0 = disabled)
+ - `DFlashLoRADraftModel.forward()` gains an `output_hidden_states` parameter; chunked mode returns hidden states
+ - Effect: peak logits memory drops from O(seq_len × vocab_size) to O(chunk_size × vocab_size)
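The mask shape described in change 3 ("context causal + block bidirectional") can be written as a per-position predicate. A minimal illustrative version (the repo builds this as a flex_attention BlockMask; the function name and signature here are assumptions):

```python
def lora_block_mask(q_idx, kv_idx, context_len, block_size):
    # Context tokens attend causally; draft tokens additionally see every
    # token inside their own block (bidirectional within the block).
    causal = kv_idx <= q_idx
    in_draft = q_idx >= context_len and kv_idx >= context_len
    same_block = in_draft and (
        (q_idx - context_len) // block_size == (kv_idx - context_len) // block_size
    )
    return causal or same_block
```

For example, with `context_len=4` and `block_size=2`, draft position 4 can see position 5 (same block, bidirectional) but not position 6 (next block).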
syxin_old/random_anchor_plan.md ADDED
@@ -0,0 +1,82 @@
+ # Plan: Add Random Anchor to DFlash LoRA-Inject (syxin_old)
+
+ ## Context
+ LoRA-Inject reaches τ≈3.9 vs τ≈6.5 for the original DFlash. The code has no bug; the gap comes from training strategy: the original DFlash uses `--random-anchor --num-anchors 512` to randomly sample block start positions at every step, while LoRA-Inject only uses fixed block boundaries.
+
+ ## Files to change (all under syxin_old)
+
+ 1. `/workspace/hanrui/syxin_old/Specforge/specforge/core/dflash_lora_inject.py` — training wrapper
+ 2. `/workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/dflash_lora_inject.py` — draft model
+ 3. `/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py` — training-script arguments
+ 4. `/workspace/hanrui/syxin_old/run_train_dflash_lora_inject.sh` — launch script
+
+ ## Reference code (also under syxin_old)
+ - `/workspace/hanrui/syxin_old/Specforge/specforge/core/dflash_lora.py` lines 73-146: `_sample_anchor_positions`, `_build_blocks_from_anchors`
+ - `/workspace/hanrui/syxin_old/Specforge/specforge/core/dflash_lora.py` lines 219-235: `_build_additive_mask_random_anchor`
+ - `/workspace/hanrui/syxin_old/Specforge/specforge/core/dflash_lora.py` lines 384-408: `_compute_loss_weights_random_anchor`
+ - `/workspace/hanrui/syxin_old/Specforge/specforge/core/dflash_lora.py` lines 529-578: random-anchor forward path
+
+ ## Step 1: training wrapper `core/dflash_lora_inject.py`
+
+ Add 4 methods and modify forward:
+
+ **1a. `_sample_anchor_positions()`** — copy verbatim from dflash_lora.py, identical
+
+ **1b. `_build_blocks_from_anchors()`** — copy from dflash_lora.py, but **additionally** gather target_layer_hidden_states (a List[Tensor], one per layer):
+ ```python
+ block_target_hidden = []
+ for layer_hs in target_layer_hidden_states:
+     gathered = torch.gather(layer_hs, 1,
+         gather_idx.unsqueeze(-1).expand(-1, -1, layer_hs.size(-1)))
+     block_target_hidden.append(gathered)
+ ```
+ Returns one extra value, `block_target_hidden_states`
+
+ **1c. `_build_additive_mask_random_anchor()`** — copy verbatim from dflash_lora.py (lines 219-235), identical. This is the draft-to-draft mask (bidirectional within a block); it gets expanded into the extended mask inside the draft model's `_forward_with_injection`.
+
+ **1d. `_compute_loss_weights_random_anchor()`** — copy verbatim from dflash_lora.py (lines 384-408), identical
+
+ **1e. Modify `forward()`** — insert the random-anchor branch after the target model forward:
+ ```python
+ if self.random_anchor and self.training:
+     # 1. sample anchors
+     # 2. build blocks (input_ids, loss_mask, target hidden per layer)
+     # 3. prepare_noise_input(block_ids=block_ids)
+     # 4. build draft-draft mask
+     # 5. position_ids = gather_idx (original sequence positions!)
+     # 6. draft_model.forward(... block_ids=block_ids)  # new parameter
+     # 7. loss + accuracy
+     return loss, accuracy
+ ```
+
+ ## Step 2: draft model `modeling/draft/dflash_lora_inject.py`
+
+ **Modify `forward()` and `_forward_with_injection()`** to accept a `block_ids` parameter.
+
+ Problems with the current `_forward_with_injection` under random-anchor mode:
+ 1. **Position IDs** (lines 237-238): uses `[0..seq_len-1, 0..seq_len-1]`, but random anchor needs `[gather_idx, gather_idx]` (original sequence positions)
+ 2. **Extended mask** (lines 269-274): computes leakage prevention from a fixed `context_len` and `block_size`, but with random anchors the block boundaries are determined by `block_ids`
+
+ Changes:
+ - `forward()` gains a `block_ids=None` parameter, passed through to `_forward_with_injection`
+ - `_forward_with_injection` gains a `block_ids=None` parameter
+ - When `block_ids is not None`:
+   - positions use the caller-provided `position_ids`, building `extended_pos = cat([position_ids, position_ids])` (the gather_idx positions)
+   - draft-to-target part of the extended mask: a draft token in block k may only see target tokens of blocks < k (decided via block_ids: `block_ids[target_pos] < block_ids[draft_pos]`, per sample)
+   - draft-to-draft part: use the incoming `attention_mask` (the same-block bidirectional mask already built by the wrapper)
+
+ ## Step 3: training scripts
+
+ **3a. `scripts/train_dflash_lora_inject.py`**
+ - Add argparse flags `--random-anchor` (store_true) and `--num-anchors` (default=512)
+ - Line 204: `random_anchor=False` → `random_anchor=args.random_anchor`
+ - Line 205: `num_anchors=512` → `num_anchors=args.num_anchors`
+
+ **3b. `run_train_dflash_lora_inject.sh`**
+ - Add `--random-anchor --num-anchors 512`
+ - Suggest raising lr to 6e-4 and epochs to 6 (matching the original DFlash)
+
+ ## Verification
+ 1. Run a few training steps and confirm the loss decreases normally
+ 2. Compare against the dflash_lora.py random-anchor path to confirm the mask/loss/position logic matches
+ 3. After a full training run, re-run eval
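The anchor-sampling step in 1a can be sketched as follows. This is an illustrative helper, not the repo's `_sample_anchor_positions` (which may differ, e.g. in deduplication or how the loss mask interacts with anchors): block start positions are drawn uniformly over valid offsets instead of using fixed boundaries.

```python
import random

def sample_anchor_positions(seq_len, num_anchors, block_size, rng=None):
    # Each anchor is a valid block start: the block must fit inside the
    # sequence, so starts range over [0, seq_len - block_size].
    rng = rng or random.Random()
    max_start = seq_len - block_size
    return sorted(rng.randrange(max_start + 1) for _ in range(num_anchors))

anchors = sample_anchor_positions(2048, 64, 16, rng=random.Random(0))
```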
syxin_old/requirements.txt ADDED
File without changes
syxin_old/run_bench_dflash.sh ADDED
@@ -0,0 +1,71 @@
1
+ #!/bin/bash
2
+ # Evaluate DFlash-LoRA-Inject accepted length (offline, 8 GPUs parallel).
3
+ # No sglang server needed. Each GPU loads its own target+draft and processes a shard.
4
+ #
5
+ # Usage:
6
+ # bash run_bench_dflash.sh # 8 GPUs, all 3 benches
7
+ # bash run_bench_dflash.sh humaneval # only humaneval
8
+ # bash run_bench_dflash.sh mtbench gsm8k # pick any subset
9
+ # bash run_bench_dflash.sh --quick # quick test (20 samples)
10
+ # bash run_bench_dflash.sh --ckpt epoch_0_step_500 # specific checkpoint
11
+ # NUM_GPUS=4 bash run_bench_dflash.sh # use 4 GPUs
12
+
13
+ set -e
14
+
15
+ SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
16
+ PYTHON=/workspace/miniconda3/envs/spec/bin/python3
17
+ RESULT_DIR=/workspace/hanrui/syxin_old/Specforge/benchmarks/results
18
+ NUM_GPUS=${NUM_GPUS:-8}
19
+
20
+ # ---- parse args ----
21
+ BENCHMARKS=()
22
+ EXTRA_ARGS=()
23
+ QUICK=false
24
+
25
+ for arg in "$@"; do
26
+ case $arg in
27
+ humaneval|mtbench|gsm8k)
28
+ BENCHMARKS+=("$arg")
29
+ ;;
30
+ --quick)
31
+ QUICK=true
32
+ ;;
33
+ *)
34
+ EXTRA_ARGS+=("$arg")
35
+ ;;
36
+ esac
37
+ done
38
+
39
+ if [ ${#BENCHMARKS[@]} -eq 0 ]; then
40
+ BENCHMARKS=(humaneval mtbench gsm8k)
41
+ fi
42
+
43
+ if [ "$QUICK" = true ]; then
44
+ EXTRA_ARGS+=(--num-samples 20)
45
+ fi
46
+
47
+ TIMESTAMP=$(date +%Y%m%d_%H%M%S)
48
+
49
+ echo "============================================"
50
+ echo " DFlash-LoRA-Inject Offline Eval"
51
+ echo " GPUs : $NUM_GPUS"
52
+ echo " benchmarks : ${BENCHMARKS[*]}"
53
+ echo " extra args : ${EXTRA_ARGS[*]}"
54
+ echo " results : $RESULT_DIR"
55
+ echo "============================================"
56
+ echo ""
57
+
58
+ mkdir -p $RESULT_DIR
59
+
60
+ $PYTHON -m torch.distributed.run \
61
+ --standalone \
62
+ --nproc_per_node $NUM_GPUS \
63
+ $SCRIPT_DIR/eval_dflash_lora_inject.py \
64
+ --benchmarks ${BENCHMARKS[@]} \
65
+ --output-dir $RESULT_DIR \
66
+ "${EXTRA_ARGS[@]}" \
67
+ 2>&1 | tee $RESULT_DIR/bench_dflash_lora_inject_offline_${TIMESTAMP}.log
68
+
69
+ echo ""
70
+ echo "Done. Latest result files:"
71
+ ls -lht $RESULT_DIR/*.json 2>/dev/null | head -5
syxin_old/run_bench_dflash_b16_baseline.sh ADDED
@@ -0,0 +1,60 @@
1
+ #!/bin/bash
2
+ # DFlash-b16 baseline: measure accepted length offline, 8 GPUs parallel.
3
+ # Usage:
4
+ # bash run_bench_dflash_b16_baseline.sh # 8 GPUs, all 3 benches
5
+ # bash run_bench_dflash_b16_baseline.sh humaneval # only humaneval
6
+ # bash run_bench_dflash_b16_baseline.sh --quick # 20 samples per bench
7
+ # NUM_GPUS=4 bash run_bench_dflash_b16_baseline.sh # 4 GPUs
8
+
9
+ set -e
10
+
11
+ SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
12
+ PYTHON=/workspace/miniconda3/envs/spec/bin/python3
13
+ RESULT_DIR=/workspace/hanrui/syxin_old/Specforge/benchmarks/results
14
+ NUM_GPUS=${NUM_GPUS:-8}
15
+
16
+ BENCHMARKS=()
17
+ EXTRA_ARGS=()
18
+ QUICK=false
19
+
20
+ for arg in "$@"; do
21
+ case $arg in
22
+ humaneval|mtbench|gsm8k) BENCHMARKS+=("$arg") ;;
23
+ --quick) QUICK=true ;;
24
+ *) EXTRA_ARGS+=("$arg") ;;
25
+ esac
26
+ done
27
+
28
+ if [ ${#BENCHMARKS[@]} -eq 0 ]; then
29
+ BENCHMARKS=(humaneval mtbench gsm8k)
30
+ fi
31
+
32
+ if [ "$QUICK" = true ]; then
33
+ EXTRA_ARGS+=(--num-samples 20)
34
+ fi
35
+
36
+ TIMESTAMP=$(date +%Y%m%d_%H%M%S)
37
+
38
+ echo "============================================"
39
+ echo " DFlash-b16 Baseline Offline Eval"
40
+ echo " GPUs : $NUM_GPUS"
41
+ echo " draft : /workspace/models/Qwen3-8B-DFlash-b16"
42
+ echo " benchmarks : ${BENCHMARKS[*]}"
43
+ echo " extra args : ${EXTRA_ARGS[*]}"
44
+ echo "============================================"
45
+ echo ""
46
+
47
+ mkdir -p $RESULT_DIR
48
+
49
+ $PYTHON -m torch.distributed.run \
50
+ --standalone \
51
+ --nproc_per_node $NUM_GPUS \
52
+ $SCRIPT_DIR/eval_dflash_b16_baseline.py \
53
+ --benchmarks ${BENCHMARKS[@]} \
54
+ --output-dir $RESULT_DIR \
55
+ "${EXTRA_ARGS[@]}" \
56
+ 2>&1 | tee $RESULT_DIR/bench_dflash_b16_baseline_${TIMESTAMP}.log
57
+
58
+ echo ""
59
+ echo "Done. Latest result files:"
60
+ ls -lht $RESULT_DIR/*.json 2>/dev/null | head -5
syxin_old/run_bench_dflash_lora_inject.sh ADDED
@@ -0,0 +1,60 @@
1
+ #!/bin/bash
2
+ # DFlash-LoRA-Inject: measure accepted length offline, 8 GPUs parallel.
3
+ # Usage:
4
+ # bash run_bench_dflash_lora_inject.sh # 8 GPUs, all 3 benches
5
+ # bash run_bench_dflash_lora_inject.sh humaneval # only humaneval
6
+ # bash run_bench_dflash_lora_inject.sh --quick # 20 samples per bench
7
+ # NUM_GPUS=4 bash run_bench_dflash_lora_inject.sh # 4 GPUs
8
+
9
+ set -e
10
+
11
+ SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
12
+ PYTHON=/workspace/miniconda3/envs/dflash/bin/python3
13
+ RESULT_DIR=/workspace/hanrui/syxin_old/Specforge/benchmarks/results
14
+ NUM_GPUS=${NUM_GPUS:-8}
15
+
16
+ BENCHMARKS=()
17
+ EXTRA_ARGS=()
18
+ QUICK=false
19
+
20
+ for arg in "$@"; do
21
+ case $arg in
22
+ humaneval|mtbench|gsm8k) BENCHMARKS+=("$arg") ;;
23
+ --quick) QUICK=true ;;
24
+ *) EXTRA_ARGS+=("$arg") ;;
25
+ esac
26
+ done
27
+
28
+ if [ ${#BENCHMARKS[@]} -eq 0 ]; then
29
+ BENCHMARKS=(humaneval mtbench gsm8k)
30
+ fi
31
+
32
+ if [ "$QUICK" = true ]; then
33
+ EXTRA_ARGS+=(--num-samples 20)
34
+ fi
35
+
36
+ TIMESTAMP=$(date +%Y%m%d_%H%M%S)
37
+
38
+ echo "============================================"
39
+ echo " DFlash-LoRA-Inject Offline Eval"
40
+ echo " GPUs : $NUM_GPUS"
41
+ echo " draft : LoRA-inject (merged)"
42
+ echo " benchmarks : ${BENCHMARKS[*]}"
43
+ echo " extra args : ${EXTRA_ARGS[*]}"
44
+ echo "============================================"
45
+ echo ""
46
+
47
+ mkdir -p $RESULT_DIR
48
+
49
+ $PYTHON -m torch.distributed.run \
50
+ --standalone \
51
+ --nproc_per_node $NUM_GPUS \
52
+ $SCRIPT_DIR/eval_dflash_lora_inject.py \
53
+ --benchmarks ${BENCHMARKS[@]} \
54
+ --output-dir $RESULT_DIR \
55
+ "${EXTRA_ARGS[@]}" \
56
+ 2>&1 | tee $RESULT_DIR/bench_dflash_lora_inject_${TIMESTAMP}.log
57
+
58
+ echo ""
59
+ echo "Done. Latest result files:"
60
+ ls -lht $RESULT_DIR/*.json 2>/dev/null | head -5
syxin_old/run_qwen3_8b_sft_64gpu.sh ADDED
@@ -0,0 +1,31 @@
1
+ #!/bin/bash
2
+ export JOB_NAME='qwen3-32b-sft'
3
+ export GPU_NUMS=32
4
+ export TRAIN_SCRIPT='/workspace/hanrui/syxin_old/launch_train_dflash_wrapper.py'
5
+ export WORK_DIR='/workspace/hanrui/syxin_old/Specforge'
6
+
7
+ if [ $GPU_NUMS -lt 8 ]; then
8
+ export NNODES=1
9
+ export GPU_NUMS_PER_NODE=$GPU_NUMS
10
+ else
11
+ export NNODES=$((GPU_NUMS/8))
12
+ export GPU_NUMS_PER_NODE=8
13
+ fi
14
+
15
+ # use northjob from the spec environment
16
+ /workspace/miniconda3/envs/spec/bin/northjob \
17
+ create \
18
+ --job-type train \
19
+ --nproc-per-node $GPU_NUMS_PER_NODE \
20
+ --gpu-per-node $GPU_NUMS_PER_NODE \
21
+ --nnodes $NNODES \
22
+ --k8s-priority 3 \
23
+ --k8s-queue bg-agentic-coding \
24
+ --k8s-namespace bg-agentic-coding \
25
+ --k8s-pvc-name i-xinsiyang-y4zy0sik0a \
26
+ --k8s-pvc-mount-path /workspace \
27
+ --k8s-no-reclaim \
28
+ --k8s-images harbor.local.clusters/bp/megatron-bplm:25.03_fp8.ibgda.qwen3.next.fix_triton.fix_te.hf457.qwen3_vl \
29
+ --job-name $JOB_NAME \
30
+ --workspace $WORK_DIR \
31
+ $TRAIN_SCRIPT $GPU_NUMS_PER_NODE
syxin_old/run_train_dflash_lora_inject.sh ADDED
@@ -0,0 +1,73 @@
1
+ #!/bin/bash
2
+ set -euo pipefail
3
+
4
+ ROOT_DIR=/workspace/hanrui/syxin_old/Specforge
5
+ NUM_GPUS=8
6
+ OUTPUT_DIR=$ROOT_DIR/outputs/qwen3-8b-dflash-lora-inject-random-anchor
7
+ CACHE_DIR=/tmp/specforge_cache
8
+
9
+ # Parse arguments
10
+ if [[ $# -ge 1 ]]; then
11
+ NUM_GPUS=$1
12
+ shift
13
+ fi
14
+ if [[ $# -ge 1 && "${1:0:1}" != "-" ]]; then
15
+ OUTPUT_DIR=$1
16
+ shift
17
+ fi
18
+ EXTRA_ARGS=("$@")
19
+
20
+ # Environment variables
21
+ export TORCHINDUCTOR_CACHE_DIR=/tmp/specforge_cache/compiled_kernels
22
+ export SPECFORGE_DATA_NUM_PROC=16
23
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
24
+ export PYTORCH_ALLOC_CONF=expandable_segments:True
25
+ export PYTHONPATH="$ROOT_DIR:${PYTHONPATH:-}"
26
+ export HF_DATASETS_CACHE=/tmp/specforge_cache/hf_datasets
27
+ export HF_HOME=/tmp/specforge_cache/hf_home
28
+
29
+ # Python binary
30
+ DEFAULT_SPECFORGE_PY=/workspace/hanrui/specforge/bin/python3
31
+ if [[ -z "${PYTHON_BIN:-}" ]]; then
32
+ if [[ -x "$DEFAULT_SPECFORGE_PY" ]]; then
33
+ PYTHON_BIN="$DEFAULT_SPECFORGE_PY"
34
+ else
35
+ PYTHON_BIN=python3
36
+ fi
37
+ fi
38
+
39
+ cd $ROOT_DIR
40
+
41
+ $PYTHON_BIN -m torch.distributed.run \
42
+ --standalone \
43
+ --nproc_per_node $NUM_GPUS \
44
+ scripts/train_dflash_lora_inject.py \
45
+ --target-model-path /workspace/models/Qwen3-8B \
46
+ --target-model-backend hf \
47
+ --train-data-path /workspace/hanrui/datasets/Nemotron-CodeAlpaca-qwen3-8b-800K \
48
+ --output-dir $OUTPUT_DIR \
49
+ --block-size 16 \
50
+ --attention-backend additive \
51
+ --attn-implementation sdpa \
52
+ --random-anchor \
53
+ --num-anchors 64 \
54
+ --max-length 2048 \
55
+ --batch-size 1 \
56
+ --accumulation-steps 64 \
57
+ --num-epochs 6 \
58
+ --learning-rate 6e-4 \
59
+ --loss-decay-gamma 7 \
60
+ --gradient-checkpointing \
61
+ --chat-template qwen \
62
+ --log-interval 50 \
63
+ --save-interval 500 \
64
+ --cache-dir $CACHE_DIR \
65
+ --lora-rank 32 \
66
+ --lora-alpha 64 \
67
+ --lora-dropout 0.1 \
68
+ --trust-remote-code \
69
+ --dataloader-num-workers 0 \
70
+ --early-stop \
71
+ --early-stop-patience 5 \
72
+ --early-stop-min-delta 0.005 \
73
+ "${EXTRA_ARGS[@]}"
syxin_old/run_train_multinode.sh ADDED
@@ -0,0 +1,67 @@
+ #!/bin/bash
+ set -euo pipefail
+
+ ROOT_DIR=/workspace/hanrui/syxin_old/Specforge
+ NUM_GPUS=8
+ OUTPUT_DIR=$ROOT_DIR/outputs/qwen3-8b-dflash-lora-inject
+ CACHE_DIR=/tmp/specforge_cache
+
+ # Parse arguments
+ if [[ $# -ge 1 ]]; then
+ NUM_GPUS=$1
+ shift
+ fi
+ if [[ $# -ge 1 && "${1:0:1}" != "-" ]]; then
+ OUTPUT_DIR=$1
+ shift
+ fi
+ EXTRA_ARGS=("$@")
+
+ # Environment variables
+ export TORCHINDUCTOR_CACHE_DIR=/tmp/specforge_cache/compiled_kernels
+ export SPECFORGE_DATA_NUM_PROC=16
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ export PYTORCH_ALLOC_CONF=expandable_segments:True
+ export PYTHONPATH="$ROOT_DIR:${PYTHONPATH:-}"
+ export HF_DATASETS_CACHE=/tmp/specforge_cache/hf_datasets
+ export HF_HOME=/tmp/specforge_cache/hf_home
+
+ # Python binary
+ DEFAULT_SPECFORGE_PY=/workspace/miniconda3/envs/spec/bin/python3
+ if [[ -z "${PYTHON_BIN:-}" ]]; then
+ if [[ -x "$DEFAULT_SPECFORGE_PY" ]]; then
+ PYTHON_BIN="$DEFAULT_SPECFORGE_PY"
+ else
+ PYTHON_BIN=python3
+ fi
+ fi
+
+ cd $ROOT_DIR
+
+ # northjob has already set the distributed environment variables via torchrun,
+ # so run the training script directly; do not launch torch.distributed.run again.
+ $PYTHON_BIN scripts/train_dflash_lora_inject.py \
+ --target-model-path /workspace/models/Qwen3-8B \
+ --target-model-backend hf \
+ --train-data-path /workspace/hanrui/datasets/Nemotron-CodeAlpaca-qwen3-8b-800K \
+ --output-dir $OUTPUT_DIR \
+ --block-size 16 \
+ --attention-backend additive \
+ --attn-implementation sdpa \
+ --max-length 2048 \
+ --batch-size 8 \
+ --accumulation-steps 8 \
+ --num-epochs 3 \
+ --learning-rate 5e-5 \
+ --loss-decay-gamma 7 \
+ --gradient-checkpointing \
+ --chat-template qwen \
+ --log-interval 50 \
+ --save-interval 500 \
+ --cache-dir $CACHE_DIR \
+ --lora-rank 32 \
+ --lora-alpha 64 \
+ --lora-dropout 0.1 \
+ --trust-remote-code \
+ --dataloader-num-workers 0 \
+ "${EXTRA_ARGS[@]}"
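The comment above relies on the launcher (northjob, via torchrun) having already exported the distributed environment. A hedged preflight sketch that checks for the variables torchrun normally sets; the helper name is an assumption and the check is not part of the script:

```shell
# Verify the distributed variables that torchrun normally exports are
# present before invoking the training script directly (i.e. without
# torch.distributed.run). Uses bash indirect expansion (${!v}).
check_dist_env() {
  local v
  for v in RANK LOCAL_RANK WORLD_SIZE MASTER_ADDR MASTER_PORT; do
    if [[ -z "${!v:-}" ]]; then
      echo "missing: $v" >&2
      return 1
    fi
  done
  echo "distributed env OK"
}
```

Calling `check_dist_env || exit 1` right before the `$PYTHON_BIN` line would fail fast with a clear message instead of an opaque NCCL/init error when the script is run outside the launcher.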
syxin_old/run_train_multinode_random_anchor.sh ADDED
@@ -0,0 +1,72 @@
+ #!/bin/bash
+ set -euo pipefail
+
+ ROOT_DIR=/workspace/hanrui/syxin_old/Specforge
+ NUM_GPUS=8
+ OUTPUT_DIR=$ROOT_DIR/outputs/qwen3-8b-dflash-lora-inject-random-anchor
+ CACHE_DIR=/tmp/specforge_cache
+
+ # Parse arguments
+ if [[ $# -ge 1 ]]; then
+ NUM_GPUS=$1
+ shift
+ fi
+ if [[ $# -ge 1 && "${1:0:1}" != "-" ]]; then
+ OUTPUT_DIR=$1
+ shift
+ fi
+ EXTRA_ARGS=("$@")
+
+ # Environment variables
+ export TORCHINDUCTOR_CACHE_DIR=/tmp/specforge_cache/compiled_kernels
+ export SPECFORGE_DATA_NUM_PROC=16
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ export PYTORCH_ALLOC_CONF=expandable_segments:True
+ export PYTHONPATH="$ROOT_DIR:${PYTHONPATH:-}"
+ export HF_DATASETS_CACHE=/tmp/specforge_cache/hf_datasets
+ export HF_HOME=/tmp/specforge_cache/hf_home
+
+ # Python binary
+ DEFAULT_SPECFORGE_PY=/workspace/miniconda3/envs/spec/bin/python3
+ if [[ -z "${PYTHON_BIN:-}" ]]; then
+ if [[ -x "$DEFAULT_SPECFORGE_PY" ]]; then
+ PYTHON_BIN="$DEFAULT_SPECFORGE_PY"
+ else
+ PYTHON_BIN=python3
+ fi
+ fi
+
+ cd $ROOT_DIR
+
+ # northjob has already set the distributed environment variables via torchrun,
+ # so run the training script directly; do not launch torch.distributed.run again.
+ $PYTHON_BIN scripts/train_dflash_lora_inject.py \
+ --target-model-path /workspace/models/Qwen3-8B \
+ --target-model-backend hf \
+ --train-data-path /workspace/hanrui/datasets/Nemotron-CodeAlpaca-qwen3-8b-800K \
+ --output-dir $OUTPUT_DIR \
+ --block-size 16 \
+ --attention-backend additive \
+ --attn-implementation sdpa \
+ --random-anchor \
+ --num-anchors 64 \
+ --max-length 2048 \
+ --batch-size 1 \
+ --accumulation-steps 64 \
+ --num-epochs 6 \
+ --learning-rate 6e-4 \
+ --loss-decay-gamma 7 \
+ --gradient-checkpointing \
+ --chat-template qwen \
+ --log-interval 50 \
+ --save-interval 500 \
+ --cache-dir $CACHE_DIR \
+ --lora-rank 32 \
+ --lora-alpha 64 \
+ --lora-dropout 0.1 \
+ --trust-remote-code \
+ --dataloader-num-workers 0 \
+ --early-stop \
+ --early-stop-patience 5 \
+ --early-stop-min-delta 0.005 \
+ "${EXTRA_ARGS[@]}"
syxin_old/start_server.sh ADDED
@@ -0,0 +1,42 @@
+ #!/bin/bash
+ # Step 2: Launch SGLang server with STANDALONE speculative decoding.
+ # Usage:
+ # bash start_server.sh
+ # bash start_server.sh 8 # use tp=8
+
+ set -e
+
+ TP=${1:-2}
+
+ BASE_MODEL=/workspace/models/Qwen3-8B
+ MERGED=/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-merged
+ INTRANET_IP=10.1.1.131
+ PORT=30000
+
+ if [ ! -d "$MERGED" ]; then
+ echo "[ERROR] Merged model not found: $MERGED"
+ echo " Run: conda activate sglang && python3 merge_lora.py"
+ exit 1
+ fi
+
+ echo "============================================"
+ echo " SGLang STANDALONE Speculative Decoding"
+ echo " target : $BASE_MODEL"
+ echo " draft : $MERGED"
+ echo " host : $INTRANET_IP:$PORT"
+ echo " tp : $TP"
+ echo "============================================"
+
+ /workspace/miniconda3/envs/sglang/bin/python3 -m sglang.launch_server \
+ --model-path $BASE_MODEL \
+ --speculative-algorithm STANDALONE \
+ --speculative-draft-model-path $MERGED \
+ --speculative-num-steps 4 \
+ --speculative-eagle-topk 1 \
+ --speculative-num-draft-tokens 4 \
+ --tp-size $TP \
+ --mem-fraction-static 0.30 \
+ --trust-remote-code \
+ --host $INTRANET_IP \
+ --port $PORT \
+ --dtype bfloat16
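The server comes up asynchronously, so clients that connect immediately after the script returns may hit connection refused. A minimal sketch of a readiness wait using bash's built-in `/dev/tcp` redirection; the helper name and timeout default are assumptions, not part of the script:

```shell
# Poll until host:port accepts a TCP connection, or give up after
# $timeout seconds. Relies on bash's /dev/tcp pseudo-device; the
# subshell keeps the probe fd from leaking into the caller.
wait_for_port() {
  local host=$1 port=$2 timeout=${3:-60} i
  for ((i = 0; i < timeout; i++)); do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

Usage against the values in the script: `wait_for_port "$INTRANET_IP" "$PORT" 120 && echo "server ready"`.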
syxin_old/start_server_dflash.sh ADDED
@@ -0,0 +1,54 @@
+ #!/bin/bash
+ # Evaluate DFlash-LoRA-Inject: measure accepted length OFFLINE.
+ # 8 GPUs parallel by default, each GPU runs a shard of prompts independently.
+ #
+ # WHY offline?
+ # sglang STANDALONE treats draft as an independent autoregressive model,
+ # completely ignoring the layer-by-layer injection that LoRA-Inject was
+ # trained with. Result: accept_length ≈ 4.7 for ALL models (no signal).
+ #
+ # sglang DFLASH expects the DFlash-b16 architecture (5-layer, fc+hidden_norm),
+ # which is structurally different from LoRA-Inject (full 36-layer + LoRA).
+ #
+ # So we run offline spec-generate with the correct injection pattern.
+ #
+ # Usage:
+ # bash start_server_dflash.sh # 8 GPUs, all benchmarks
+ # bash start_server_dflash.sh 4 # 4 GPUs
+ # bash start_server_dflash.sh 8 humaneval # specific benchmark
+ # bash start_server_dflash.sh 8 --num-samples 20 # quick test
+
+ set -e
+
+ SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
+
+ NUM_GPUS=${1:-8}
+ shift 2>/dev/null || true
+
+ # ---- defaults ----
+ BASE_MODEL=/workspace/models/Qwen3-8B
+ ADAPTER_ROOT=/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject
+ CKPT=epoch_3_step_1400
+ MERGED=/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged
+ RESULT_DIR=/workspace/hanrui/syxin_old/Specforge/benchmarks/results
+ PYTHON=/workspace/miniconda3/envs/spec/bin/python3
+
+ echo "============================================"
+ echo " DFlash-LoRA-Inject Offline Evaluation"
+ echo " target : $BASE_MODEL"
+ echo " ckpt : $CKPT"
+ echo " merged : $MERGED"
+ echo " GPUs : $NUM_GPUS"
+ echo "============================================"
+
+ $PYTHON -m torch.distributed.run \
+ --standalone \
+ --nproc_per_node $NUM_GPUS \
+ $SCRIPT_DIR/eval_dflash_lora_inject.py \
+ --base-model $BASE_MODEL \
+ --adapter-root $ADAPTER_ROOT \
+ --ckpt $CKPT \
+ --merged-path $MERGED \
+ --block-size 16 \
+ --output-dir $RESULT_DIR \
+ "$@"
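`CKPT` above is hard-coded to one checkpoint. Assuming checkpoints land under `$ADAPTER_ROOT` as `epoch_<e>_step_<s>` directories, a naming scheme inferred from the default `CKPT=epoch_3_step_1400` and the training script's `--save-interval`, not confirmed by this diff, a small helper can list the candidates in training order:

```shell
# List checkpoint directory names under $1, sorted numerically by epoch
# then step. The epoch_<e>_step_<s> naming is an assumption inferred
# from the default CKPT value above.
list_ckpts() {
  ls -d "$1"/epoch_*_step_* 2>/dev/null | xargs -r -n1 basename \
    | sort -t_ -k2,2n -k4,4n
}
```

For example, `CKPT=$(list_ckpts "$ADAPTER_ROOT" | tail -1)` would select the most recent checkpoint instead of a pinned one.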