Update AGENTS.md: document how hf_jobs script parameter actually works (converts to raw Hub URL)
AGENTS.md
CHANGED
|
@@ -176,26 +176,54 @@ snapshot_download(
|
|
| 176 |
# snapshot_download auto-uses HF_TOKEN from environment
|
| 177 |
```
|
| 178 |
|
| 179 |
-
### Script Submission Pattern
|
| 180 |
|
| 181 |
```python
|
| 182 |
-
|
| 183 |
write(path="/app/train.py", content="...")
|
| 184 |
|
| 185 |
-
# Step 2:
|
| 186 |
hf_jobs(
|
| 187 |
operation="run",
|
| 188 |
-
script="/app/train.py",
|
| 189 |
dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
|
| 190 |
"numpy", "huggingface_hub", "pygame", "omegaconf",
|
| 191 |
"mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
|
| 192 |
hardware_flavor="a10g-small",
|
| 193 |
timeout="6h",
|
| 194 |
-
namespace="E-Rong"
|
| 195 |
)
|
| 196 |
```
|
| 197 |
|
| 198 |
-
|
| 199 |
|
| 200 |
### Job Persistence
|
| 201 |
- Jobs run on HF infrastructure, not in your sandbox
|
|
@@ -216,7 +244,7 @@ The `script` parameter is a **sandbox file path** that gets uploaded to the job
|
|
| 216 |
| `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
|
| 217 |
| `phase2_final.zip` | Phase 2 complete model (when done) |
|
| 218 |
| `ae_manager.py` | Inference code for the evaluation server |
|
| 219 |
-
| `
|
| 220 |
| `smoke_test.py` | 5-minute validation job — test before any real job |
|
| 221 |
| `train_all_phases.py` | Original training script |
|
| 222 |
|
|
|
|
| 176 |
# snapshot_download auto-uses HF_TOKEN from environment
|
| 177 |
```
|
| 178 |
|
| 179 |
+
### Script Submission Pattern (What Actually Works)
|
| 180 |
+
|
| 181 |
+
**⚠️ CRITICAL DISCOVERY: The `script` parameter in `hf_jobs` becomes a RAW HUB URL.**
|
| 182 |
+
|
| 183 |
+
When you call `hf_jobs(script="/app/train.py")`, the job system does NOT upload the local file. Instead, it converts the path to a raw Hub URL:
|
| 184 |
+
```
|
| 185 |
+
https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py
|
| 186 |
+
```
|
| 187 |
+
and runs it via `uv run <url>`. **This means the file MUST already exist on the Hub repo.**
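The mapping described above can be sketched in a few lines. This is only an illustration of the observed behaviour, not the job system's actual code: the helper name `raw_hub_url`, the use of the file's basename, and the default `main` revision are all assumptions inferred from the single example URL shown here.

```python
from pathlib import PurePosixPath

HUB_BASE = "https://huggingface.co"


def raw_hub_url(repo_id: str, sandbox_path: str, revision: str = "main") -> str:
    """Guess the raw Hub URL that a sandbox script path would resolve to.

    Hypothetical helper: assumes only the file name of the sandbox path is
    kept and that the file sits at the root of the repo on `revision`.
    """
    filename = PurePosixPath(sandbox_path).name
    if not filename:
        raise ValueError(f"no file name in sandbox path: {sandbox_path!r}")
    return f"{HUB_BASE}/{repo_id}/raw/{revision}/{filename}"


# "/app/train.py" maps to the URL shown in the example above.
print(raw_hub_url("E-Rong/til-26-ae-agent", "/app/train.py"))
```

Printing the derived URL before submitting makes it easy to confirm that the file you uploaded and the file the job will fetch are the same one.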
|
| 188 |
+
|
| 189 |
+
**The correct workflow is:**
|
| 190 |
|
| 191 |
```python
|
| 192 |
+
from tools import write, hf_repo_files, hf_jobs
|
| 193 |
+
|
| 194 |
+
# Step 1: Write script to sandbox file
|
| 195 |
write(path="/app/train.py", content="...")
|
| 196 |
|
| 197 |
+
# Step 2: ALSO upload to Hub repo so it's persisted and URL-accessible
|
| 198 |
+
hf_repo_files(
|
| 199 |
+
operation="upload",
|
| 200 |
+
repo_id="E-Rong/til-26-ae-agent",
|
| 201 |
+
path="train.py",
|
| 202 |
+
content=open("/app/train.py").read()
|
| 203 |
+
)
|
| 204 |
+
|
| 205 |
+
# Step 3: Submit job referencing the sandbox path
|
| 206 |
+
# The job system will convert this to a Hub raw URL under the hood
|
| 207 |
hf_jobs(
|
| 208 |
operation="run",
|
| 209 |
+
script="/app/train.py", # ← sandbox file path
|
| 210 |
dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
|
| 211 |
"numpy", "huggingface_hub", "pygame", "omegaconf",
|
| 212 |
"mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
|
| 213 |
hardware_flavor="a10g-small",
|
| 214 |
timeout="6h",
|
| 215 |
+
namespace="E-Rong" # ← bills to org
|
| 216 |
)
|
| 217 |
```
|
| 218 |
|
| 219 |
+
**Verification from `hf_jobs inspect`:**
|
| 220 |
+
```bash
|
| 221 |
+
exec uv run --with torch --with sb3-contrib ... \
|
| 222 |
+
https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/phase2_resume.py
|
| 223 |
+
```
|
| 224 |
+
The job fetches the script from the Hub, not from the sandbox. The sandbox path is used only to derive the repo and file path in the URL; the local file's contents are never read by the job.
|
| 225 |
+
|
| 226 |
+
**Why this matters**: If you only write to `/app/train.py` and don't upload it to the Hub, the job will fail with a 404 when it tries to fetch the URL. The sandbox is reset, but a file uploaded to the Hub persists, so its URL stays valid.
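One way to catch the 404 before spending a job on it is a quick pre-flight request against the raw URL. The sketch below is an assumption-laden illustration using only the standard library: `script_on_hub` is a hypothetical helper, it assumes the raw endpoint answers `HEAD` requests, and it assumes a public repo (a private repo would need an auth header and would otherwise surface as a different HTTP error, which is re-raised here).

```python
import urllib.error
import urllib.request


def script_on_hub(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> bool:
    """Return True if `url` answers with a 2xx status, False on a 404.

    Hypothetical pre-flight check; `opener` is injectable so the function
    can be exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # script was never uploaded, or the path is wrong
        raise  # auth or server errors need a human look
```

Calling `script_on_hub(...)` with the raw URL right before `hf_jobs(operation="run", ...)` turns a failed job into an immediate, cheap error.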
|
| 227 |
|
| 228 |
### Job Persistence
|
| 229 |
- Jobs run on HF infrastructure, not in your sandbox
|
|
|
|
| 244 |
| `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
|
| 245 |
| `phase2_final.zip` | Phase 2 complete model (when done) |
|
| 246 |
| `ae_manager.py` | Inference code for the evaluation server |
|
| 247 |
+
| `phase2_resume.py` | Latest HF Job script (works; uses `snapshot_download`) |
|
| 248 |
| `smoke_test.py` | 5-minute validation job — test before any real job |
|
| 249 |
| `train_all_phases.py` | Original training script |
|
| 250 |
|