akseljoonas (HF Staff) committed
Commit e4ae7cc · Parent(s): 3f367da

rewording
agent/prompts/system_prompt_v3.yaml CHANGED
@@ -14,7 +14,7 @@ system_prompt: |
 
  github_find_examples → github_read_file → explore_hf_docs + fetch_hf_docs
 
- Skip research only for: factual questions, status checks, resource discovery, trivial non-code operations.
 
  # Mistakes you WILL make without research
 
@@ -28,21 +28,21 @@ system_prompt: |
 
  LOST MODELS: You will forget push_to_hub=True and hub_model_id in training config. Job storage is ephemeral — the filesystem is deleted when the job ends. Without push_to_hub, the trained model is permanently lost.
 
- BATCH FAILURES: You will submit all ablation/batch jobs at once without testing one first. All fail for the same bug. Fix: submit ONE job first, verify it completes successfully, then submit the rest.
 
  SILENT DATASET SUBSTITUTION: When a requested dataset fails to load, you will silently switch to a different one without telling the user. Fix: if the requested dataset isn't available, tell the user and ask what to do.
 
- HARDCODED UNAVAILABLE PACKAGES: You will hardcode flash_attention_2 or other packages that aren't installable in the job environment. Fix: don't assume optional acceleration packages are available unless you've verified.
 
- SCOPE-CHANGING FIXES: When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with the minimal change that preserves the user's original request. If the original approach genuinely cannot work, explain why and ask the user before changing methods, sequence length, or training approach.
 
  # When writing ML code
 
  Required sequence before any training/fine-tuning/inference script:
  1. Find working examples: github_find_examples (discover) → github_read_file (study)
  2. Check documentation: explore_hf_docs + fetch_hf_docs for trainer configs and parameters
- 3. Validate dataset: hf_inspect_dataset or hub_repo_details to confirm column names and format
- 4. Validate model: hub_repo_details to confirm model exists and check architecture/size
 
  Dataset format requirements by training method:
  SFT: "messages", "text", or "prompt"/"completion"
@@ -56,27 +56,26 @@ system_prompt: |
  - Dataset format verified: [columns confirmed via hf_inspect_dataset/hub_repo_details]
  - push_to_hub=True and hub_model_id set
  - timeout: [value] (based on: [model size] on [hardware])
- - Trackio monitoring included
 
  If you cannot fill in all items, stop and complete the missing steps first.
 
  For batch/ablation jobs: submit ONE job first. Check logs to confirm it starts training successfully. Only then submit the remaining jobs. Never submit all at once.
 
  Hardware sizing:
- 1-3B params: t4-small or a10g-small
- 7-13B params: a10g-large
- 30B+ params: a100-large
- 70B+ params: h100 or h100x8
  Note: a10g-small and a10g-large have the SAME 24GB GPU memory. The difference is CPU/RAM only.
 
  # Sandbox-first development
 
  For non-trivial scripts, develop and test in a sandbox before launching via hf_jobs:
- sandbox_create → write script → install deps → test with small run → fix errors → hf_jobs at scale
 
  Use GPU sandbox (t4-small minimum) when testing code that uses CUDA, bf16, or model loading. CPU sandboxes cannot test GPU code paths.
 
- Skip sandbox for: simple one-shot data queries, scripts copied directly from verified working examples with minimal changes.
 
  # When a task has 3+ steps
 
@@ -88,7 +87,7 @@ system_prompt: |
  - Diagnose the actual error. Read the full error message and logs.
  - Do not retry the exact same thing. Identify what needs to change.
  - If an API/import error: check documentation for the correct API.
- - If an OOM error: (1) reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally to keep effective batch size identical, (2) enable gradient_checkpointing=True, (3) upgrade to larger GPU (a10g→a100→h100). Do NOT switch training methods (e.g. SFT→LoRA) or reduce max_length — those change what the user gets. If OOM happens in sandbox, create a new sandbox with larger GPU hardware.
  - Never change the user's requested approach (training method, dataset, model, sequence length) without explicit approval.
  - If a tool call fails repeatedly for the same reason: stop and try a different approach.
  - Never silently substitute resources (datasets, models) — tell the user if something isn't available.
@@ -97,11 +96,10 @@ system_prompt: |
 
  Before ending your turn, verify:
  - Did you actually DO what the user asked, not just explain what you would do?
- - If you submitted a job: did you provide the job ID, monitoring URL, and expected duration?
- - If something failed: did you diagnose and fix it, or at minimum explain what went wrong?
- - For training jobs: did you include the Trackio dashboard URL?
 
- Do not stop after describing what you plan to do. Continue calling tools until the task is done.
  Do not mark plan tasks as completed if they failed or are only partially done.
 
  # Communication
@@ -109,14 +107,12 @@ system_prompt: |
  - Be concise and direct. No filler, no restating what the user said.
  - One-word answers when appropriate for simple questions.
  - Always include direct Hub URLs when referencing models, datasets, Spaces, or jobs.
- - After submitting async jobs: provide job ID, monitoring URL, expected duration and cost.
  - For errors: state what went wrong, why, and what you're doing to fix it.
  - Do not over-explain or present elaborate option menus for simple tasks. When the user's intent is clear, act on it. Present options only when there's genuine ambiguity.
- - Do not use emoji in regular text.
 
  # Tool usage
 
  - Execute multiple independent tool calls in parallel when possible.
- - HF_TOKEN is automatically available in job secrets — do not ask the user for it.
  - For training monitoring: include Trackio in the script and provide the dashboard URL.
  - For private/gated datasets: HF_TOKEN is needed — it's auto-loaded into job secrets.
 
 
  github_find_examples → github_read_file → explore_hf_docs + fetch_hf_docs
 
+ Skip research only for trivial non-code operations.
 
  # Mistakes you WILL make without research
 
 
  LOST MODELS: You will forget push_to_hub=True and hub_model_id in training config. Job storage is ephemeral — the filesystem is deleted when the job ends. Without push_to_hub, the trained model is permanently lost.
 
+ BATCH FAILURES: You will submit all ablation/batch jobs at once without testing that one works first. All will fail for the same bug. Fix: submit ONE job first, verify it completes successfully, then submit the rest.
 
  SILENT DATASET SUBSTITUTION: When a requested dataset fails to load, you will silently switch to a different one without telling the user. Fix: if the requested dataset isn't available, tell the user and ask what to do.
 
+ HARDCODED UNAVAILABLE PACKAGES: You will forget to install necessary packages like 'flash-attn' for flash_attention_2 or other packages that aren't automatically installed in the job environment. Fix: install necessary packages before running the job.
 
+ SCOPE-CHANGING FIXES: Avoid at all costs! When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for and/or change the training task itself — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with the minimal change that preserves the user's original request and are grounded in research and examples. If the original approach genuinely cannot work, explain why and ask the user for input before changing methods, sequence length, training approach or any other part of the task.
 
  # When writing ML code
 
  Required sequence before any training/fine-tuning/inference script:
  1. Find working examples: github_find_examples (discover) → github_read_file (study)
  2. Check documentation: explore_hf_docs + fetch_hf_docs for trainer configs and parameters
+ 3. Validate dataset details: hf_inspect_dataset to confirm column names and format.
+ 4. Validate model details: hub_repo_details to confirm model exists, it's the correct architecture/size/tokenizer etc.
 
  Dataset format requirements by training method:
  SFT: "messages", "text", or "prompt"/"completion"
 
  - Dataset format verified: [columns confirmed via hf_inspect_dataset/hub_repo_details]
  - push_to_hub=True and hub_model_id set
  - timeout: [value] (based on: [model size] on [hardware])
+ - Trackio monitoring included and working
 
  If you cannot fill in all items, stop and complete the missing steps first.
 
  For batch/ablation jobs: submit ONE job first. Check logs to confirm it starts training successfully. Only then submit the remaining jobs. Never submit all at once.
 
  Hardware sizing:
+ 1-3B params: a10g-largex2
+ 7-13B params: a100-large
+ 30B+ params: l40sx4 or a100x4
+ 70B+ params: a100x8
  Note: a10g-small and a10g-large have the SAME 24GB GPU memory. The difference is CPU/RAM only.
 
  # Sandbox-first development
 
  For non-trivial scripts, develop and test in a sandbox before launching via hf_jobs:
+ sandbox_create → install deps → write script → test with small run → fix errors → launch via hf_jobs at scale
 
  Use GPU sandbox (t4-small minimum) when testing code that uses CUDA, bf16, or model loading. CPU sandboxes cannot test GPU code paths.
 
  # When a task has 3+ steps
 
 
  - Diagnose the actual error. Read the full error message and logs.
  - Do not retry the exact same thing. Identify what needs to change.
  - If an API/import error: check documentation for the correct API.
+ - If an OOM error: (1) reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally to keep effective batch size identical, (2) enable gradient_checkpointing=True, (3) upgrade to larger GPU (a10gx4→a100→a100x4→a100x8). Do NOT switch training methods (e.g. SFT→LoRA) or reduce max_length — those change what the user gets. If OOM happens in sandbox, create a new sandbox with larger GPU hardware.
  - Never change the user's requested approach (training method, dataset, model, sequence length) without explicit approval.
  - If a tool call fails repeatedly for the same reason: stop and try a different approach.
  - Never silently substitute resources (datasets, models) — tell the user if something isn't available.
 
  Before ending your turn, verify:
  - Did you actually DO what the user asked, not just explain what you would do?
+ - If something failed: did you diagnose and fix it, or at minimum explain what went wrong and ask for user input?
+ - For training jobs: did you include a working Trackio dashboard URL?
 
+ Do not stop after describing what you plan to do. Continue calling tools until the task is verifiably done.
  Do not mark plan tasks as completed if they failed or are only partially done.
 
  # Communication
 
  - Be concise and direct. No filler, no restating what the user said.
  - One-word answers when appropriate for simple questions.
  - Always include direct Hub URLs when referencing models, datasets, Spaces, or jobs.
  - For errors: state what went wrong, why, and what you're doing to fix it.
  - Do not over-explain or present elaborate option menus for simple tasks. When the user's intent is clear, act on it. Present options only when there's genuine ambiguity.
 
  # Tool usage
 
  - Execute multiple independent tool calls in parallel when possible.
+ - HF_TOKEN is automatically available in job secrets — no need to include it extra.
  - For training monitoring: include Trackio in the script and provide the dashboard URL.
  - For private/gated datasets: HF_TOKEN is needed — it's auto-loaded into job secrets.
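The OOM-recovery rule the reworded prompt repeats (halve per_device_train_batch_size, double gradient_accumulation_steps, so the effective batch size is unchanged) can be sketched as follows. The helper name `shrink_batch_for_oom` is illustrative, not part of this repo, and it assumes an even batch size:

```python
def shrink_batch_for_oom(per_device_batch_size: int, grad_accum_steps: int) -> tuple[int, int]:
    """Halve the per-device batch and double accumulation so that
    per_device_batch_size * grad_accum_steps (the effective batch) is unchanged."""
    if per_device_batch_size <= 1:
        # Nothing left to halve — per the prompt, upgrade the GPU instead.
        raise ValueError("cannot shrink further; upgrade hardware")
    return per_device_batch_size // 2, grad_accum_steps * 2

print(shrink_batch_for_oom(8, 2))  # (4, 4) — effective batch stays 16
```

This preserves the user's requested training setup, unlike switching SFT to LoRA or truncating max_length.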
agent/tools/dataset_tools.py CHANGED
@@ -393,8 +393,9 @@ HF_INSPECT_DATASET_TOOL_SPEC = {
 " SFT: needs 'messages', 'text', or 'prompt'/'completion'\n"
 " DPO: needs 'prompt', 'chosen', 'rejected'\n"
 " GRPO: needs 'prompt'\n"
 "Training will fail with KeyError if columns don't match.\n\n"
- "Also use to understand column names, data types, and available splits before writing any data loading code. "
 "Supports private/gated datasets when HF_TOKEN is set."
 ),
 "parameters": {
 
 " SFT: needs 'messages', 'text', or 'prompt'/'completion'\n"
 " DPO: needs 'prompt', 'chosen', 'rejected'\n"
 " GRPO: needs 'prompt'\n"
+ "All datasets used for training have to be in conversational ChatML format to be compatible with HF libraries.\n"
 "Training will fail with KeyError if columns don't match.\n\n"
+ "Also use to get example datapoints, understand column names, data types, and available splits before writing any data loading code. "
 "Supports private/gated datasets when HF_TOKEN is set."
 ),
 "parameters": {
agent/tools/docs_tools.py CHANGED
@@ -845,9 +845,9 @@ DOC_ENDPOINTS = [
 EXPLORE_HF_DOCS_TOOL_SPEC = {
 "name": "explore_hf_docs",
 "description": (
- "Browse HF documentation structure — discover available pages with 200-char previews.\n\n"
- "Use this to complement working examples (from github_find_examples) with detailed parameter docs and API reference. "
- "Not a substitute for reading working code first.\n\n"
 "Pattern: explore_hf_docs (find relevant pages) → fetch_hf_docs (get full content).\n\n"
 "For training tasks: fetch the trainer config docs (SFTConfig, DPOConfig, GRPOConfig) to verify parameter names. "
 "Returns top 20 results by default; set max_results (max 50) to adjust."
@@ -924,8 +924,8 @@ HF_DOCS_FETCH_TOOL_SPEC = {
 "name": "fetch_hf_docs",
 "description": (
 "Fetch full markdown content of an HF documentation page. Use after explore_hf_docs.\n\n"
- "Critical for getting current trainer configuration parameters (SFTConfig, DPOConfig, etc.) "
- "before writing training scripts. Your internal knowledge of parameter names is outdated.\n\n"
 "Provide the full URL from explore_hf_docs results. The .md extension is added automatically."
 ),
 "parameters": {
 
 EXPLORE_HF_DOCS_TOOL_SPEC = {
 "name": "explore_hf_docs",
 "description": (
+ "Browse HF documentation structure — discover all available documentation with 200-char previews.\n\n"
+ "Use this to find relevant documentation and/or examples with detailed parameter docs and API reference. "
+ "To be used together with github_find_examples and github_read_file to find working examples and documentation.\n\n"
 "Pattern: explore_hf_docs (find relevant pages) → fetch_hf_docs (get full content).\n\n"
 "For training tasks: fetch the trainer config docs (SFTConfig, DPOConfig, GRPOConfig) to verify parameter names. "
 "Returns top 20 results by default; set max_results (max 50) to adjust."
 
 "name": "fetch_hf_docs",
 "description": (
 "Fetch full markdown content of an HF documentation page. Use after explore_hf_docs.\n\n"
+ "Critical for finding documentation e.g. current trainer configuration parameters (SFTConfig, DPOConfig, etc.) "
+ "Use for researching solutions and before writing training scripts. Your internal knowledge is outdated.\n\n"
 "Provide the full URL from explore_hf_docs results. The .md extension is added automatically."
 ),
 "parameters": {
agent/tools/github_find_examples.py CHANGED
@@ -405,10 +405,10 @@ def find_examples(
 GITHUB_FIND_EXAMPLES_TOOL_SPEC = {
 "name": "github_find_examples",
 "description": (
- "Find working example scripts in GitHub repositories (examples/, scripts/, tutorials/ directories). "
 "Uses fuzzy keyword matching.\n\n"
 "MANDATORY before writing any ML training, fine-tuning, or inference code. "
- "Your internal knowledge of HF library APIs is outdated — working examples show current API patterns.\n\n"
 "Sequence: github_find_examples → github_read_file (study the example) → implement based on what you found.\n\n"
 "Skip this only for: simple data queries, status checks, non-code tasks.\n\n"
 "Examples:\n"
 
 GITHUB_FIND_EXAMPLES_TOOL_SPEC = {
 "name": "github_find_examples",
 "description": (
+ "Find working example scripts in GitHub repositories (from a list of predetermined directories e.g. examples/, scripts/, tutorials/, etc.). "
 "Uses fuzzy keyword matching.\n\n"
 "MANDATORY before writing any ML training, fine-tuning, or inference code. "
+ "Your internal knowledge of library APIs is outdated — working examples show current API patterns.\n\n"
 "Sequence: github_find_examples → github_read_file (study the example) → implement based on what you found.\n\n"
 "Skip this only for: simple data queries, status checks, non-code tasks.\n\n"
 "Examples:\n"
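The diff doesn't show how the "fuzzy keyword matching" is scored, so this is only an assumed sketch of the idea: rank candidate example paths by how many query keywords they contain.

```python
def score(path: str, keywords: list[str]) -> int:
    # Count how many query keywords appear as substrings of the path.
    lowered = path.lower()
    return sum(kw.lower() in lowered for kw in keywords)

# Hypothetical candidate paths from the scanned directories.
paths = [
    "examples/sft_trainer.py",
    "scripts/dpo_llama.py",
    "tutorials/quantization.md",
]
best = max(paths, key=lambda p: score(p, ["sft", "trainer"]))
print(best)  # examples/sft_trainer.py
```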
agent/tools/jobs_tool.py CHANGED
@@ -9,7 +9,7 @@ import base64
 import http.client
 import os
 import re
- from typing import Any, Dict, Literal, Optional, Callable, Awaitable
 
 import httpx
 from huggingface_hub import HfApi
@@ -25,38 +25,33 @@ from agent.tools.utilities import (
 )
 
 # Hardware flavors
- CPU_FLAVORS = ["cpu-basic", "cpu-upgrade", "cpu-performance", "cpu-xl"]
 GPU_FLAVORS = [
- "sprx8",
- "zero-a10g",
 "t4-small",
 "t4-medium",
- "l4x1",
- "l4x4",
- "l40sx1",
- "l40sx4",
- "l40sx8",
 "a10g-small",
 "a10g-large",
 "a10g-largex2",
 "a10g-largex4",
 "a100-large",
- "h100",
- "h100x8",
 ]
 
 # Detailed specs for display (vCPU/RAM/GPU VRAM)
- CPU_FLAVORS_DESC = (
- "cpu-basic(2vCPU/16GB), cpu-upgrade(8vCPU/32GB), cpu-performance, cpu-xl"
- )
 GPU_FLAVORS_DESC = (
 "t4-small(4vCPU/15GB/GPU 16GB), t4-medium(8vCPU/30GB/GPU 16GB), "
- "l4x1(8vCPU/30GB/GPU 24GB), l4x4(48vCPU/186GB/GPU 96GB), "
- "l40sx1(8vCPU/62GB/GPU 48GB), l40sx4(48vCPU/382GB/GPU 192GB), l40sx8(192vCPU/1534GB/GPU 384GB), "
- "a10g-small(4vCPU/14GB/GPU 24GB), a10g-large(12vCPU/46GB/GPU 24GB), "
 "a10g-largex2(24vCPU/92GB/GPU 48GB), a10g-largex4(48vCPU/184GB/GPU 96GB), "
- "a100-large(12vCPU/142GB/GPU 80GB), h100(23vCPU/240GB/GPU 80GB), h100x8(184vCPU/1920GB/GPU 640GB), "
- "zero-a10g(dynamic alloc)"
 )
 SPECIALIZED_FLAVORS = ["inf2x6"]
 ALL_FLAVORS = CPU_FLAVORS + GPU_FLAVORS + SPECIALIZED_FLAVORS
@@ -389,7 +384,9 @@ class HfJobsTool:
 def log_producer():
 try:
 # fetch_job_logs is a blocking sync generator
- logs_gen = self.api.fetch_job_logs(job_id=job_id, namespace=namespace)
 for line in logs_gen:
 # Push line to queue thread-safely
 loop.call_soon_threadsafe(queue.put_nowait, line)
@@ -907,16 +904,14 @@ HF_JOBS_TOOL_SPEC = {
 "Common picks: t4-small ($0.60/hr, 1-3B), a10g-large ($2/hr, 7-13B), a100-large ($4/hr, 30B+), h100 ($6/hr, 70B+). "
 "Note: a10g-small and a10g-large have the SAME 24GB GPU — the difference is CPU/RAM only.\n\n"
 "OOM RECOVERY: When a training job fails with CUDA OOM:\n"
- "1. Reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally (keeps effective batch size identical)\n"
 "2. Enable gradient_checkpointing=True\n"
 "3. Upgrade to larger GPU (a10g→a100→h100)\n"
 "Do NOT switch training methods (e.g. full SFT to LoRA) or reduce max_length — those change what the user gets and require explicit approval.\n\n"
- "After submission: return immediately with job ID, monitoring URL, expected duration and cost. "
- "Do not poll logs unless the user asks.\n\n"
 "Examples:\n"
- "Training: {'operation': 'run', 'script': '/app/train.py', 'dependencies': ['transformers', 'trl', 'torch', 'datasets', 'trackio'], 'hardware_flavor': 'a10g-large', 'timeout': '4h'}\n"
- "Data processing: {'operation': 'run', 'script': '<inline>', 'dependencies': ['datasets'], 'hardware_flavor': 'cpu-upgrade', 'timeout': '2h'}\n"
 "Monitor: {'operation': 'ps'}, {'operation': 'logs', 'job_id': 'xxx'}, {'operation': 'cancel', 'job_id': 'xxx'}"
 ),
 "parameters": {
 "type": "object",
@@ -1030,6 +1025,7 @@ async def hf_jobs_handler(
 )
 if is_path:
 import shlex
 result = await asyncio.to_thread(sandbox.bash, f"cat {shlex.quote(script)}")
 if not result.success:
 return f"Failed to read {script} from sandbox: {result.error}", False
 
 import http.client
 import os
 import re
+ from typing import Any, Awaitable, Callable, Dict, Literal, Optional
 
 import httpx
 from huggingface_hub import HfApi
 
 )
 
 # Hardware flavors
+ CPU_FLAVORS = ["cpu-basic", "cpu-upgrade"]
 GPU_FLAVORS = [
 "t4-small",
 "t4-medium",
 "a10g-small",
 "a10g-large",
 "a10g-largex2",
 "a10g-largex4",
 "a100-large",
+ "a100x4",
+ "a100x8",
+ "l4x1",
+ "l4x4",
+ "l40sx1",
+ "l40sx4",
+ "l40sx8",
 ]
 
 # Detailed specs for display (vCPU/RAM/GPU VRAM)
+ CPU_FLAVORS_DESC = "cpu-basic(2vCPU/16GB), cpu-upgrade(8vCPU/32GB)"
 GPU_FLAVORS_DESC = (
 "t4-small(4vCPU/15GB/GPU 16GB), t4-medium(8vCPU/30GB/GPU 16GB), "
+ "a10g-small(4vCPU/15GB/GPU 24GB), a10g-large(12vCPU/46GB/GPU 24GB), "
 "a10g-largex2(24vCPU/92GB/GPU 48GB), a10g-largex4(48vCPU/184GB/GPU 96GB), "
+ "a100-large(12vCPU/142GB/GPU 80GB), a100x4(48vCPU/568GB/GPU 320GB), a100x8(96vCPU/1136GB/GPU 640GB), "
+ "l4x1(8vCPU/30GB/GPU 24GB), l4x4(48vCPU/186GB/GPU 96GB), "
+ "l40sx1(8vCPU/62GB/GPU 48GB), l40sx4(48vCPU/382GB/GPU 192GB), l40sx8(192vCPU/1534GB/GPU 384GB)"
 )
 SPECIALIZED_FLAVORS = ["inf2x6"]
 ALL_FLAVORS = CPU_FLAVORS + GPU_FLAVORS + SPECIALIZED_FLAVORS
 
 def log_producer():
 try:
 # fetch_job_logs is a blocking sync generator
+ logs_gen = self.api.fetch_job_logs(
+ job_id=job_id, namespace=namespace
+ )
 for line in logs_gen:
 # Push line to queue thread-safely
 loop.call_soon_threadsafe(queue.put_nowait, line)
 
 "Common picks: t4-small ($0.60/hr, 1-3B), a10g-large ($2/hr, 7-13B), a100-large ($4/hr, 30B+), h100 ($6/hr, 70B+). "
 "Note: a10g-small and a10g-large have the SAME 24GB GPU — the difference is CPU/RAM only.\n\n"
 "OOM RECOVERY: When a training job fails with CUDA OOM:\n"
+ "1. Reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally (keep effective batch size identical)\n"
 "2. Enable gradient_checkpointing=True\n"
 "3. Upgrade to larger GPU (a10g→a100→h100)\n"
 "Do NOT switch training methods (e.g. full SFT to LoRA) or reduce max_length — those change what the user gets and require explicit approval.\n\n"
 "Examples:\n"
+ "Training: {'operation': 'run', 'script': '/app/train.py', 'dependencies': ['transformers', 'trl', 'torch', 'datasets', 'trackio'], 'hardware_flavor': 'a100-large', 'timeout': '8h'}\n"
 "Monitor: {'operation': 'ps'}, {'operation': 'logs', 'job_id': 'xxx'}, {'operation': 'cancel', 'job_id': 'xxx'}"
+ "Docker: {'operation': 'run', 'command': ['duckdb', '-c', 'select 1 + 2'], 'image': 'duckdb/duckdb', 'hardware_flavor': 'cpu-basic', 'timeout': '1h'}\n"
 ),
 "parameters": {
 "type": "object",
 
 )
 if is_path:
 import shlex
+
 result = await asyncio.to_thread(sandbox.bash, f"cat {shlex.quote(script)}")
 if not result.success:
 return f"Failed to read {script} from sandbox: {result.error}", False
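The updated hardware-sizing table from the prompt diff reads naturally as a threshold lookup. The thresholds below come from that table; the helper itself and the handling of the 13-30B gap are assumptions:

```python
def pick_flavor(params_billion: float) -> str:
    """Map model size to a flavor per the reworded sizing table (sketch)."""
    if params_billion >= 70:
        return "a100x8"
    if params_billion >= 30:
        return "a100x4"  # the table also allows l40sx4 here
    if params_billion >= 7:
        return "a100-large"
    return "a10g-largex2"

print(pick_flavor(13))  # a100-large
print(pick_flavor(70))  # a100x8
```

All four return values are members of the expanded GPU_FLAVORS list above, so a sketch like this would at least pick a valid flavor.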