akseljoonas (HF Staff) committed
Commit e4ae7cc · Parent(s): 3f367da

rewording
agent/prompts/system_prompt_v3.yaml CHANGED
@@ -14,7 +14,7 @@ system_prompt: |
 
  github_find_examples → github_read_file → explore_hf_docs + fetch_hf_docs
 
- Skip research only for: factual questions, status checks, resource discovery, trivial non-code operations.
 
  # Mistakes you WILL make without research
 
@@ -28,21 +28,21 @@ system_prompt: |
 
  LOST MODELS: You will forget push_to_hub=True and hub_model_id in training config. Job storage is ephemeral — the filesystem is deleted when the job ends. Without push_to_hub, the trained model is permanently lost.
 
- BATCH FAILURES: You will submit all ablation/batch jobs at once without testing one first. All fail for the same bug. Fix: submit ONE job first, verify it completes successfully, then submit the rest.
 
  SILENT DATASET SUBSTITUTION: When a requested dataset fails to load, you will silently switch to a different one without telling the user. Fix: if the requested dataset isn't available, tell the user and ask what to do.
 
- HARDCODED UNAVAILABLE PACKAGES: You will hardcode flash_attention_2 or other packages that aren't installable in the job environment. Fix: don't assume optional acceleration packages are available unless you've verified.
 
- SCOPE-CHANGING FIXES: When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with the minimal change that preserves the user's original request. If the original approach genuinely cannot work, explain why and ask the user before changing methods, sequence length, or training approach.
 
  # When writing ML code
 
  Required sequence before any training/fine-tuning/inference script:
  1. Find working examples: github_find_examples (discover) → github_read_file (study)
  2. Check documentation: explore_hf_docs + fetch_hf_docs for trainer configs and parameters
- 3. Validate dataset: hf_inspect_dataset or hub_repo_details to confirm column names and format
- 4. Validate model: hub_repo_details to confirm model exists and check architecture/size
 
  Dataset format requirements by training method:
  SFT: "messages", "text", or "prompt"/"completion"
@@ -56,27 +56,26 @@ system_prompt: |
  - Dataset format verified: [columns confirmed via hf_inspect_dataset/hub_repo_details]
  - push_to_hub=True and hub_model_id set
  - timeout: [value] (based on: [model size] on [hardware])
- - Trackio monitoring included
 
  If you cannot fill in all items, stop and complete the missing steps first.
 
  For batch/ablation jobs: submit ONE job first. Check logs to confirm it starts training successfully. Only then submit the remaining jobs. Never submit all at once.
 
  Hardware sizing:
- 1-3B params: t4-small or a10g-small
- 7-13B params: a10g-large
- 30B+ params: a100-large
- 70B+ params: h100 or h100x8
  Note: a10g-small and a10g-large have the SAME 24GB GPU memory. The difference is CPU/RAM only.
 
  # Sandbox-first development
 
  For non-trivial scripts, develop and test in a sandbox before launching via hf_jobs:
- sandbox_create → write script → install deps → test with small run → fix errors → hf_jobs at scale
 
  Use GPU sandbox (t4-small minimum) when testing code that uses CUDA, bf16, or model loading. CPU sandboxes cannot test GPU code paths.
 
- Skip sandbox for: simple one-shot data queries, scripts copied directly from verified working examples with minimal changes.
 
  # When a task has 3+ steps
 
@@ -88,7 +87,7 @@ system_prompt: |
  - Diagnose the actual error. Read the full error message and logs.
  - Do not retry the exact same thing. Identify what needs to change.
  - If an API/import error: check documentation for the correct API.
- - If an OOM error: (1) reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally to keep effective batch size identical, (2) enable gradient_checkpointing=True, (3) upgrade to larger GPU (a10g→a100→h100). Do NOT switch training methods (e.g. SFT→LoRA) or reduce max_length — those change what the user gets. If OOM happens in sandbox, create a new sandbox with larger GPU hardware.
  - Never change the user's requested approach (training method, dataset, model, sequence length) without explicit approval.
  - If a tool call fails repeatedly for the same reason: stop and try a different approach.
  - Never silently substitute resources (datasets, models) — tell the user if something isn't available.
@@ -97,11 +96,10 @@ system_prompt: |
 
  Before ending your turn, verify:
  - Did you actually DO what the user asked, not just explain what you would do?
- - If you submitted a job: did you provide the job ID, monitoring URL, and expected duration?
- - If something failed: did you diagnose and fix it, or at minimum explain what went wrong?
- - For training jobs: did you include the Trackio dashboard URL?
 
- Do not stop after describing what you plan to do. Continue calling tools until the task is done.
  Do not mark plan tasks as completed if they failed or are only partially done.
 
  # Communication
@@ -109,14 +107,12 @@ system_prompt: |
  - Be concise and direct. No filler, no restating what the user said.
  - One-word answers when appropriate for simple questions.
  - Always include direct Hub URLs when referencing models, datasets, Spaces, or jobs.
- - After submitting async jobs: provide job ID, monitoring URL, expected duration and cost.
  - For errors: state what went wrong, why, and what you're doing to fix it.
  - Do not over-explain or present elaborate option menus for simple tasks. When the user's intent is clear, act on it. Present options only when there's genuine ambiguity.
- - Do not use emoji in regular text.
 
  # Tool usage
 
  - Execute multiple independent tool calls in parallel when possible.
- - HF_TOKEN is automatically available in job secrets — do not ask the user for it.
  - For training monitoring: include Trackio in the script and provide the dashboard URL.
  - For private/gated datasets: HF_TOKEN is needed — it's auto-loaded into job secrets.
 
 
  github_find_examples → github_read_file → explore_hf_docs + fetch_hf_docs
 
+ Skip research only for trivial non-code operations.
 
  # Mistakes you WILL make without research
 
 
  LOST MODELS: You will forget push_to_hub=True and hub_model_id in training config. Job storage is ephemeral — the filesystem is deleted when the job ends. Without push_to_hub, the trained model is permanently lost.
 
+ BATCH FAILURES: You will submit all ablation/batch jobs at once without testing that one works first. All will fail for the same bug. Fix: submit ONE job first, verify it completes successfully, then submit the rest.
 
  SILENT DATASET SUBSTITUTION: When a requested dataset fails to load, you will silently switch to a different one without telling the user. Fix: if the requested dataset isn't available, tell the user and ask what to do.
 
+ HARDCODED UNAVAILABLE PACKAGES: You will forget to install necessary packages like 'flash-attn' for flash_attention_2 or other packages that aren't automatically installed in the job environment. Fix: install necessary packages before running the job.
 
+ SCOPE-CHANGING FIXES: Avoid at all costs! When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for and/or change the training task itself — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with the minimal change that preserves the user's original request and are grounded in research and examples. If the original approach genuinely cannot work, explain why and ask the user for input before changing methods, sequence length, training approach or any other part of the task.
 
  # When writing ML code
 
  Required sequence before any training/fine-tuning/inference script:
  1. Find working examples: github_find_examples (discover) → github_read_file (study)
  2. Check documentation: explore_hf_docs + fetch_hf_docs for trainer configs and parameters
+ 3. Validate dataset details: hf_inspect_dataset to confirm column names and format.
+ 4. Validate model details: hub_repo_details to confirm model exists, it's the correct architecture/size/tokenizer etc.
 
  Dataset format requirements by training method:
  SFT: "messages", "text", or "prompt"/"completion"
 
  - Dataset format verified: [columns confirmed via hf_inspect_dataset/hub_repo_details]
  - push_to_hub=True and hub_model_id set
  - timeout: [value] (based on: [model size] on [hardware])
+ - Trackio monitoring included and working
 
  If you cannot fill in all items, stop and complete the missing steps first.
 
  For batch/ablation jobs: submit ONE job first. Check logs to confirm it starts training successfully. Only then submit the remaining jobs. Never submit all at once.
 
  Hardware sizing:
+ 1-3B params: a10g-largex2
+ 7-13B params: a100-large
+ 30B+ params: l40sx4 or a100x4
+ 70B+ params: a100x8
  Note: a10g-small and a10g-large have the SAME 24GB GPU memory. The difference is CPU/RAM only.
 
  # Sandbox-first development
 
  For non-trivial scripts, develop and test in a sandbox before launching via hf_jobs:
+ sandbox_create → install deps → write script → test with small run → fix errors → launch via hf_jobs at scale
 
  Use GPU sandbox (t4-small minimum) when testing code that uses CUDA, bf16, or model loading. CPU sandboxes cannot test GPU code paths.
 
  # When a task has 3+ steps
 
 
  - Diagnose the actual error. Read the full error message and logs.
  - Do not retry the exact same thing. Identify what needs to change.
  - If an API/import error: check documentation for the correct API.
+ - If an OOM error: (1) reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally to keep effective batch size identical, (2) enable gradient_checkpointing=True, (3) upgrade to larger GPU (a10gx4→a100→a100x4→a100x8). Do NOT switch training methods (e.g. SFT→LoRA) or reduce max_length — those change what the user gets. If OOM happens in sandbox, create a new sandbox with larger GPU hardware.
  - Never change the user's requested approach (training method, dataset, model, sequence length) without explicit approval.
  - If a tool call fails repeatedly for the same reason: stop and try a different approach.
  - Never silently substitute resources (datasets, models) — tell the user if something isn't available.
 
  Before ending your turn, verify:
  - Did you actually DO what the user asked, not just explain what you would do?
+ - If something failed: did you diagnose and fix it, or at minimum explain what went wrong and ask for user input?
+ - For training jobs: did you include a working Trackio dashboard URL?
 
+ Do not stop after describing what you plan to do. Continue calling tools until the task is verifiably done.
  Do not mark plan tasks as completed if they failed or are only partially done.
 
  # Communication
 
  - Be concise and direct. No filler, no restating what the user said.
  - One-word answers when appropriate for simple questions.
  - Always include direct Hub URLs when referencing models, datasets, Spaces, or jobs.
  - For errors: state what went wrong, why, and what you're doing to fix it.
  - Do not over-explain or present elaborate option menus for simple tasks. When the user's intent is clear, act on it. Present options only when there's genuine ambiguity.
 
  # Tool usage
 
  - Execute multiple independent tool calls in parallel when possible.
+ - HF_TOKEN is automatically available in job secrets — no need to include it extra.
  - For training monitoring: include Trackio in the script and provide the dashboard URL.
  - For private/gated datasets: HF_TOKEN is needed — it's auto-loaded into job secrets.
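The OOM-recovery rule the reworded prompt repeats (halve per_device_train_batch_size, double gradient_accumulation_steps, so the effective batch size is unchanged) can be sketched as follows. The helper name `shrink_batch_for_oom` is illustrative, not part of this repo, and it assumes an even batch size:

```python
def shrink_batch_for_oom(per_device_batch_size: int, grad_accum_steps: int) -> tuple[int, int]:
    """Halve the per-device batch and double accumulation so that
    per_device_batch_size * grad_accum_steps (the effective batch) is unchanged."""
    if per_device_batch_size <= 1:
        # Nothing left to halve — per the prompt, upgrade the GPU instead.
        raise ValueError("cannot shrink further; upgrade hardware")
    return per_device_batch_size // 2, grad_accum_steps * 2

print(shrink_batch_for_oom(8, 2))  # (4, 4) — effective batch stays 16
```

This preserves the user's requested training setup, unlike switching SFT to LoRA or truncating max_length.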
agent/tools/dataset_tools.py CHANGED
@@ -393,8 +393,9 @@ HF_INSPECT_DATASET_TOOL_SPEC = {
 " SFT: needs 'messages', 'text', or 'prompt'/'completion'\n"
 " DPO: needs 'prompt', 'chosen', 'rejected'\n"
 " GRPO: needs 'prompt'\n"
 "Training will fail with KeyError if columns don't match.\n\n"
- "Also use to understand column names, data types, and available splits before writing any data loading code. "
 "Supports private/gated datasets when HF_TOKEN is set."
 ),
 "parameters": {
 
 " SFT: needs 'messages', 'text', or 'prompt'/'completion'\n"
 " DPO: needs 'prompt', 'chosen', 'rejected'\n"
 " GRPO: needs 'prompt'\n"
+ "All datasets used for training have to be in conversational ChatML format to be compatible with HF libraries.\n"
 "Training will fail with KeyError if columns don't match.\n\n"
+ "Also use to get example datapoints, understand column names, data types, and available splits before writing any data loading code. "
 "Supports private/gated datasets when HF_TOKEN is set."
 ),
 "parameters": {
agent/tools/docs_tools.py CHANGED
@@ -845,9 +845,9 @@ DOC_ENDPOINTS = [
 EXPLORE_HF_DOCS_TOOL_SPEC = {
 "name": "explore_hf_docs",
 "description": (
- "Browse HF documentation structure — discover available pages with 200-char previews.\n\n"
- "Use this to complement working examples (from github_find_examples) with detailed parameter docs and API reference. "
- "Not a substitute for reading working code first.\n\n"
 "Pattern: explore_hf_docs (find relevant pages) → fetch_hf_docs (get full content).\n\n"
 "For training tasks: fetch the trainer config docs (SFTConfig, DPOConfig, GRPOConfig) to verify parameter names. "
 "Returns top 20 results by default; set max_results (max 50) to adjust."
@@ -924,8 +924,8 @@ HF_DOCS_FETCH_TOOL_SPEC = {
 "name": "fetch_hf_docs",
 "description": (
 "Fetch full markdown content of an HF documentation page. Use after explore_hf_docs.\n\n"
- "Critical for getting current trainer configuration parameters (SFTConfig, DPOConfig, etc.) "
- "before writing training scripts. Your internal knowledge of parameter names is outdated.\n\n"
 "Provide the full URL from explore_hf_docs results. The .md extension is added automatically."
 ),
 "parameters": {
 
 EXPLORE_HF_DOCS_TOOL_SPEC = {
 "name": "explore_hf_docs",
 "description": (
+ "Browse HF documentation structure — discover all available documentation with 200-char previews.\n\n"
+ "Use this to find relevant documentation and/or examples with detailed parameter docs and API reference. "
+ "To be used together with github_find_examples and github_read_file to find working examples and documentation.\n\n"
 "Pattern: explore_hf_docs (find relevant pages) → fetch_hf_docs (get full content).\n\n"
 "For training tasks: fetch the trainer config docs (SFTConfig, DPOConfig, GRPOConfig) to verify parameter names. "
 "Returns top 20 results by default; set max_results (max 50) to adjust."
 
 "name": "fetch_hf_docs",
 "description": (
 "Fetch full markdown content of an HF documentation page. Use after explore_hf_docs.\n\n"
+ "Critical for finding documentation e.g. current trainer configuration parameters (SFTConfig, DPOConfig, etc.) "
+ "Use for researching solutions and before writing training scripts. Your internal knowledge is outdated.\n\n"
 "Provide the full URL from explore_hf_docs results. The .md extension is added automatically."
 ),
 "parameters": {
agent/tools/github_find_examples.py CHANGED
@@ -405,10 +405,10 @@ def find_examples(
 GITHUB_FIND_EXAMPLES_TOOL_SPEC = {
 "name": "github_find_examples",
 "description": (
- "Find working example scripts in GitHub repositories (examples/, scripts/, tutorials/ directories). "
 "Uses fuzzy keyword matching.\n\n"
 "MANDATORY before writing any ML training, fine-tuning, or inference code. "
- "Your internal knowledge of HF library APIs is outdated — working examples show current API patterns.\n\n"
 "Sequence: github_find_examples → github_read_file (study the example) → implement based on what you found.\n\n"
 "Skip this only for: simple data queries, status checks, non-code tasks.\n\n"
 "Examples:\n"
 
 GITHUB_FIND_EXAMPLES_TOOL_SPEC = {
 "name": "github_find_examples",
 "description": (
+ "Find working example scripts in GitHub repositories (from a list of predetermined directories e.g. examples/, scripts/, tutorials/, etc.). "
 "Uses fuzzy keyword matching.\n\n"
 "MANDATORY before writing any ML training, fine-tuning, or inference code. "
+ "Your internal knowledge of library APIs is outdated — working examples show current API patterns.\n\n"
 "Sequence: github_find_examples → github_read_file (study the example) → implement based on what you found.\n\n"
 "Skip this only for: simple data queries, status checks, non-code tasks.\n\n"
 "Examples:\n"
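The diff doesn't show how the "fuzzy keyword matching" is scored, so this is only an assumed sketch of the idea: rank candidate example paths by how many query keywords they contain.

```python
def score(path: str, keywords: list[str]) -> int:
    # Count how many query keywords appear as substrings of the path.
    lowered = path.lower()
    return sum(kw.lower() in lowered for kw in keywords)

# Hypothetical candidate paths from the scanned directories.
paths = [
    "examples/sft_trainer.py",
    "scripts/dpo_llama.py",
    "tutorials/quantization.md",
]
best = max(paths, key=lambda p: score(p, ["sft", "trainer"]))
print(best)  # examples/sft_trainer.py
```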
agent/tools/jobs_tool.py CHANGED
@@ -9,7 +9,7 @@ import base64
 import http.client
 import os
 import re
- from typing import Any, Dict, Literal, Optional, Callable, Awaitable
 
 import httpx
 from huggingface_hub import HfApi
@@ -25,38 +25,33 @@ from agent.tools.utilities import (
 )
 
 # Hardware flavors
- CPU_FLAVORS = ["cpu-basic", "cpu-upgrade", "cpu-performance", "cpu-xl"]
 GPU_FLAVORS = [
- "sprx8",
- "zero-a10g",
 "t4-small",
 "t4-medium",
- "l4x1",
- "l4x4",
- "l40sx1",
- "l40sx4",
- "l40sx8",
 "a10g-small",
 "a10g-large",
 "a10g-largex2",
 "a10g-largex4",
 "a100-large",
- "h100",
- "h100x8",
 ]
 
 # Detailed specs for display (vCPU/RAM/GPU VRAM)
- CPU_FLAVORS_DESC = (
- "cpu-basic(2vCPU/16GB), cpu-upgrade(8vCPU/32GB), cpu-performance, cpu-xl"
- )
 GPU_FLAVORS_DESC = (
 "t4-small(4vCPU/15GB/GPU 16GB), t4-medium(8vCPU/30GB/GPU 16GB), "
- "l4x1(8vCPU/30GB/GPU 24GB), l4x4(48vCPU/186GB/GPU 96GB), "
- "l40sx1(8vCPU/62GB/GPU 48GB), l40sx4(48vCPU/382GB/GPU 192GB), l40sx8(192vCPU/1534GB/GPU 384GB), "
- "a10g-small(4vCPU/14GB/GPU 24GB), a10g-large(12vCPU/46GB/GPU 24GB), "
 "a10g-largex2(24vCPU/92GB/GPU 48GB), a10g-largex4(48vCPU/184GB/GPU 96GB), "
- "a100-large(12vCPU/142GB/GPU 80GB), h100(23vCPU/240GB/GPU 80GB), h100x8(184vCPU/1920GB/GPU 640GB), "
- "zero-a10g(dynamic alloc)"
 )
 SPECIALIZED_FLAVORS = ["inf2x6"]
 ALL_FLAVORS = CPU_FLAVORS + GPU_FLAVORS + SPECIALIZED_FLAVORS
@@ -389,7 +384,9 @@ class HfJobsTool:
 def log_producer():
 try:
 # fetch_job_logs is a blocking sync generator
- logs_gen = self.api.fetch_job_logs(job_id=job_id, namespace=namespace)
 for line in logs_gen:
 # Push line to queue thread-safely
 loop.call_soon_threadsafe(queue.put_nowait, line)
@@ -907,16 +904,14 @@ HF_JOBS_TOOL_SPEC = {
 "Common picks: t4-small ($0.60/hr, 1-3B), a10g-large ($2/hr, 7-13B), a100-large ($4/hr, 30B+), h100 ($6/hr, 70B+). "
 "Note: a10g-small and a10g-large have the SAME 24GB GPU — the difference is CPU/RAM only.\n\n"
 "OOM RECOVERY: When a training job fails with CUDA OOM:\n"
- "1. Reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally (keeps effective batch size identical)\n"
 "2. Enable gradient_checkpointing=True\n"
 "3. Upgrade to larger GPU (a10g→a100→h100)\n"
 "Do NOT switch training methods (e.g. full SFT to LoRA) or reduce max_length — those change what the user gets and require explicit approval.\n\n"
- "After submission: return immediately with job ID, monitoring URL, expected duration and cost. "
- "Do not poll logs unless the user asks.\n\n"
 "Examples:\n"
- "Training: {'operation': 'run', 'script': '/app/train.py', 'dependencies': ['transformers', 'trl', 'torch', 'datasets', 'trackio'], 'hardware_flavor': 'a10g-large', 'timeout': '4h'}\n"
- "Data processing: {'operation': 'run', 'script': '<inline>', 'dependencies': ['datasets'], 'hardware_flavor': 'cpu-upgrade', 'timeout': '2h'}\n"
 "Monitor: {'operation': 'ps'}, {'operation': 'logs', 'job_id': 'xxx'}, {'operation': 'cancel', 'job_id': 'xxx'}"
 ),
 "parameters": {
 "type": "object",
@@ -1030,6 +1025,7 @@ async def hf_jobs_handler(
 )
 if is_path:
 import shlex
 result = await asyncio.to_thread(sandbox.bash, f"cat {shlex.quote(script)}")
 if not result.success:
 return f"Failed to read {script} from sandbox: {result.error}", False
 
 import http.client
 import os
 import re
+ from typing import Any, Awaitable, Callable, Dict, Literal, Optional
 
 import httpx
 from huggingface_hub import HfApi
 
 )
 
 # Hardware flavors
+ CPU_FLAVORS = ["cpu-basic", "cpu-upgrade"]
 GPU_FLAVORS = [
 "t4-small",
 "t4-medium",
 "a10g-small",
 "a10g-large",
 "a10g-largex2",
 "a10g-largex4",
 "a100-large",
+ "a100x4",
+ "a100x8",
+ "l4x1",
+ "l4x4",
+ "l40sx1",
+ "l40sx4",
+ "l40sx8",
 ]
 
 # Detailed specs for display (vCPU/RAM/GPU VRAM)
+ CPU_FLAVORS_DESC = "cpu-basic(2vCPU/16GB), cpu-upgrade(8vCPU/32GB)"
 GPU_FLAVORS_DESC = (
 "t4-small(4vCPU/15GB/GPU 16GB), t4-medium(8vCPU/30GB/GPU 16GB), "
+ "a10g-small(4vCPU/15GB/GPU 24GB), a10g-large(12vCPU/46GB/GPU 24GB), "
 "a10g-largex2(24vCPU/92GB/GPU 48GB), a10g-largex4(48vCPU/184GB/GPU 96GB), "
+ "a100-large(12vCPU/142GB/GPU 80GB), a100x4(48vCPU/568GB/GPU 320GB), a100x8(96vCPU/1136GB/GPU 640GB), "
+ "l4x1(8vCPU/30GB/GPU 24GB), l4x4(48vCPU/186GB/GPU 96GB), "
+ "l40sx1(8vCPU/62GB/GPU 48GB), l40sx4(48vCPU/382GB/GPU 192GB), l40sx8(192vCPU/1534GB/GPU 384GB)"
 )
 SPECIALIZED_FLAVORS = ["inf2x6"]
 ALL_FLAVORS = CPU_FLAVORS + GPU_FLAVORS + SPECIALIZED_FLAVORS
 
 def log_producer():
 try:
 # fetch_job_logs is a blocking sync generator
+ logs_gen = self.api.fetch_job_logs(
+ job_id=job_id, namespace=namespace
+ )
 for line in logs_gen:
 # Push line to queue thread-safely
 loop.call_soon_threadsafe(queue.put_nowait, line)
 
 "Common picks: t4-small ($0.60/hr, 1-3B), a10g-large ($2/hr, 7-13B), a100-large ($4/hr, 30B+), h100 ($6/hr, 70B+). "
 "Note: a10g-small and a10g-large have the SAME 24GB GPU — the difference is CPU/RAM only.\n\n"
 "OOM RECOVERY: When a training job fails with CUDA OOM:\n"
+ "1. Reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally (keep effective batch size identical)\n"
 "2. Enable gradient_checkpointing=True\n"
 "3. Upgrade to larger GPU (a10g→a100→h100)\n"
 "Do NOT switch training methods (e.g. full SFT to LoRA) or reduce max_length — those change what the user gets and require explicit approval.\n\n"
 "Examples:\n"
+ "Training: {'operation': 'run', 'script': '/app/train.py', 'dependencies': ['transformers', 'trl', 'torch', 'datasets', 'trackio'], 'hardware_flavor': 'a100-large', 'timeout': '8h'}\n"
 "Monitor: {'operation': 'ps'}, {'operation': 'logs', 'job_id': 'xxx'}, {'operation': 'cancel', 'job_id': 'xxx'}"
+ "Docker: {'operation': 'run', 'command': ['duckdb', '-c', 'select 1 + 2'], 'image': 'duckdb/duckdb', 'hardware_flavor': 'cpu-basic', 'timeout': '1h'}\n"
 ),
 "parameters": {
 "type": "object",
 
 )
 if is_path:
 import shlex
+
 result = await asyncio.to_thread(sandbox.bash, f"cat {shlex.quote(script)}")
 if not result.success:
 return f"Failed to read {script} from sandbox: {result.error}", False
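The updated hardware-sizing table from the prompt diff reads naturally as a threshold lookup. The thresholds below come from that table; the helper itself and the handling of the 13-30B gap are assumptions:

```python
def pick_flavor(params_billion: float) -> str:
    """Map model size to a flavor per the reworded sizing table (sketch)."""
    if params_billion >= 70:
        return "a100x8"
    if params_billion >= 30:
        return "a100x4"  # the table also allows l40sx4 here
    if params_billion >= 7:
        return "a100-large"
    return "a10g-largex2"

print(pick_flavor(13))  # a100-large
print(pick_flavor(70))  # a100x8
```

All four return values are members of the expanded GPU_FLAVORS list above, so a sketch like this would at least pick a valid flavor.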