
teaching an llm to sre: EnterpriseHPC-v0 on openenv

tl;dr we shipped an openenv-compliant gymnasium environment that simulates a 224-core rocky linux hpc cluster inside a single user-namespace sandbox, resets in 2.40 ms p50, and trains Qwen/Qwen2.5-Coder-7B-Instruct with trl grpo to recover a broken cluster end to end. the same training script runs locally, in colab, or against a fleet of hf spaces via --env-urls.

why

the slowest, highest-stakes work in enterprise infra is multi-app incident response. an open ondemand portal returns 502. the compute partition is drained. there is a failing slurmd somewhere. to fix it you hop from login to compute-01 over ssh, inspect route configs and munge keys, restart services in the right order, and verify via curl. frontier llms have never been trained on that loop.

EnterpriseHPC-v0 turns that loop into an rl environment.

what is inside

  • nested bwrap for lateral movement. ssh compute-01 chroots the shell into a separate rootfs so hostname and filesystem paths reflect the new node
  • fuse-overlayfs with upperdir and workdir on /dev/shm for microsecond copy-on-write. kernel overlayfs and a plain-copy fallback are supported for hosts without fuse privileges
  • a deterministic slurm state machine in /mnt/shared/slurm_state.json guarded by fcntl locks so many parallel rollouts cannot corrupt each other (see the sketch after this list)
  • python stubs for sinfo, squeue, systemctl, scontrol, curl, and ssh that read and mutate the json state, plus a lightweight open ondemand http server that returns 502 until the underlying fault is fixed
  • three scenarios ship today and are rotated per rollout
    • hpc_outage: compute-01 drained by a broken route-eth0
    • hpc_munge: compute-01 drained by a munge key with the wrong mode plus a broken route (chained)
    • hpc_pid_stale: slurmd refuses to restart after a reboot because of a leftover /var/run/slurmd.pid
  • the gymnasium env EnterpriseHPC-v0 wraps it all with pexpect so the policy experiences real interactive bash prompts
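
the locking pattern is the load-bearing piece: every stub serializes on the state file before touching it. a minimal sketch of that pattern, assuming an illustrative json schema (the real stubs and field names live in the repo):

```python
# illustrative stub pattern: take an exclusive fcntl lock on the shared
# state file, read the json, apply one mutation, write it back.
# the schema below (nodes -> compute-01 -> slurmd) is an assumption.
import fcntl
import json

STATE = "/mnt/shared/slurm_state.json"

def mutate_state(fn):
    """apply fn to the slurm state under an exclusive flock."""
    with open(STATE, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # parallel rollouts queue here
        try:
            state = json.load(f)
            fn(state)
            f.seek(0)
            f.truncate()
            json.dump(state, f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# e.g. what a systemctl stub might record after `systemctl restart slurmd`
mutate_state(lambda s: s["nodes"]["compute-01"].update(slurmd="running"))
```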

how fast

| mount | n | p50 ms | p95 ms | p99 ms | max ms |
| --- | ---: | ---: | ---: | ---: | ---: |
| copy | 100 | 2.40 | 2.56 | 2.58 | 2.87 |

that is in the ci-friendly copy mode. real fuse-overlayfs on a linux host drops well under 1 ms. reset latency is no longer the grpo bottleneck.
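
`make bench` is the canonical harness (see below), but the measurement itself is small enough to sketch. this assumes importing the package registers EnterpriseHPC-v0 with gymnasium:

```python
# hand-rolled reset-latency measurement, same shape as `make bench`:
# time 100 resets and report percentiles in milliseconds.
import statistics
import time

import gymnasium as gym

env = gym.make("EnterpriseHPC-v0")
samples_ms = []
for _ in range(100):
    t0 = time.perf_counter()
    env.reset()
    samples_ms.append((time.perf_counter() - t0) * 1000)

q = statistics.quantiles(samples_ms, n=100)  # 99 cut points
print(f"p50={q[49]:.2f}  p95={q[94]:.2f}  p99={q[98]:.2f}  max={max(samples_ms):.2f}")
```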

training with qwen2.5-coder

local training with unsloth + 4-bit qlora:

```bash
python -m training.train_hpc_outage \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --group-size 4 --max-turns 12 \
  --num-train-steps 100 \
  --scenarios hpc_outage,hpc_munge,hpc_pid_stale
```

remote training against hosted openenv spaces (same shape as the trl + openenv launch example, swapped to a code-tuned 7b policy):

```bash
python -m training.hpc_openenv_gemma \
  --env-urls https://<user>-enterprise-hpc-openenv.hf.space \
             https://<user>-enterprise-hpc-openenv-2.hf.space \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --group-size 4 --max-turns 12 --num-train-steps 200
```

submit to hf jobs:

```bash
python -m training.hf_jobs \
  --env-urls https://<user>-enterprise-hpc-openenv.hf.space \
  --gpu a10g-large \
  --num-train-steps 300
```

the training scripts use unsloth for 4-bit qlora loading and trl's GRPOTrainer with a custom rollout function that drives the env one turn at a time. the reward is binary, straight from the deterministic task grader, which is exactly the signal grpo wants.
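
the rollout shape is easy to picture. a minimal sketch, assuming a gymnasium-style step api and a policy_generate stand-in for the model call (both names are illustrative, not the repo's actual interface):

```python
# illustrative one-turn-at-a-time rollout: the policy reads the
# transcript so far, emits one shell command, and the episode ends
# on grader success or when the turn budget runs out.
import gymnasium as gym

def rollout(env: gym.Env, policy_generate, max_turns: int = 12) -> float:
    obs, info = env.reset()
    transcript = str(obs)
    for _ in range(max_turns):
        command = policy_generate(transcript)  # one bash command per turn
        obs, reward, terminated, truncated, info = env.step(command)
        transcript += f"\n$ {command}\n{obs}"
        if terminated or truncated:
            return float(reward)  # 1.0 if the grader passed, else 0.0
    return 0.0
```

grpo normalizes rewards within each group of rollouts, so even a sparse 0/1 grader signal yields a usable advantage as long as some rollouts in the group succeed.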

a colab notebook at training/hpc_colab.ipynb runs both the local and remote paths on a single t4 / l4 / a100.

what the agent learns

before training, a random policy wanders around sinfo and never edits the route file. after ~100 grpo steps the agent reliably (see the pexpect sketch after this list):

  1. runs sinfo and squeue to locate the drained node
  2. moves laterally with ssh compute-01
  3. inspects /etc/sysconfig/network-scripts/route-eth0
  4. writes the correct route with printf ... > (no heredocs allowed)
  5. for the munge variant, also chmod 0400 /etc/munge/munge.key
  6. restarts munge, then slurmd, in that order
  7. exits back to login and verifies with curl -I http://localhost:8080
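
because the env drives a real interactive shell through pexpect, the same recovery can be replayed by hand. a minimal sketch, where SANDBOX_CMD, the prompt regex, and the route contents are all placeholders rather than the repo's exact values:

```python
# replaying the gold recovery the way the env feeds the policy's
# commands into an interactive shell. values marked as placeholders
# are assumptions for illustration.
import pexpect

SANDBOX_CMD = "bash"   # placeholder for the env's sandbox entrypoint
PROMPT = r"\$ "        # placeholder prompt pattern

child = pexpect.spawn(SANDBOX_CMD, encoding="utf-8", timeout=30)
for cmd in [
    "sinfo",                                           # locate the drained node
    "ssh compute-01",                                  # lateral move
    "cat /etc/sysconfig/network-scripts/route-eth0",   # inspect the fault
    # route contents below are illustrative, not the scenario's real values
    "printf '10.0.0.0/24 via 10.0.0.1\\n' > /etc/sysconfig/network-scripts/route-eth0",
    "systemctl restart munge",                         # munge before slurmd
    "systemctl restart slurmd",
    "exit",                                            # back to login
    "curl -I http://localhost:8080",                   # verify the portal
]:
    child.expect(PROMPT)
    child.sendline(cmd)
```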

prove it is solvable

before any training, reviewers can run:

```bash
make gold   # deterministic gold-trajectory verifier
make eval   # gold vs random vs bad policies, writes runs/eval/leaderboard.md
make bench  # reset-latency benchmark
```

try it