Spaces:
Running
start.sh: gate 5 heavy harvest launchers behind LOW_MEM
Browse filesCPU-Basic Space cap is 16 GB. With uvicorn + 5 nohup'd bash launchers
(dedup-bootstrap, dataset-enrich, kaggle-trainer, lightning-trainer,
dataset-mirror) running concurrently while harvest workers hammered
/cursor/* at ~100 req/min, the container blew through 16 GB roughly
5 min after every boot and HF auto-killed it. The watchdog correctly
flagged 'HTTP layer hung' but the actual signal was OOM_KILL.
Fix: gate the 5 launchers behind LOW_MEM=0. LOW_MEM defaults to 1 (set
in the global env at line ~20 BEFORE any reference, so 'set -u' is
happy) so on CPU-Basic the launchers no-op and the Space stays
under ~2 GB resident β just FastAPI + Redis + status server.
Harvest pipeline: the 3 recurring launchers (dataset-enrich every 30m,
dataset-mirror every 6h, lightning-trainer daily 02:00 UTC) move to
GCP scheduler via hermes-jobs.json entries. dedup-bootstrap is
one-shot so doesn't need a cron. kaggle-trainer is launched manually
from Mac for V19 runs.
Re-enable in-Space launchers by setting LOW_MEM=0 (Space variable)
once we upgrade to a paid tier (cpu-upgrade β₯ 32 GB) or migrate to a
larger anchor like OCI A1.Flex 24 GB ARM.
|
@@ -15,6 +15,13 @@ set -x
|
|
| 15 |
# Echo stdout so HF run-logs see progress (safe steps before .env is loaded)
|
| 16 |
exec > >(tee -a "$LOG_DIR/boot.log") 2>&1
|
| 17 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 18 |
# ββ 1. Persistent data β symlink state subdirs to /data (HF persistent mount) ββ
|
| 19 |
# bin/ is NOT persisted (baked into image, refreshed on every push).
|
| 20 |
# Persisted: state (DBs), logs, memory, skills, sessions, training pairs,
|
|
@@ -86,34 +93,38 @@ if [[ -d "$DATA" ]] && [[ -w "$DATA" ]]; then
|
|
| 86 |
fi
|
| 87 |
fi
|
| 88 |
|
| 89 |
-
# ββ
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 94 |
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
echo "[$(date +%H:%M:%S)] boot-time lightning-trainer kicked off (H200 4hr quota)" >> "$LOG_DIR/boot.log"
|
| 110 |
-
|
| 111 |
-
# ββ BOOT-TIME dataset-mirror β bulk-clone top community SFT mixes ββββββ
|
| 112 |
-
# Far faster than streaming-and-normalize β 1 commit per parquet file
|
| 113 |
-
# = millions of pairs landing as raw mirrors/<slug>/<file>. Idempotent
|
| 114 |
-
# via stamp file so we don't re-mirror what's already been pulled.
|
| 115 |
-
nohup bash "${HOME}/.surrogate/bin/dataset-mirror.sh" >> "$LOG_DIR/dataset-mirror.log" 2>&1 &
|
| 116 |
-
echo "[$(date +%H:%M:%S)] boot-time dataset-mirror kicked off (30 community sources)" >> "$LOG_DIR/boot.log"
|
| 117 |
|
| 118 |
echo "[$(date +%H:%M:%S)] persistent /data linked (state, logs, memory, skills, sessions, workspace, ollama, training-pairs)" >> "$LOG_DIR/boot.log"
|
| 119 |
else
|
|
|
|
| 15 |
# Echo stdout so HF run-logs see progress (safe steps before .env is loaded)
|
| 16 |
exec > >(tee -a "$LOG_DIR/boot.log") 2>&1
|
| 17 |
|
| 18 |
+
# ββ Memory mode (must be set BEFORE any reference; we use `set -u`) βββββββ
|
| 19 |
+
# CPU-Basic Space = 16 GB cap. With LOW_MEM=1 we skip the heavy harvest
|
| 20 |
+
# launchers (dataset-enrich, dataset-mirror, kaggle-trainer, lightning-trainer,
|
| 21 |
+
# dedup-bootstrap) β those run on GCP daemons instead. Set LOW_MEM=0 only
|
| 22 |
+
# on a paid Space tier (cpu-upgrade β₯32 GB).
|
| 23 |
+
LOW_MEM="${LOW_MEM:-1}"
|
| 24 |
+
|
| 25 |
# ββ 1. Persistent data β symlink state subdirs to /data (HF persistent mount) ββ
|
| 26 |
# bin/ is NOT persisted (baked into image, refreshed on every push).
|
| 27 |
# Persisted: state (DBs), logs, memory, skills, sessions, training pairs,
|
|
|
|
| 93 |
fi
|
| 94 |
fi
|
| 95 |
|
| 96 |
+
# ββ Heavy harvest launchers β only on HIGH_MEM (LOW_MEM=0) βββββββββββ
|
| 97 |
+
# On CPU-Basic (16 GB cap) launching 5 background bash + uvicorn + 5 harvest
|
| 98 |
+
# workers blew through the cap and HF auto-killed the container ~5 min after
|
| 99 |
+
# boot. These launchers are now scheduled on GCP via hermes-scheduler-daemon
|
| 100 |
+
# (entries in data/hermes-jobs.json) so harvest still runs β just not from
|
| 101 |
+
# inside the Space's RAM. Re-enable in-Space by setting LOW_MEM=0 once we
|
| 102 |
+
# upgrade to a β₯32 GB tier.
|
| 103 |
+
if [[ "$LOW_MEM" != "1" ]]; then
|
| 104 |
+
# ββ One-time central dedup bootstrap from existing data ββββββββββ
|
| 105 |
+
if [[ ! -f "${HOME}/.surrogate/.dedup-bootstrap-done" ]]; then
|
| 106 |
+
echo "[$(date +%H:%M:%S)] running central dedup bootstrap (one-time)" >> "$LOG_DIR/boot.log"
|
| 107 |
+
nohup bash "${HOME}/.surrogate/bin/dedup-bootstrap.sh" > "$LOG_DIR/dedup-bootstrap.log" 2>&1 &
|
| 108 |
+
fi
|
| 109 |
+
|
| 110 |
+
# ββ BOOT-TIME enrich kickoff (trigger immediate pull, don't wait for cron)
|
| 111 |
+
nohup bash "${HOME}/.surrogate/bin/dataset-enrich.sh" >> "$LOG_DIR/dataset-enrich.log" 2>&1 &
|
| 112 |
+
echo "[$(date +%H:%M:%S)] boot-time dataset-enrich kicked off" >> "$LOG_DIR/boot.log"
|
| 113 |
|
| 114 |
+
# ββ BOOT-TIME kaggle-trainer kickoff (don't wait for 90-min cron) β
|
| 115 |
+
nohup bash "${HOME}/.surrogate/bin/kaggle-trainer.sh" >> "$LOG_DIR/kaggle-trainer.log" 2>&1 &
|
| 116 |
+
echo "[$(date +%H:%M:%S)] boot-time kaggle-trainer kicked off" >> "$LOG_DIR/boot.log"
|
| 117 |
+
|
| 118 |
+
# ββ BOOT-TIME lightning-trainer kickoff β H200 4 hr free for big model
|
| 119 |
+
nohup bash "${HOME}/.surrogate/bin/lightning-trainer.sh" >> "$LOG_DIR/lightning-trainer.log" 2>&1 &
|
| 120 |
+
echo "[$(date +%H:%M:%S)] boot-time lightning-trainer kicked off (H200 4hr quota)" >> "$LOG_DIR/boot.log"
|
| 121 |
+
|
| 122 |
+
# ββ BOOT-TIME dataset-mirror β bulk-clone top community SFT mixes β
|
| 123 |
+
nohup bash "${HOME}/.surrogate/bin/dataset-mirror.sh" >> "$LOG_DIR/dataset-mirror.log" 2>&1 &
|
| 124 |
+
echo "[$(date +%H:%M:%S)] boot-time dataset-mirror kicked off (30 community sources)" >> "$LOG_DIR/boot.log"
|
| 125 |
+
else
|
| 126 |
+
echo "[$(date +%H:%M:%S)] LOW_MEM=1 β skipped 5 heavy harvest launchers (delegated to GCP daemons)" >> "$LOG_DIR/boot.log"
|
| 127 |
+
fi
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 128 |
|
| 129 |
echo "[$(date +%H:%M:%S)] persistent /data linked (state, logs, memory, skills, sessions, workspace, ollama, training-pairs)" >> "$LOG_DIR/boot.log"
|
| 130 |
else
|