Andrej Janchevski committed
Commit 5375c2e · 1 Parent(s): ce3d8f2

fix(deploy): harden container runtime and unify research / checkpoint roots

Multiple production-launch fixes that fall under "make the container
actually start and stay up reliably".

settings.py + Dockerfile: unify CHECKPOINTS_ROOT with RESEARCH_ROOT
(/app/research). The earlier split (CHECKPOINTS_ROOT=/app/checkpoints
on a named volume) caused a path mismatch — configs and Loader caches
live with the bundled research code under RESEARCH_ROOT, while
HF-Hub-downloaded weights landed in CHECKPOINTS_ROOT, breaking the
COINs experiment loader's lookups (e.g. configs/freebase.yml).
snapshot_download now writes alongside the bundled tree; on paid HF
Spaces with persistent storage, override CHECKPOINTS_ROOT to /data.
Also moves the RESEARCH_ROOT definition above the sys.path additions
so the in-image /app/research path is honoured.
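
The mismatch is easy to see in a minimal Python sketch; the relative layout below (configs/, checkpoints/) is hypothetical and only stands in for the real COINs loader's paths:

```python
from pathlib import Path

# Hypothetical layout: the loader resolves configs against RESEARCH_ROOT
# but weights against CHECKPOINTS_ROOT.
RESEARCH_ROOT = Path("/app/research")

def experiment_paths(checkpoints_root: Path) -> tuple[Path, Path]:
    config = RESEARCH_ROOT / "configs" / "freebase.yml"
    weights = checkpoints_root / "checkpoints" / "freebase"
    return config, weights

# Old split: configs and weights resolve into two different trees.
cfg, wts = experiment_paths(Path("/app/checkpoints"))
print(cfg.parent.parent == wts.parent.parent)  # False

# Unified roots: every relative lookup lands under one tree.
cfg, wts = experiment_paths(RESEARCH_ROOT)
print(cfg.parent.parent == wts.parent.parent)  # True
```

With the roots unified, any code that resolves siblings of the config tree also finds the downloaded weights.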

entrypoint.sh: switch gunicorn to one preloaded worker with four
threads and an 1800 s timeout. --preload runs ModelRegistry.initialize
in the master before forking, so workers inherit a ready state via
copy-on-write and gunicorn's silent-time timeout doesn't fire
mid-init (the previous --workers 2 --timeout 300 caused worker-boot
OOM and respawn loops). Single worker because ModelRegistry holds
multi-GB of state and _inference_lock serialises inference globally —
a second worker would only ever queue on the same lock.
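
As a toy illustration of the copy-on-write behaviour --preload relies on (a plain os.fork with a dict standing in for ModelRegistry state, not gunicorn itself):

```python
import os

# Expensive state built ONCE in the parent, as --preload does by running
# app initialization in the gunicorn master before forking workers.
registry = {"weights": list(range(1_000_000))}

pid = os.fork()
if pid == 0:
    # Worker process: the ready registry is already visible via
    # copy-on-write pages; no re-initialization cost, and no second
    # copy in RAM until pages are written to.
    os._exit(0 if len(registry["weights"]) == 1_000_000 else 1)

_, status = os.waitpid(pid, 0)
print("worker inherited state:", os.waitstatus_to_exitcode(status) == 0)
```

Without preloading, each forked worker would have to rebuild `registry` itself, which is exactly the repeated multi-minute init the old config suffered.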

entrypoint.sh + docker-compose.yml + .env.example: unset HF_TOKEN if
it's set but empty. Compose's ${HF_TOKEN:-} default produces an
empty string in the container, which huggingface_hub turns into a
malformed 'Bearer ' header and httpx rejects. unset short-circuits
this for public repos.
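
The empty-versus-unset distinction is easy to reproduce without huggingface_hub at all; a sketch (the Bearer-header detail is paraphrased from the failure above, not huggingface_hub's actual code):

```python
import os

# What compose's ${HF_TOKEN:-} default produces: set, but empty.
os.environ["HF_TOKEN"] = ""

# A naive read passes "" through; downstream code building an
# 'Authorization: Bearer <token>' header then emits 'Bearer ' and
# the request is rejected.
assert os.environ.get("HF_TOKEN") == ""

# Two equivalent guards: coerce empty-but-set to absent.
token = os.environ.get("HF_TOKEN") or None   # what the entrypoint's Python does
os.environ.pop("HF_TOKEN", None)             # equivalent of shell `unset HF_TOKEN`
print(token is None, "HF_TOKEN" in os.environ)
```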

docker-compose.yml: add shm_size: "2gb" so PyTorch shared-memory
paths don't blow up under Docker's 64 MB /dev/shm default. Drops
the named checkpoints volume since CHECKPOINTS_ROOT is now inside
the image's research tree (mixing image-baked files with a fresh
volume mount hides bundled configs and Loader caches).
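
On Linux, POSIX shared memory lives on the /dev/shm tmpfs that shm_size resizes. A small multiprocessing.shared_memory example shows the mechanism (size deliberately tiny so it runs anywhere; torch's fallback allocates tensor-sized blocks on the same filesystem):

```python
from multiprocessing import shared_memory

# Creates a file under /dev/shm on Linux. A multi-GB allocation here
# would fail with ENOSPC under Docker's 64 MB default; shm_size: "2gb"
# raises that ceiling.
shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    shm.buf[:5] = b"hello"
    data = bytes(shm.buf[:5])
    print(data)
finally:
    shm.close()
    shm.unlink()
```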

.dockerignore: bundle COINs Loader caches (results/**/*.npz|gz)
back into the image (~10 MB across 35 files). Without them, every
cold start recomputes Leiden community detection and neighbour
sampling for all three datasets — ~15 min of CPU and a peak-RAM
spike that OOMs the worker on tight memory budgets.

Dockerfile: pre-create /app/research with mambauser ownership for
when CHECKPOINTS_ROOT is overridden to a fresh writable volume; pip
install karateclub --no-deps to bypass its bogus numpy<1.23 cap; ENV
HF_HUB_ENABLE_HF_TRANSFER=1 for faster cold-start downloads.

.gitignore: ignore .env, hf upload-large-folder's .huggingface/
state directory and built artefacts (staticfiles/, dist/).

Bani57 casing applied throughout — HF Hub's namespace check is
case-sensitive and the account is registered as "Bani57".

.dockerignore CHANGED
@@ -11,17 +11,24 @@ node_modules
 src/frontend/node_modules
 src/frontend/dist
 
-# Local checkpoints — pulled at runtime from HF Hub instead.
+# Local model weights — pulled at runtime from HF Hub into CHECKPOINTS_ROOT.
 src/research/**/checkpoints/*
 src/research/**/results/**/*.tar
 src/research/**/results/**/*.ckpt
-src/research/**/results/**/*.npz
-src/research/**/results/**/*.gz
 
-# Cached precomputations (regenerated on first boot if needed).
+# Misc precomputations the research code may emit. Excluded by default to
+# keep the image lean.
 src/research/**/*.npz
 src/research/**/*.gz
 
+# Exception: bundle the COINs Loader caches (community / node neighbour
+# samples and graph metric tensors, ~10 MB across 35 files). Without these
+# on disk, every cold start recomputes Leiden community detection and
+# neighbour sampling for all three datasets — ~15 min of CPU and a peak RAM
+# spike that OOMs the worker on tight memory budgets.
+!src/research/**/results/**/*.npz
+!src/research/**/results/**/*.gz
+
 # Local dev artefacts that don't belong in the image.
 docs
 plans
.env.example CHANGED
@@ -1,6 +1,6 @@
 # Local environment for `docker compose up`. Copy to `.env` (which is
 # gitignored) and fill in the values. In production these are configured as
-# Space secrets at https://huggingface.co/spaces/bani57/website/settings.
+# Space secrets at https://huggingface.co/spaces/Bani57/website/settings.
 
 # REQUIRED. Generate with:
 # python -c "import secrets; print(secrets.token_urlsafe(50))"
@@ -9,8 +9,14 @@ DJANGO_SECRET_KEY=replace-me
 # Optional overrides (defaults shown).
 DJANGO_ALLOWED_HOSTS=localhost,127.0.0.1
 CORS_ALLOWED_ORIGINS=http://localhost:7860
-HF_CHECKPOINTS_REPO=bani57/checkpoints
+HF_CHECKPOINTS_REPO=Bani57/checkpoints
 TORCH_DEVICE=cpu
 
-# Only needed if the checkpoint repo is private.
+# Optional: a read token (https://huggingface.co/settings/tokens) lifts
+# anonymous rate limits and roughly triples cold-start download speed.
+# Only required when the checkpoint repo is private.
 # HF_TOKEN=hf_...
+
+# Rust-accelerated transfer is enabled by default in the image. Set to 0 to
+# disable (e.g. when debugging the pure-Python download path).
+# HF_HUB_ENABLE_HF_TRANSFER=1
.gitignore CHANGED
@@ -16,3 +16,6 @@ src/backend/.checkpoint_staging/
 # Docker build artefacts that don't belong in source control.
 src/backend/staticfiles/
 src/backend/dist/
+
+# huggingface_hub upload-large-folder resume-tracking state.
+.huggingface/
Dockerfile CHANGED
@@ -29,6 +29,10 @@ RUN pip install --no-cache-dir \
     --extra-index-url https://download.pytorch.org/whl/cu118 \
     -r /tmp/requirements.txt
 
+# karateclub 1.3.3 falsely caps numpy<1.23; install it with --no-deps so the
+# bogus pin is skipped. Its real runtime deps are covered by requirements.txt.
+RUN pip install --no-cache-dir --no-deps karateclub==1.3.3
+
 # Application code + research repo + built SPA
 WORKDIR /app
 COPY --chown=$MAMBA_USER:$MAMBA_USER src/backend /app/backend
@@ -36,15 +40,24 @@ COPY --chown=$MAMBA_USER:$MAMBA_USER src/research /app/research
 COPY --chown=$MAMBA_USER:$MAMBA_USER --from=frontend /app/dist /app/backend/dist
 COPY --chown=$MAMBA_USER:$MAMBA_USER entrypoint.sh /entrypoint.sh
 
-# Settings need RESEARCH_ROOT and CHECKPOINTS_ROOT to point at the in-image
-# research tree. CHECKPOINTS_ROOT lives on a writable path so snapshot_download
-# can populate it at boot.
+# /app/research is the unified root for code, configs, Loader caches and
+# downloaded weights. Already owned by mambauser via the COPY --chown above
+# — snapshot_download writes new files into it at runtime.
+
+# settings.py derives every research path (configs, code, results, weights)
+# from these two roots, so we unify them: snapshot_download writes new
+# checkpoint files into the research tree alongside the bundled code,
+# configs and Loader caches. Any mambauser-owned file under /app/research
+# is writable at runtime; the read-only image layer holds the originals.
+# Override CHECKPOINTS_ROOT (e.g. to /data/checkpoints on a paid HF Space
+# with persistent storage) to point downloads elsewhere.
 ENV RESEARCH_ROOT=/app/research \
+    CHECKPOINTS_ROOT=/app/research \
     SPA_DIST_DIR=/app/backend/dist \
     DJANGO_SETTINGS_MODULE=research_api.settings \
     PYTHONUNBUFFERED=1 \
-    PORT=7860
+    PORT=7860 \
+    HF_HUB_ENABLE_HF_TRANSFER=1
 
 WORKDIR /app/backend
 RUN python manage.py collectstatic --noinput
docker-compose.yml CHANGED
@@ -9,6 +9,11 @@ services:
     image: bani57-website:local
     ports:
      - "7860:7860"
+    # Larger /dev/shm than Docker's 64 MB default — torch's share_memory()
+    # falls back to it for tensor storage. The registry monkey-patches the
+    # COINs share_memory call to a no-op anyway, but bumping shm_size keeps
+    # any other PyTorch shared-memory code path from blowing up locally.
+    shm_size: "2gb"
     environment:
       DJANGO_DEBUG: "False"
       # Required. Generate locally with:
@@ -17,13 +22,16 @@ services:
       DJANGO_SECRET_KEY: ${DJANGO_SECRET_KEY:?DJANGO_SECRET_KEY must be set in .env or your shell}
       DJANGO_ALLOWED_HOSTS: ${DJANGO_ALLOWED_HOSTS:-localhost,127.0.0.1}
       CORS_ALLOWED_ORIGINS: ${CORS_ALLOWED_ORIGINS:-http://localhost:7860}
-      HF_CHECKPOINTS_REPO: ${HF_CHECKPOINTS_REPO:-bani57/checkpoints}
+      HF_CHECKPOINTS_REPO: ${HF_CHECKPOINTS_REPO:-Bani57/checkpoints}
+      # Only needed for a private checkpoint repo. An empty value is safe —
+      # entrypoint.sh unsets HF_TOKEN when empty so huggingface_hub doesn't
+      # build a malformed 'Bearer ' header.
       HF_TOKEN: ${HF_TOKEN:-}
-      CHECKPOINTS_ROOT: /app/checkpoints
+      # CHECKPOINTS_ROOT defaults to RESEARCH_ROOT (/app/research) in the
+      # image — snapshot_download writes weights alongside the bundled
+      # configs and Loader caches so every research-code path resolves
+      # against a single tree. No named volume because mixing image-baked
+      # files with a fresh volume mount hides the bundled caches and
+      # configs. ~5.4 GB is re-pulled on each `compose up`, which takes
+      # ~3 min with hf_transfer enabled.
       TORCH_DEVICE: ${TORCH_DEVICE:-cpu}
-    volumes:
-      # Persist the downloaded checkpoints between runs so we don't re-pull
-      # ~5.4 GB on every `docker compose up`.
-      - checkpoints:/app/checkpoints
-volumes:
-  checkpoints:
entrypoint.sh CHANGED
@@ -7,10 +7,17 @@
 # file is already present and snapshot_download is a no-op.
 set -euo pipefail
 
-: "${HF_CHECKPOINTS_REPO:=bani57/checkpoints}"
+: "${HF_CHECKPOINTS_REPO:=Bani57/checkpoints}"
 : "${CHECKPOINTS_ROOT:=/app/checkpoints}"
 : "${PORT:=7860}"
 
+# An empty HF_TOKEN tricks huggingface_hub into emitting a malformed
+# 'Bearer ' auth header (httpx rejects it). For public repos no token is
+# needed; drop the variable entirely if it's set but empty.
+if [ -z "${HF_TOKEN:-}" ]; then
+  unset HF_TOKEN
+fi
+
 mkdir -p "${CHECKPOINTS_ROOT}"
 
 echo "[entrypoint] Pre-warming checkpoints from ${HF_CHECKPOINTS_REPO} -> ${CHECKPOINTS_ROOT}"
@@ -18,22 +25,36 @@ python - <<'PY'
 import os
 from huggingface_hub import snapshot_download
 
+token = os.environ.get("HF_TOKEN") or None
 snapshot_download(
     repo_id=os.environ["HF_CHECKPOINTS_REPO"],
     repo_type="model",
     local_dir=os.environ["CHECKPOINTS_ROOT"],
-    local_dir_use_symlinks=False,
     max_workers=4,
-    token=os.environ.get("HF_TOKEN"),
+    token=token,
 )
 print("[entrypoint] checkpoints ready")
 PY
 
 echo "[entrypoint] starting gunicorn on 0.0.0.0:${PORT}"
+# Single worker, multiple threads, preloaded app:
+#  * --preload runs Django setup (and ModelRegistry.initialize → COINs
+#    Loaders + sample subgraphs) ONCE in the gunicorn master before
+#    forking. Without it, every worker boot goes through the same
+#    10–15-minute graph-metric computation and gunicorn's silent-time
+#    timeout (which fires during boot) kills the worker mid-init.
+#  * Single worker because the ModelRegistry holds multi-GB of state and
+#    ModelRegistry._inference_lock serializes inference globally anyway —
+#    a second worker would duplicate the memory without adding throughput.
+#  * Long --timeout protects the first inference request after a cold
+#    start, when lazy model loading + diffusion sampling can take minutes
+#    on free-tier CPU.
 exec gunicorn research_api.wsgi:application \
     --bind "0.0.0.0:${PORT}" \
-    --workers 2 \
-    --threads 2 \
-    --timeout 300 \
+    --workers 1 \
+    --threads 4 \
+    --preload \
+    --timeout 1800 \
+    --graceful-timeout 30 \
     --access-logfile - \
     --error-logfile -
src/backend/research_api/settings.py CHANGED
@@ -3,13 +3,18 @@ import sys
 from pathlib import Path
 
 BASE_DIR = Path(__file__).resolve().parent.parent
-PROJECT_ROOT = BASE_DIR.parent.parent  # Website root
+PROJECT_ROOT = BASE_DIR.parent.parent  # Website root in dev; ignored in the container
+
+# Research code root. In the deployment image this is /app/research; in dev
+# it defaults to <repo>/src/research. Defined early because the sys.path
+# additions below depend on it.
+RESEARCH_ROOT = Path(os.environ.get("RESEARCH_ROOT", PROJECT_ROOT / "src" / "research"))
 
 # Add research repos to sys.path so their modules can be imported
-_COINS_KG_ROOT = str(PROJECT_ROOT / "src" / "research" / "COINs-KGGeneration")
-_DIGRESS_KG_SRC = str(PROJECT_ROOT / "src" / "research" / "COINs-KGGeneration" / "graph_generation" / "src")
-_MULTIPROXAN_ROOT = str(PROJECT_ROOT / "src" / "research" / "MultiProxAn")
-_MULTIPROXAN_SRC = str(PROJECT_ROOT / "src" / "research" / "MultiProxAn" / "src")
+_COINS_KG_ROOT = str(RESEARCH_ROOT / "COINs-KGGeneration")
+_DIGRESS_KG_SRC = str(RESEARCH_ROOT / "COINs-KGGeneration" / "graph_generation" / "src")
+_MULTIPROXAN_ROOT = str(RESEARCH_ROOT / "MultiProxAn")
+_MULTIPROXAN_SRC = str(RESEARCH_ROOT / "MultiProxAn" / "src")
 for _path in (_COINS_KG_ROOT, _DIGRESS_KG_SRC, _MULTIPROXAN_ROOT, _MULTIPROXAN_SRC):
     if _path not in sys.path:
         sys.path.insert(0, _path)
@@ -72,10 +77,9 @@ if not DEBUG:
     SESSION_COOKIE_SECURE = True
     CSRF_TRUSTED_ORIGINS = [o for o in CORS_ALLOWED_ORIGINS if o.startswith("https://")]
 
-# Research code root. Inside the container the checkpoints live alongside the
-# research code under /app/research; in dev they live in the repo at
-# src/research/. CHECKPOINTS_ROOT is what huggingface_hub will populate.
-RESEARCH_ROOT = Path(os.environ.get("RESEARCH_ROOT", PROJECT_ROOT / "src" / "research"))
+# Checkpoints land here at runtime via huggingface_hub.snapshot_download.
+# Defaults to RESEARCH_ROOT (defined above) so weights sit alongside the
+# research code; override CHECKPOINTS_ROOT (e.g. /data on a paid Space).
 CHECKPOINTS_ROOT = Path(os.environ.get("CHECKPOINTS_ROOT", RESEARCH_ROOT))
 
 # Research code paths