Andrej Janchevski committed
Commit 5375c2e · 1 Parent(s): ce3d8f2

fix(deploy): harden container runtime and unify research / checkpoint roots

Multiple production-launch fixes that fall under "make the container
actually start and stay up reliably".

settings.py + Dockerfile: unify CHECKPOINTS_ROOT with RESEARCH_ROOT
(/app/research). The earlier split (CHECKPOINTS_ROOT=/app/checkpoints
on a named volume) caused a path mismatch — configs and Loader caches
live with the bundled research code under RESEARCH_ROOT, while
HF-Hub-downloaded weights landed in CHECKPOINTS_ROOT, breaking the
COINs experiment loader's lookups (e.g. configs/freebase.yml).
snapshot_download now writes alongside the bundled tree; on paid HF
Spaces with persistent storage, override CHECKPOINTS_ROOT to /data.
Also moves the RESEARCH_ROOT definition above the sys.path additions
so the in-image /app/research path is honoured.
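
The mismatch is easy to see in a minimal Python sketch; the relative layout below (configs/, checkpoints/) is hypothetical and only stands in for the real COINs loader's paths:

```python
from pathlib import Path

# Hypothetical layout: the loader resolves configs against RESEARCH_ROOT
# but weights against CHECKPOINTS_ROOT.
RESEARCH_ROOT = Path("/app/research")

def experiment_paths(checkpoints_root: Path) -> tuple[Path, Path]:
    config = RESEARCH_ROOT / "configs" / "freebase.yml"
    weights = checkpoints_root / "checkpoints" / "freebase"
    return config, weights

# Old split: configs and weights resolve into two different trees.
cfg, wts = experiment_paths(Path("/app/checkpoints"))
print(cfg.parent.parent == wts.parent.parent)  # False

# Unified roots: every relative lookup lands under one tree.
cfg, wts = experiment_paths(RESEARCH_ROOT)
print(cfg.parent.parent == wts.parent.parent)  # True
```

With the roots unified, any code that resolves siblings of the config tree also finds the downloaded weights.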

entrypoint.sh: switch gunicorn to one preloaded worker with four
threads and an 1800 s timeout. --preload runs ModelRegistry.initialize
in the master before forking, so workers inherit a ready state via
copy-on-write and gunicorn's silent-time timeout doesn't fire
mid-init (the previous --workers 2 --timeout 300 caused worker-boot
OOM and respawn loops). Single worker because ModelRegistry holds
multi-GB of state and _inference_lock serialises inference globally —
a second worker would only ever queue on the same lock.
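
As a toy illustration of the copy-on-write behaviour --preload relies on (a plain os.fork with a dict standing in for ModelRegistry state, not gunicorn itself):

```python
import os

# Expensive state built ONCE in the parent, as --preload does by running
# app initialization in the gunicorn master before forking workers.
registry = {"weights": list(range(1_000_000))}

pid = os.fork()
if pid == 0:
    # Worker process: the ready registry is already visible via
    # copy-on-write pages; no re-initialization cost, and no second
    # copy in RAM until pages are written to.
    os._exit(0 if len(registry["weights"]) == 1_000_000 else 1)

_, status = os.waitpid(pid, 0)
print("worker inherited state:", os.waitstatus_to_exitcode(status) == 0)
```

Without preloading, each forked worker would have to rebuild `registry` itself, which is exactly the repeated multi-minute init the old config suffered.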

entrypoint.sh + docker-compose.yml + .env.example: unset HF_TOKEN if
it's set but empty. Compose's ${HF_TOKEN:-} default produces an
empty string in the container, which huggingface_hub turns into a
malformed 'Bearer ' header and httpx rejects. unset short-circuits
this for public repos.
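
The empty-versus-unset distinction is easy to reproduce without huggingface_hub at all; a sketch (the Bearer-header detail is paraphrased from the failure above, not huggingface_hub's actual code):

```python
import os

# What compose's ${HF_TOKEN:-} default produces: set, but empty.
os.environ["HF_TOKEN"] = ""

# A naive read passes "" through; downstream code building an
# 'Authorization: Bearer <token>' header then emits 'Bearer ' and
# the request is rejected.
assert os.environ.get("HF_TOKEN") == ""

# Two equivalent guards: coerce empty-but-set to absent.
token = os.environ.get("HF_TOKEN") or None   # what the entrypoint's Python does
os.environ.pop("HF_TOKEN", None)             # equivalent of shell `unset HF_TOKEN`
print(token is None, "HF_TOKEN" in os.environ)
```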

docker-compose.yml: add shm_size: "2gb" so PyTorch shared-memory
paths don't blow up under Docker's 64 MB /dev/shm default. Drops
the named checkpoints volume since CHECKPOINTS_ROOT is now inside
the image's research tree (mixing image-baked files with a fresh
volume mount hides bundled configs and Loader caches).
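
On Linux, POSIX shared memory lives on the /dev/shm tmpfs that shm_size resizes. A small multiprocessing.shared_memory example shows the mechanism (size deliberately tiny so it runs anywhere; torch's fallback allocates tensor-sized blocks on the same filesystem):

```python
from multiprocessing import shared_memory

# Creates a file under /dev/shm on Linux. A multi-GB allocation here
# would fail with ENOSPC under Docker's 64 MB default; shm_size: "2gb"
# raises that ceiling.
shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    shm.buf[:5] = b"hello"
    data = bytes(shm.buf[:5])
    print(data)
finally:
    shm.close()
    shm.unlink()
```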

.dockerignore: bundle COINs Loader caches (results/**/*.npz|gz)
back into the image (~10 MB across 35 files). Without them, every
cold start recomputes Leiden community detection and neighbour
sampling for all three datasets — ~15 min of CPU and a peak-RAM
spike that OOMs the worker on tight memory budgets.

Dockerfile: pre-create /app/research with mambauser ownership for
when CHECKPOINTS_ROOT is overridden to a fresh writable volume; pip
install karateclub --no-deps to bypass its bogus numpy<1.23 cap; ENV
HF_HUB_ENABLE_HF_TRANSFER=1 for faster cold-start downloads.

.gitignore: ignore .env, hf upload-large-folder's .huggingface/
state directory and built artefacts (staticfiles/, dist/).

Bani57 casing applied throughout — HF Hub's namespace check is
case-sensitive and the account is registered as "Bani57".

.dockerignore CHANGED
@@ -11,17 +11,24 @@ node_modules
 src/frontend/node_modules
 src/frontend/dist
 
-# Local checkpoints — pulled at runtime from HF Hub instead.
+# Local model weights — pulled at runtime from HF Hub into CHECKPOINTS_ROOT.
 src/research/**/checkpoints/*
 src/research/**/results/**/*.tar
 src/research/**/results/**/*.ckpt
-src/research/**/results/**/*.npz
-src/research/**/results/**/*.gz
 
-# Cached precomputations (regenerated on first boot if needed).
+# Misc precomputations the research code may emit. Excluded by default to
+# keep the image lean.
 src/research/**/*.npz
 src/research/**/*.gz
 
+# Exception: bundle the COINs Loader caches (community / node neighbour
+# samples and graph metric tensors, ~10 MB across 35 files). Without these
+# on disk, every cold start recomputes Leiden community detection and
+# neighbour sampling for all three datasets — ~15 min of CPU and a peak RAM
+# spike that OOMs the worker on tight memory budgets.
+!src/research/**/results/**/*.npz
+!src/research/**/results/**/*.gz
+
 # Local dev artefacts that don't belong in the image.
 docs
 plans
.env.example CHANGED
@@ -1,6 +1,6 @@
 # Local environment for `docker compose up`. Copy to `.env` (which is
 # gitignored) and fill in the values. In production these are configured as
-# Space secrets at https://huggingface.co/spaces/bani57/website/settings.
+# Space secrets at https://huggingface.co/spaces/Bani57/website/settings.
 
 # REQUIRED. Generate with:
 # python -c "import secrets; print(secrets.token_urlsafe(50))"
@@ -9,8 +9,14 @@ DJANGO_SECRET_KEY=replace-me
 # Optional overrides (defaults shown).
 DJANGO_ALLOWED_HOSTS=localhost,127.0.0.1
 CORS_ALLOWED_ORIGINS=http://localhost:7860
-HF_CHECKPOINTS_REPO=bani57/checkpoints
+HF_CHECKPOINTS_REPO=Bani57/checkpoints
 TORCH_DEVICE=cpu
 
-# Only needed if the checkpoint repo is private.
+# Optional: a read token (https://huggingface.co/settings/tokens) lifts
+# anonymous rate limits and roughly triples cold-start download speed.
+# Only required when the checkpoint repo is private.
 # HF_TOKEN=hf_...
+
+# Rust-accelerated transfer is enabled by default in the image. Set to 0 to
+# disable (e.g. when debugging the pure-Python download path).
+# HF_HUB_ENABLE_HF_TRANSFER=1
.gitignore CHANGED
@@ -16,3 +16,6 @@ src/backend/.checkpoint_staging/
 # Docker build artefacts that don't belong in source control.
 src/backend/staticfiles/
 src/backend/dist/
+
+# huggingface_hub upload-large-folder resume-tracking state.
+.huggingface/
Dockerfile CHANGED
@@ -29,6 +29,10 @@ RUN pip install --no-cache-dir \
     --extra-index-url https://download.pytorch.org/whl/cu118 \
     -r /tmp/requirements.txt
 
+# karateclub 1.3.3 falsely caps numpy<1.23; install it with --no-deps so the
+# bogus pin is skipped. Its real runtime deps are covered by requirements.txt.
+RUN pip install --no-cache-dir --no-deps karateclub==1.3.3
+
 # Application code + research repo + built SPA
 WORKDIR /app
 COPY --chown=$MAMBA_USER:$MAMBA_USER src/backend /app/backend
@@ -36,15 +40,24 @@ COPY --chown=$MAMBA_USER:$MAMBA_USER src/research /app/research
 COPY --chown=$MAMBA_USER:$MAMBA_USER --from=frontend /app/dist /app/backend/dist
 COPY --chown=$MAMBA_USER:$MAMBA_USER entrypoint.sh /entrypoint.sh
 
-# Settings need RESEARCH_ROOT and CHECKPOINTS_ROOT to point at the in-image
-# research tree. CHECKPOINTS_ROOT lives on a writable path so snapshot_download
-# can populate it at boot.
+# /app/research is the unified root for code, configs, Loader caches and
+# downloaded weights. Already owned by mambauser via the COPY --chown above
+# — snapshot_download writes new files into it at runtime.
+
+# settings.py derives every research path (configs, code, results, weights)
+# from these two roots, so we unify them: snapshot_download writes new
+# checkpoint files into the research tree alongside the bundled code,
+# configs and Loader caches. Any mambauser-owned file under /app/research
+# is writable at runtime; the read-only image layer holds the originals.
+# Override CHECKPOINTS_ROOT (e.g. to /data/checkpoints on a paid HF Space
+# with persistent storage) to point downloads elsewhere.
 ENV RESEARCH_ROOT=/app/research \
+    CHECKPOINTS_ROOT=/app/research \
     SPA_DIST_DIR=/app/backend/dist \
     DJANGO_SETTINGS_MODULE=research_api.settings \
     PYTHONUNBUFFERED=1 \
-    PORT=7860
+    PORT=7860 \
+    HF_HUB_ENABLE_HF_TRANSFER=1
 
 WORKDIR /app/backend
 RUN python manage.py collectstatic --noinput
docker-compose.yml CHANGED
@@ -9,6 +9,11 @@ services:
     image: bani57-website:local
     ports:
      - "7860:7860"
+    # Larger /dev/shm than Docker's 64 MB default — torch's share_memory()
+    # falls back to it for tensor storage. The registry monkey-patches the
+    # COINs share_memory call to a no-op anyway, but bumping shm_size keeps
+    # any other PyTorch shared-memory code path from blowing up locally.
+    shm_size: "2gb"
     environment:
       DJANGO_DEBUG: "False"
       # Required. Generate locally with:
@@ -17,13 +22,16 @@ services:
       DJANGO_SECRET_KEY: ${DJANGO_SECRET_KEY:?DJANGO_SECRET_KEY must be set in .env or your shell}
       DJANGO_ALLOWED_HOSTS: ${DJANGO_ALLOWED_HOSTS:-localhost,127.0.0.1}
       CORS_ALLOWED_ORIGINS: ${CORS_ALLOWED_ORIGINS:-http://localhost:7860}
-      HF_CHECKPOINTS_REPO: ${HF_CHECKPOINTS_REPO:-bani57/checkpoints}
+      HF_CHECKPOINTS_REPO: ${HF_CHECKPOINTS_REPO:-Bani57/checkpoints}
+      # Only needed for a private checkpoint repo. An empty value is safe —
+      # entrypoint.sh unsets HF_TOKEN when empty so huggingface_hub doesn't
+      # build a malformed 'Bearer ' header.
       HF_TOKEN: ${HF_TOKEN:-}
-      CHECKPOINTS_ROOT: /app/checkpoints
+      # CHECKPOINTS_ROOT defaults to RESEARCH_ROOT (/app/research) in the
+      # image — snapshot_download writes weights alongside the bundled
+      # configs and Loader caches so every research-code path resolves
+      # against a single tree. No named volume because mixing image-baked
+      # files with a fresh volume mount hides the bundled caches and
+      # configs. ~5.4 GB is re-pulled on each `compose up`, which takes
+      # ~3 min with hf_transfer enabled.
       TORCH_DEVICE: ${TORCH_DEVICE:-cpu}
-    volumes:
-      # Persist the downloaded checkpoints between runs so we don't re-pull
-      # ~5.4 GB on every `docker compose up`.
-      - checkpoints:/app/checkpoints
-volumes:
-  checkpoints:
entrypoint.sh CHANGED
@@ -7,10 +7,17 @@
 # file is already present and snapshot_download is a no-op.
 set -euo pipefail
 
-: "${HF_CHECKPOINTS_REPO:=bani57/checkpoints}"
+: "${HF_CHECKPOINTS_REPO:=Bani57/checkpoints}"
 : "${CHECKPOINTS_ROOT:=/app/checkpoints}"
 : "${PORT:=7860}"
 
+# An empty HF_TOKEN tricks huggingface_hub into emitting a malformed
+# 'Bearer ' auth header (httpx rejects it). For public repos no token is
+# needed; drop the variable entirely if it's set but empty.
+if [ -z "${HF_TOKEN:-}" ]; then
+  unset HF_TOKEN
+fi
+
 mkdir -p "${CHECKPOINTS_ROOT}"
 
 echo "[entrypoint] Pre-warming checkpoints from ${HF_CHECKPOINTS_REPO} -> ${CHECKPOINTS_ROOT}"
@@ -18,22 +25,36 @@ python - <<'PY'
 import os
 from huggingface_hub import snapshot_download
 
+token = os.environ.get("HF_TOKEN") or None
 snapshot_download(
     repo_id=os.environ["HF_CHECKPOINTS_REPO"],
     repo_type="model",
     local_dir=os.environ["CHECKPOINTS_ROOT"],
-    local_dir_use_symlinks=False,
     max_workers=4,
-    token=os.environ.get("HF_TOKEN"),
+    token=token,
 )
 print("[entrypoint] checkpoints ready")
 PY
 
 echo "[entrypoint] starting gunicorn on 0.0.0.0:${PORT}"
+# Single worker, multiple threads, preloaded app:
+#  * --preload runs Django setup (and ModelRegistry.initialize → COINs
+#    Loaders + sample subgraphs) ONCE in the gunicorn master before
+#    forking. Without it, every worker boot goes through the same
+#    10–15-minute graph-metric computation and gunicorn's silent-time
+#    timeout (which fires during boot) kills the worker mid-init.
+#  * Single worker because the ModelRegistry holds multi-GB of state and
+#    ModelRegistry._inference_lock serializes inference globally anyway —
+#    a second worker would duplicate the memory without adding throughput.
+#  * Long --timeout protects the first inference request after a cold
+#    start, when lazy model loading + diffusion sampling can take minutes
+#    on free-tier CPU.
 exec gunicorn research_api.wsgi:application \
     --bind "0.0.0.0:${PORT}" \
-    --workers 2 \
-    --threads 2 \
-    --timeout 300 \
+    --workers 1 \
+    --threads 4 \
+    --preload \
+    --timeout 1800 \
+    --graceful-timeout 30 \
     --access-logfile - \
     --error-logfile -
src/backend/research_api/settings.py CHANGED
@@ -3,13 +3,18 @@ import sys
 from pathlib import Path
 
 BASE_DIR = Path(__file__).resolve().parent.parent
-PROJECT_ROOT = BASE_DIR.parent.parent  # Website root
+PROJECT_ROOT = BASE_DIR.parent.parent  # Website root in dev; ignored in the container
+
+# Research code root. In the deployment image this is /app/research; in dev
+# it defaults to <repo>/src/research. Defined early because the sys.path
+# additions below depend on it.
+RESEARCH_ROOT = Path(os.environ.get("RESEARCH_ROOT", PROJECT_ROOT / "src" / "research"))
 
 # Add research repos to sys.path so their modules can be imported
-_COINS_KG_ROOT = str(PROJECT_ROOT / "src" / "research" / "COINs-KGGeneration")
-_DIGRESS_KG_SRC = str(PROJECT_ROOT / "src" / "research" / "COINs-KGGeneration" / "graph_generation" / "src")
-_MULTIPROXAN_ROOT = str(PROJECT_ROOT / "src" / "research" / "MultiProxAn")
-_MULTIPROXAN_SRC = str(PROJECT_ROOT / "src" / "research" / "MultiProxAn" / "src")
+_COINS_KG_ROOT = str(RESEARCH_ROOT / "COINs-KGGeneration")
+_DIGRESS_KG_SRC = str(RESEARCH_ROOT / "COINs-KGGeneration" / "graph_generation" / "src")
+_MULTIPROXAN_ROOT = str(RESEARCH_ROOT / "MultiProxAn")
+_MULTIPROXAN_SRC = str(RESEARCH_ROOT / "MultiProxAn" / "src")
 for _path in (_COINS_KG_ROOT, _DIGRESS_KG_SRC, _MULTIPROXAN_ROOT, _MULTIPROXAN_SRC):
     if _path not in sys.path:
         sys.path.insert(0, _path)
@@ -72,10 +77,9 @@ if not DEBUG:
     SESSION_COOKIE_SECURE = True
     CSRF_TRUSTED_ORIGINS = [o for o in CORS_ALLOWED_ORIGINS if o.startswith("https://")]
 
-# Research code root. Inside the container the checkpoints live alongside the
-# research code under /app/research; in dev they live in the repo at
-# src/research/. CHECKPOINTS_ROOT is what huggingface_hub will populate.
-RESEARCH_ROOT = Path(os.environ.get("RESEARCH_ROOT", PROJECT_ROOT / "src" / "research"))
+# Checkpoints land here at runtime via huggingface_hub.snapshot_download.
+# Defaults to RESEARCH_ROOT (defined above) so weights sit alongside the
+# research code; override CHECKPOINTS_ROOT (e.g. /data on a paid Space).
 CHECKPOINTS_ROOT = Path(os.environ.get("CHECKPOINTS_ROOT", RESEARCH_ROOT))
 
 # Research code paths