fix(deploy): harden HF Space build — git, MLflow off, seed artifacts early
The previous build's truncated log showed pip install succeeded but
something later in the heavy pipeline-train RUN block aborted. Three
defensive changes so the next build either succeeds or fails with a
clearer diagnosis:
- apt-get install git: silences the MLflow 'Bad git executable' warning
and lets MLflow tag runs with a proper SHA when needed.
- NEUROBRIDGE_DISABLE_MLFLOW=1 prefixed on every build-time pipeline
invocation: avoids MLflow run-tagging fragility in the slim image
during build. The runtime entrypoint can re-enable MLflow if desired.
- Move 'python scripts/seed_demo_artifacts.py' BEFORE the pipeline
train block so the core showcase paths (MRI 2D / MRI ONNX / EEG joblib
/ clinical RAG / axial PNG) are guaranteed to land even if the BBB
classifier train or MRI ComBat pipeline trips. The seed step is also
re-run after RAG ingest (idempotent — only fills missing artifacts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Dockerfile +16 -8
- Dockerfile.hf +16 -8
|
@@ -13,6 +13,7 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
|
|
| 13 |
# --- system deps for RDKit, nibabel, MNE ---
|
| 14 |
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 15 |
build-essential \
|
|
|
|
| 16 |
libgomp1 \
|
| 17 |
libxrender1 \
|
| 18 |
libsm6 \
|
|
@@ -40,17 +41,26 @@ COPY supervisord.conf ./supervisord.conf
|
|
| 40 |
COPY docker-entrypoint.sh ./docker-entrypoint.sh
|
| 41 |
RUN chmod +x /app/docker-entrypoint.sh
|
| 42 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
# Seed raw data from fixtures so the deployed Signal/Image/Molecule tabs
|
| 44 |
# work on first click. Then run all three pipelines so mlruns/ contains
|
| 45 |
# one run per modality — feeds /experiments/runs and the BBB provenance
|
| 46 |
# strip. data/raw/* is gitignored locally so we cannot COPY it.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 47 |
RUN mkdir -p data/raw data/processed && \
|
| 48 |
cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv && \
|
| 49 |
cp tests/fixtures/eeg_sample.fif data/raw/eeg.fif && \
|
| 50 |
-
python -m src.pipelines.bbb_pipeline && \
|
| 51 |
-
python -m src.models.bbb_model && \
|
| 52 |
-
python -c "from pathlib import Path; from src.pipelines.eeg_pipeline import run_pipeline; run_pipeline(input_path=Path('tests/fixtures/eeg_sample.fif'), output_path=Path('data/processed/eeg_features.parquet'))" && \
|
| 53 |
-
python -c "from pathlib import Path; from src.pipelines.mri_pipeline import run_pipeline; run_pipeline(input_dir=Path('tests/fixtures/mri_sample'), sites_csv=Path('tests/fixtures/mri_sample/sites.csv'), output_path=Path('data/processed/mri_features.parquet'))"
|
| 54 |
|
| 55 |
# --- RAG knowledge base ingest ---
|
| 56 |
# Build the FAISS index from any seed docs in tests/fixtures/kb_sample/
|
|
@@ -60,10 +70,8 @@ RUN mkdir -p data/raw data/processed && \
|
|
| 60 |
COPY tests/fixtures/kb_sample/ ./data/knowledge_base/seed/
|
| 61 |
RUN python -m src.rag.ingest data/knowledge_base data/processed/faiss_index
|
| 62 |
|
| 63 |
-
# ---
|
| 64 |
-
#
|
| 65 |
-
# entrypoint also re-runs it on container start so a mounted-volume
|
| 66 |
-
# deployment can re-seed without a rebuild.
|
| 67 |
RUN python scripts/seed_demo_artifacts.py
|
| 68 |
|
| 69 |
# --- HF Spaces convention ---
|
|
|
|
| 13 |
# --- system deps for RDKit, nibabel, MNE ---
|
| 14 |
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 15 |
build-essential \
|
| 16 |
+
git \
|
| 17 |
libgomp1 \
|
| 18 |
libxrender1 \
|
| 19 |
libsm6 \
|
|
|
|
| 41 |
COPY docker-entrypoint.sh ./docker-entrypoint.sh
|
| 42 |
RUN chmod +x /app/docker-entrypoint.sh
|
| 43 |
|
| 44 |
+
# Seed demo artifacts FIRST so even if a heavier pipeline step fails, the
|
| 45 |
+
# core showcase paths (MRI 2D, MRI volumetric ONNX, EEG joblib, clinical
|
| 46 |
+
# RAG, axial PNG) still work. seed_demo_artifacts.py is idempotent.
|
| 47 |
+
RUN python scripts/seed_demo_artifacts.py
|
| 48 |
+
|
| 49 |
# Seed raw data from fixtures so the deployed Signal/Image/Molecule tabs
|
| 50 |
# work on first click. Then run all three pipelines so mlruns/ contains
|
| 51 |
# one run per modality — feeds /experiments/runs and the BBB provenance
|
| 52 |
# strip. data/raw/* is gitignored locally so we cannot COPY it.
|
| 53 |
+
#
|
| 54 |
+
# NEUROBRIDGE_DISABLE_MLFLOW=1 during build avoids MLflow run-tagging
|
| 55 |
+
# fragility in the slim image (no real .git tree to tag against). The
|
| 56 |
+
# entrypoint can re-run with MLflow on if desired.
|
| 57 |
RUN mkdir -p data/raw data/processed && \
|
| 58 |
cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv && \
|
| 59 |
cp tests/fixtures/eeg_sample.fif data/raw/eeg.fif && \
|
| 60 |
+
NEUROBRIDGE_DISABLE_MLFLOW=1 python -m src.pipelines.bbb_pipeline && \
|
| 61 |
+
NEUROBRIDGE_DISABLE_MLFLOW=1 python -m src.models.bbb_model && \
|
| 62 |
+
NEUROBRIDGE_DISABLE_MLFLOW=1 python -c "from pathlib import Path; from src.pipelines.eeg_pipeline import run_pipeline; run_pipeline(input_path=Path('tests/fixtures/eeg_sample.fif'), output_path=Path('data/processed/eeg_features.parquet'))" && \
|
| 63 |
+
NEUROBRIDGE_DISABLE_MLFLOW=1 python -c "from pathlib import Path; from src.pipelines.mri_pipeline import run_pipeline; run_pipeline(input_dir=Path('tests/fixtures/mri_sample'), sites_csv=Path('tests/fixtures/mri_sample/sites.csv'), output_path=Path('data/processed/mri_features.parquet'))"
|
| 64 |
|
| 65 |
# --- RAG knowledge base ingest ---
|
| 66 |
# Build the FAISS index from any seed docs in tests/fixtures/kb_sample/
|
|
|
|
| 70 |
COPY tests/fixtures/kb_sample/ ./data/knowledge_base/seed/
|
| 71 |
RUN python -m src.rag.ingest data/knowledge_base data/processed/faiss_index
|
| 72 |
|
| 73 |
+
# --- Re-run demo-artifact seeding after RAG ingest in case any step above
|
| 74 |
+
# altered what's on disk. Idempotent — only fills missing artifacts.
|
|
|
|
|
|
|
| 75 |
RUN python scripts/seed_demo_artifacts.py
|
| 76 |
|
| 77 |
# --- HF Spaces convention ---
|
|
@@ -13,6 +13,7 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
|
|
| 13 |
# --- system deps for RDKit, nibabel, MNE ---
|
| 14 |
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 15 |
build-essential \
|
|
|
|
| 16 |
libgomp1 \
|
| 17 |
libxrender1 \
|
| 18 |
libsm6 \
|
|
@@ -40,17 +41,26 @@ COPY supervisord.conf ./supervisord.conf
|
|
| 40 |
COPY docker-entrypoint.sh ./docker-entrypoint.sh
|
| 41 |
RUN chmod +x /app/docker-entrypoint.sh
|
| 42 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
# Seed raw data from fixtures so the deployed Signal/Image/Molecule tabs
|
| 44 |
# work on first click. Then run all three pipelines so mlruns/ contains
|
| 45 |
# one run per modality — feeds /experiments/runs and the BBB provenance
|
| 46 |
# strip. data/raw/* is gitignored locally so we cannot COPY it.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 47 |
RUN mkdir -p data/raw data/processed && \
|
| 48 |
cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv && \
|
| 49 |
cp tests/fixtures/eeg_sample.fif data/raw/eeg.fif && \
|
| 50 |
-
python -m src.pipelines.bbb_pipeline && \
|
| 51 |
-
python -m src.models.bbb_model && \
|
| 52 |
-
python -c "from pathlib import Path; from src.pipelines.eeg_pipeline import run_pipeline; run_pipeline(input_path=Path('tests/fixtures/eeg_sample.fif'), output_path=Path('data/processed/eeg_features.parquet'))" && \
|
| 53 |
-
python -c "from pathlib import Path; from src.pipelines.mri_pipeline import run_pipeline; run_pipeline(input_dir=Path('tests/fixtures/mri_sample'), sites_csv=Path('tests/fixtures/mri_sample/sites.csv'), output_path=Path('data/processed/mri_features.parquet'))"
|
| 54 |
|
| 55 |
# --- RAG knowledge base ingest ---
|
| 56 |
# Build the FAISS index from any seed docs in tests/fixtures/kb_sample/
|
|
@@ -60,10 +70,8 @@ RUN mkdir -p data/raw data/processed && \
|
|
| 60 |
COPY tests/fixtures/kb_sample/ ./data/knowledge_base/seed/
|
| 61 |
RUN python -m src.rag.ingest data/knowledge_base data/processed/faiss_index
|
| 62 |
|
| 63 |
-
# ---
|
| 64 |
-
#
|
| 65 |
-
# entrypoint also re-runs it on container start so a mounted-volume
|
| 66 |
-
# deployment can re-seed without a rebuild.
|
| 67 |
RUN python scripts/seed_demo_artifacts.py
|
| 68 |
|
| 69 |
# --- HF Spaces convention ---
|
|
|
|
| 13 |
# --- system deps for RDKit, nibabel, MNE ---
|
| 14 |
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 15 |
build-essential \
|
| 16 |
+
git \
|
| 17 |
libgomp1 \
|
| 18 |
libxrender1 \
|
| 19 |
libsm6 \
|
|
|
|
| 41 |
COPY docker-entrypoint.sh ./docker-entrypoint.sh
|
| 42 |
RUN chmod +x /app/docker-entrypoint.sh
|
| 43 |
|
| 44 |
+
# Seed demo artifacts FIRST so even if a heavier pipeline step fails, the
|
| 45 |
+
# core showcase paths (MRI 2D, MRI volumetric ONNX, EEG joblib, clinical
|
| 46 |
+
# RAG, axial PNG) still work. seed_demo_artifacts.py is idempotent.
|
| 47 |
+
RUN python scripts/seed_demo_artifacts.py
|
| 48 |
+
|
| 49 |
# Seed raw data from fixtures so the deployed Signal/Image/Molecule tabs
|
| 50 |
# work on first click. Then run all three pipelines so mlruns/ contains
|
| 51 |
# one run per modality — feeds /experiments/runs and the BBB provenance
|
| 52 |
# strip. data/raw/* is gitignored locally so we cannot COPY it.
|
| 53 |
+
#
|
| 54 |
+
# NEUROBRIDGE_DISABLE_MLFLOW=1 during build avoids MLflow run-tagging
|
| 55 |
+
# fragility in the slim image (no real .git tree to tag against). The
|
| 56 |
+
# entrypoint can re-run with MLflow on if desired.
|
| 57 |
RUN mkdir -p data/raw data/processed && \
|
| 58 |
cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv && \
|
| 59 |
cp tests/fixtures/eeg_sample.fif data/raw/eeg.fif && \
|
| 60 |
+
NEUROBRIDGE_DISABLE_MLFLOW=1 python -m src.pipelines.bbb_pipeline && \
|
| 61 |
+
NEUROBRIDGE_DISABLE_MLFLOW=1 python -m src.models.bbb_model && \
|
| 62 |
+
NEUROBRIDGE_DISABLE_MLFLOW=1 python -c "from pathlib import Path; from src.pipelines.eeg_pipeline import run_pipeline; run_pipeline(input_path=Path('tests/fixtures/eeg_sample.fif'), output_path=Path('data/processed/eeg_features.parquet'))" && \
|
| 63 |
+
NEUROBRIDGE_DISABLE_MLFLOW=1 python -c "from pathlib import Path; from src.pipelines.mri_pipeline import run_pipeline; run_pipeline(input_dir=Path('tests/fixtures/mri_sample'), sites_csv=Path('tests/fixtures/mri_sample/sites.csv'), output_path=Path('data/processed/mri_features.parquet'))"
|
| 64 |
|
| 65 |
# --- RAG knowledge base ingest ---
|
| 66 |
# Build the FAISS index from any seed docs in tests/fixtures/kb_sample/
|
|
|
|
| 70 |
COPY tests/fixtures/kb_sample/ ./data/knowledge_base/seed/
|
| 71 |
RUN python -m src.rag.ingest data/knowledge_base data/processed/faiss_index
|
| 72 |
|
| 73 |
+
# --- Re-run demo-artifact seeding after RAG ingest in case any step above
|
| 74 |
+
# altered what's on disk. Idempotent — only fills missing artifacts.
|
|
|
|
|
|
|
| 75 |
RUN python scripts/seed_demo_artifacts.py
|
| 76 |
|
| 77 |
# --- HF Spaces convention ---
|