
AGENTS.md — NeuroBridge Enterprise Pipeline

Read this file at the start of every session. It is the contract every agent (human or LLM) operates under in this repository.

1. Project Vision

NeuroBridge Enterprise is a B2B SaaS platform that solves three structural problems in real-world clinical/biomedical ML pipelines:

  1. Data Drift between hospitals and acquisition sites (multi-center MRI).
  2. Missing Modalities (a patient may have MRI but no EEG, or vice versa).
  3. Artifacts in raw biosignals (eye blinks, line noise, motion in EEG).

The platform exposes three production pipelines behind a single FastAPI surface:

| Modality | Pipeline | Core Technique |
| --- | --- | --- |
| Image (MRI / fMRI) | src/pipelines/mri_pipeline.py | ComBat Harmonization for site-level domain shift |
| Signal (EEG) | src/pipelines/eeg_pipeline.py | MNE-Python + ICA for artifact removal |
| Tabular (BBB / molecules) | src/pipelines/bbb_pipeline.py | RDKit Morgan fingerprints from SMILES |

All experiment runs are tracked in MLflow. All services ship as Docker images.

2. Directory Layout (load-bearing — do not violate)

.
├── AGENTS.md                 # This file
├── README.md
├── requirements.txt
├── pytest.ini
├── conftest.py               # Repo-wide pytest fixtures (autouse: pins MLFLOW_TRACKING_URI to tmp dir for test isolation)
├── Dockerfile                # Production image (FastAPI + pipelines)
├── docker-compose.yml        # api + mlflow services for local stack
├── .dockerignore
├── .streamlit/
│   └── config.toml           # Streamlit theme tokens
├── data/
│   ├── raw/                  # Untouched source data. NEVER train on this directly.
│   └── processed/            # Pipeline output as Parquet (preserves dtypes; overwritten each run; see §4).
├── src/
│   ├── api/                  # FastAPI surface
│   │   ├── main.py           # App factory + /health
│   │   ├── routes.py         # POST /pipeline/{bbb,eeg,mri} dispatch
│   │   └── schemas.py        # Shared Pydantic request/response models
│   ├── core/                 # Cross-cutting utilities
│   │   ├── logger.py         # Structured logger (mandatory in every pipeline)
│   │   ├── determinism.py    # Thread-pin env vars (OMP/OPENBLAS/MKL/pyarrow)
│   │   ├── storage.py        # Parquet read/write helpers (snappy, single-threaded, deterministic)
│   │   └── tracking.py       # MLflow `track_pipeline_run` context manager (see §7)
│   ├── pipelines/            # One file per modality. Pure functions + a `run_pipeline()` entry.
│   ├── models/               # Downstream decision-layer models
│   │   ├── bbb_model.py      # BBB-permeability classifier + SHAP explainer + trainer CLI
│   │   └── mri_model.py      # Volumetric MRI ONNX inference surface (external training)
│   ├── llm/                  # Natural-language explainers (template + OpenRouter fallback)
│   ├── rag/                  # Fastembed + FAISS retrieval layer
│   ├── agents/               # Tool registry + guarded OpenRouter orchestrator
│   └── frontend/
│       └── app.py            # Streamlit dashboard
└── tests/
    ├── core/
    ├── api/
    ├── frontend/
    ├── pipelines/            # incl. test_cross_pipeline_smoke.py for integration coverage
    └── fixtures/             # Tiny synthetic data files used by tests (NOT a Python package — no __init__.py)

Rules:

  • New modality → new file under src/pipelines/. No mixing modalities in one file.
  • Anything imported by 2+ pipelines → src/core/.
  • Pipeline code (src/pipelines/, src/core/) must not read from or write to any path outside data/. Test code may read tests/fixtures/. The data/ boundary is the storage contract for production data.
  • tests/fixtures/ holds CSV / numpy / DICOM blobs — do not add an __init__.py there.

3. Coding Standards

  • Python 3.10–3.12 (the pinned native-extension dependencies do not yet ship cp313+ wheels). Use from __future__ import annotations when needed for forward refs.
  • Type hints are mandatory on every public function/method (parameters and return).
  • Modular structure. One responsibility per function. If a function exceeds ~40 lines or 3 levels of nesting, split it.
  • TDD is the default workflow. Write the failing test first, watch it fail, then implement. Tests live in tests/ mirroring src/.
  • Logging is mandatory for every pipeline. Use src.core.logger.get_logger(__name__). No print() in src/.
  • Docstrings on every public function — one-line summary + Args/Returns when non-trivial.
  • No hard-coded paths in business logic. Pass paths as arguments to run_pipeline(input_path, output_path).
  • Format & lint: keep imports sorted; prefer pathlib.Path over os.path.
  • Commits are small and frequent. Each green test → commit.

4. Data Readiness Principles

The Golden Rule: never train a model directly on raw data. Raw data must always pass through a pipeline first.

Every modality pipeline MUST guarantee, before writing to data/processed/:

  1. Schema validity — required columns present, expected dtypes.
  2. Domain validity — invalid records (e.g. unparseable SMILES, NaN-only EEG epochs, corrupted DICOMs) are logged with their identifier and dropped, never silently coerced.
  3. Determinism — given the same data/raw/ input, the pipeline produces byte-identical data/processed/ output. No wall-clock timestamps in output, no randomness without explicit seeding.
  4. Traceability — log row count in, row count out, and percentage dropped at INFO level.
  5. Idempotence — re-running the pipeline overwrites data/processed/ cleanly; no append, no partial writes.

Determinism environment: byte-identical output requires deterministic floating-point reductions. Each pipeline module sets OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1, and pins pyarrow to single-threaded mode at import time. CI runners and developer machines do not need to set these manually — the pipeline modules handle it — but overriding them in the environment will break Determinism rule 3.
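
A minimal sketch of what src/core/determinism.py is described to do at import time (the real module may differ in detail; pyarrow's set_cpu_count / set_io_thread_count are its documented thread controls):

```python
# Sketch of the import-time pinning contract. setdefault means a value already
# exported in the environment wins -- which is exactly why overriding these
# vars externally breaks Determinism rule 3.
import os

for _var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ.setdefault(_var, "1")

import pyarrow as pa  # import only after the env vars are in place

pa.set_cpu_count(1)        # pin the compute thread pool
pa.set_io_thread_count(1)  # pin the I/O thread pool
```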

ComBat determinism boundary: the MRI pipeline's harmonize_combat wraps neuroHarmonize.harmonizationLearn and rounds its output to 14 decimal places (np.round(arr, 14)). This is a defensive measure: with the thread-pinning above, harmonization is already bit-identical, but the rounding guarantees byte-identity even when the env-pin discipline is bypassed (e.g. a sub-process that re-exports a thread count). It discards roughly 5 trailing mantissa bits of float64, well below ComBat's biological effect-size precision floor.
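
In code, the defensive rounding is a single line on top of neuroHarmonize's documented (model, adjusted-data) return shape (variable names here are illustrative):

```python
import numpy as np
from neuroHarmonize import harmonizationLearn

# features: subjects x ROI-features array; covars: DataFrame with a SITE column.
model, harmonized = harmonizationLearn(features, covars)
harmonized = np.round(harmonized, 14)  # discard ~5 trailing mantissa bits for byte-identity
```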

A model training script is allowed to import from data/processed/ only. If a training script references data/raw/ directly, that is a bug and must be refactored into a pipeline.

5. How to Add a New Pipeline (checklist)

  1. Add tests/pipelines/test_<name>_pipeline.py with the failing tests first.
  2. Create src/pipelines/<name>_pipeline.py exposing run_pipeline(input_path: Path, output_path: Path) -> None.
  3. Use get_logger(__name__) for all status output (per §3).
  4. Apply the §4 Data Readiness contract: validate + drop invalid records with a logged WARNING (identifier + count), log row count in/out/dropped at INFO, write deterministically, and overwrite (do not append) on re-run.
  5. Write deterministic output to output_path.
  6. Document any new dependency in requirements.txt (pinned).
  7. Add a one-line entry to this file's pipeline table.
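
A minimal skeleton satisfying steps 2 through 5 of this checklist. The schema, the domain check, and the track_pipeline_run call signature are illustrative placeholders; the real tracking helper is described in §7:

```python
from __future__ import annotations

from pathlib import Path

import pandas as pd

from src.core.logger import get_logger
from src.core.tracking import track_pipeline_run

logger = get_logger(__name__)

REQUIRED_COLUMNS = {"subject_id", "value"}  # placeholder schema


def run_pipeline(input_path: Path, output_path: Path) -> None:
    """Validate raw records, drop invalid ones, write deterministic Parquet."""
    with track_pipeline_run("example_pipeline", params={"input": str(input_path)}):
        df = pd.read_csv(input_path)
        missing = REQUIRED_COLUMNS - set(df.columns)
        if missing:
            raise ValueError(f"schema violation, missing columns: {missing}")

        valid = df.dropna(subset=["value"])  # placeholder domain-validity check
        for identifier in df.loc[df.index.difference(valid.index), "subject_id"]:
            logger.warning("dropping invalid record: %s", identifier)

        logger.info(
            "rows_in=%d rows_out=%d dropped_pct=%.2f",
            len(df), len(valid),
            100 * (len(df) - len(valid)) / max(len(df), 1),
        )
        # Overwrite, never append (idempotence, §4 rule 5).
        valid.to_parquet(output_path, engine="pyarrow",
                         compression="snappy", index=False)
```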

6. Storage Format Convention

All data/processed/ outputs MUST be Parquet (pyarrow engine, compression="snappy"):

  • Preserves dtypes (uint8 fingerprints stay uint8; float64 EEG features stay float64) — CSV silently widens numeric columns and is unsuitable for the high-dimensional float arrays produced by the EEG and MRI pipelines.
  • Byte-deterministic with fixed compression and single-threaded writes (satisfies §4 Determinism).
  • Read with pd.read_parquet(path); no dtype hints required.

The raw data/raw/ inputs may be in any vendor-supplied format (CSV for BBBP, EDF/FIF for EEG, NIfTI for MRI).
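
The whole convention reduces to two pandas calls; a sketch (the helpers in src/core/storage.py are described as wrapping this pattern with the §4 single-threaded pinning):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"fp": np.zeros(4, dtype="uint8"), "feat": np.ones(4)})

# Write: fixed engine + compression so re-runs are byte-identical (§4).
df.to_parquet("data/processed/example_features.parquet",
              engine="pyarrow", compression="snappy", index=False)

# Read: dtypes round-trip exactly (uint8 stays uint8); no dtype hints needed.
df2 = pd.read_parquet("data/processed/example_features.parquet")
```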

7. Experiment Tracking

Every run_pipeline() invocation logs to MLflow via src.core.tracking.track_pipeline_run:

  • Experiment names match the pipeline module: bbb_pipeline, eeg_pipeline, mri_pipeline.
  • Params: input/output paths and pipeline hyperparameters (e.g. BBB n_bits / radius, EEG epoch_duration_s / random_state, MRI intensity_threshold / n_roi_axes).
  • Metrics: row counts (rows_in, rows_out, rows_dropped — or modality equivalent like subjects_in/out/dropped) and duration_sec.
  • Artifact: the produced Parquet at data/processed/<modality>_features.parquet.

The tracking URI is read from MLFLOW_TRACKING_URI (defaults to ./mlruns/ when unset).

Live-demo lifeline: set NEUROBRIDGE_DISABLE_MLFLOW=1 to skip tracking entirely — the helper yields None and emits no MLflow calls. Use this when the tracking server is unreachable (offline demo, network outage, or CI without an MLflow service). Pipelines complete normally; only the run metadata is lost.
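
The helper's contract, reduced to a sketch (the real implementation in src/core/tracking.py may differ in signature and in exactly what it logs):

```python
import os
import time
from contextlib import contextmanager
from typing import Iterator, Optional

import mlflow


@contextmanager
def track_pipeline_run(experiment: str, params: dict) -> Iterator[Optional[mlflow.ActiveRun]]:
    if os.environ.get("NEUROBRIDGE_DISABLE_MLFLOW") == "1":
        yield None  # live-demo lifeline: no MLflow calls at all
        return
    mlflow.set_experiment(experiment)
    start = time.monotonic()
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        yield run
        mlflow.log_metric("duration_sec", time.monotonic() - start)
```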

The repo-wide conftest.py autouse fixture pins MLFLOW_TRACKING_URI to a tmp directory for the test session, so the production mlruns/ directory is never written by the test suite. Tests that interact with MLflow (in tests/core/test_tracking.py and the per-pipeline Test<Modality>PipelineMLflow classes) all share this isolated store.

8. Decision Layer (Downstream Models)

Pipelines produce features (data/processed/<modality>_features.parquet). Downstream models live in src/models/ and consume processed features or a deterministic model-local preprocessing contract:

| Model | File | Output | Endpoint |
| --- | --- | --- | --- |
| BBB permeability | src/models/bbb_model.py | data/processed/bbb_model.joblib | POST /predict/bbb |
| MRI image classifier | src/models/mri_model.py | data/processed/mri_model.onnx | POST /predict/mri |

In-repo trainable downstream model modules expose a uniform surface:

  • train(df, label_col, ...) → fitted classifier
  • save(model, path) / load(path) → joblib artifact I/O
  • predict_with_proba(model, smiles) → {label, confidence} (confidence is the max-class probability)
  • explain_prediction(model, smiles, top_k) → SHAP top-k attributions sorted by |shap_value| descending
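
Putting the surface together, a typical consumer session looks roughly like this (the paths, label column name, and SMILES string are illustrative assumptions):

```python
import pandas as pd

from src.models import bbb_model

df = pd.read_parquet("data/processed/bbb_features.parquet")
model = bbb_model.train(df, label_col="p_np")       # label column name assumed
bbb_model.save(model, "data/processed/bbb_model.joblib")

model = bbb_model.load("data/processed/bbb_model.joblib")
result = bbb_model.predict_with_proba(model, "CCO")  # {"label": ..., "confidence": ...}
top = bbb_model.explain_prediction(model, "CCO", top_k=5)
```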

MRI DL exception: training happens outside this repo and exports ONNX, so it does not expose train() or SHAP. Runtime loads the ONNX artifact with mri_model.load(), preprocesses one NIfTI via the same deterministic resize + z-score contract used during training (preprocess_nifti()), then returns class probabilities via predict_nifti().
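
The runtime path reduces to standard onnxruntime calls; a sketch assuming the exported graph takes one batched input and emits logits (the input layout and softmax step are assumptions, not verified against the export):

```python
import numpy as np
import onnxruntime as ort

from src.models.mri_model import preprocess_nifti  # deterministic resize + z-score

session = ort.InferenceSession("data/processed/mri_model.onnx",
                               providers=["CPUExecutionProvider"])
vol = preprocess_nifti("scan.nii.gz")
inp = {session.get_inputs()[0].name: vol[None, ...].astype(np.float32)}
logits = session.run(None, inp)[0]
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
```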

The API loads model artifacts at request time. If an artifact is missing, the endpoint returns HTTP 503 with a remediation hint instead of failing process startup. BBB points at the trainer CLI (python -m src.models.bbb_model); MRI points at the external ONNX export path.

Determinism: all in-repo classifiers are seeded (random_state=42 default), n_jobs=1 (no tree-parallelism races). Re-running the BBB trainer on the same Parquet produces identical predictions. MRI ONNX determinism is bounded by the exported model plus the fixed runtime preprocessing contract.

Override the BBB_MODEL_PATH and MRI_MODEL_PATH env vars to point the API at non-default artifact locations (used by tests for tmp_path isolation). As above, a missing artifact yields HTTP 503 with a remediation hint rather than a startup failure.

Calibration metadata (Day 6): train() does an 80/20 stratified split, computes precision-at-confidence-threshold bins on the held-out test set, and stashes them on model._neurobridge_calibration: list[dict] (sorted ascending by threshold). The API includes the bin matching each prediction's confidence in BBBPredictResponse.calibration. UI uses this to render an honest trust caption ("≥75% confident → 92% precision, n=18"). For tiny test fixtures where stratified split fails, calibration falls back to zero-support bins so the API contract is always populated.
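
The bin computation is, in essence, the following (the threshold grid is illustrative; "precision" here means the fraction correct among held-out predictions at or above each confidence threshold, which is how the section above uses the term):

```python
import numpy as np


def calibration_bins(y_true: np.ndarray, proba: np.ndarray,
                     thresholds=(0.5, 0.6, 0.75, 0.9)) -> list[dict]:
    """Precision among held-out predictions at or above each confidence threshold."""
    conf = proba.max(axis=1)
    pred = proba.argmax(axis=1)
    bins = []
    for t in sorted(thresholds):  # ascending, matching the stashed list's order
        mask = conf >= t
        n = int(mask.sum())
        precision = float((pred[mask] == y_true[mask]).mean()) if n else 0.0
        bins.append({"threshold": t, "precision": precision, "n": n})
    return bins  # zero-support bins (n=0) keep the API contract populated
```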

9. Demo Features (Day 6)

The frontend includes three jury-day demo amplifiers that don't change the core contract:

  • Edge-case dropdown (BBB tab): a curated catalog of five robustness probes, including invalid SMILES, empty input, an OOD macrocycle (cyclosporine-like), and a heavily halogenated aromatic. Each has a stated expectation; the UI visualizes graceful failure (HTTP 400 → recoverable warning, never a crash).
  • Calibration trust caption (BBB decision card): renders the precision-at-confidence-threshold from BBBPredictResponse.calibration. Demonstrates that the system knows what it doesn't know.
  • MRI ComBat diagnostics (MRI tab): POST /pipeline/mri/diagnostics runs the pipeline twice (pre + post ComBat) and returns long-format data + site-gap KPIs (Pre, Post, Reduction factor). The UI renders a faceted altair density plot — visual proof that ComBat removes site-driven domain shift.

10. Drift Surface (Day 7)

Each predict route maintains a per-worker rolling window of recent prediction confidences (collections.deque(maxlen=100)). Train-time median + std are stashed on model._neurobridge_train_stats (joblib roundtrip-safe). The drift z-score is (rolling_median − train_median) / max(train_std, 1e-9), computed only when the buffer holds ≥10 samples AND the model has the train-stats attribute. The /predict/bbb response carries drift_z: float | None and rolling_n: int. The UI renders a one-line caption with a magnitude tag (in-band, mild, significant). Worker restart clears the deque; this is acceptable for demo and removes the audit-trail concern.
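
Condensed, the per-worker drift logic is the following sketch (the key names inside _neurobridge_train_stats are assumed):

```python
import statistics
from collections import deque

_confidences: deque[float] = deque(maxlen=100)  # per-worker rolling window


def drift_z(model) -> float | None:
    stats = getattr(model, "_neurobridge_train_stats", None)
    if stats is None or len(_confidences) < 10:
        return None  # not enough samples yet, or a model without train stats
    rolling_median = statistics.median(_confidences)
    return (rolling_median - stats["median"]) / max(stats["std"], 1e-9)
```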

11. LLM Explainer Surface (Day 7 + 9)

src/llm/explainer.py is the single entry point for natural-language rationales. explain(payload) always returns {rationale, source, model}. The deterministic template path is the source of truth for tests; the LLM path is OpenRouter via the openai==1.51.0 SDK and walks a smartest → smallest free-tier fallback chain (_DEFAULT_FREE_MODEL_CHAIN, 10 ids — head: inclusionai/ling-2.6-1t:free). The chain is overridable at runtime via OPENROUTER_FREE_MODELS (comma-separated). Status-code classification:

  • 401 → key is bad → bail to template + actionable WARNING (rotate at https://openrouter.ai/keys, enable free-model data-sharing at https://openrouter.ai/settings/privacy).
  • 400 → prompt-shape mismatch on this model → advance to next.
  • 402 / 403 / 404 / 429 / 5xx → advance to next.
  • Network/timeout → bail to template (switching models won't help).
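
A sketch of that classification as a fallback walk, using the openai 1.x exception hierarchy (client construction and the return shape of the caller's template path are left to the surrounding module):

```python
import openai


def _walk_chain(client: openai.OpenAI, chain: list[str],
                messages: list[dict]) -> dict | None:
    """Return an LLM rationale, or None when the caller should use the template."""
    for model_id in chain:  # smartest → smallest
        try:
            resp = client.chat.completions.create(model=model_id, messages=messages)
            return {"rationale": resp.choices[0].message.content,
                    "source": "llm", "model": model_id}
        except openai.AuthenticationError:  # 401 → bad key; bail to template + WARNING
            return None
        except openai.APIConnectionError:   # network/timeout → switching models won't help
            return None
        except openai.APIStatusError:       # 400/402/403/404/429/5xx → advance to next
            continue
    return None  # chain exhausted → template
```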

Two env knobs control the gate:

  • OPENROUTER_API_KEY — when absent, fallback to template.
  • NEUROBRIDGE_DISABLE_LLM=1 — hard kill-switch; force template even if a key is set. Use this for demo days when you want fully deterministic, reproducible rationales.

Prompt design (_build_llm_prompt): two intent modes. When the caller supplies user_question, the model is instructed to language-match (Turkish question → Turkish answer), answer the question directly (not a canned paper-style summary), and respond conversationally to off-topic / greeting questions. When no user_question is supplied, the prompt falls back to the original 2-4 sentence paper-style rationale.

The POST /explain/bbb endpoint mirrors this contract. Pydantic enforces a non-empty top_features list (422 on empty); every other failure mode degrades to template + WARNING log + source="template".

Diagnostics: GET /diag/openrouter (src/api/main.py) returns key-presence (length + 12-char prefix only), kill-switch state, chain length, first model id, and the result of an 8-token probe call against that model. Surfaced in Streamlit as the sidebar "🔧 Diagnose LLM" button. Use it when the deployed Space shows source="template" unexpectedly — the most common causes are a missing/misnamed OPENROUTER_API_KEY Space secret or a revoked key.

12. Multi-Modal Explainer (Day 8)

src/llm/explainer.py exposes explain(payload, modality) where modality ∈ {"bbb", "eeg", "mri"}. Each modality has its own deterministic template (_template_explain_bbb / _eeg / _mri) and its own LLM prompt header. Unknown modality strings degrade to the BBB template with a warning log; the function never raises. The hybrid OpenRouter fallback contract from §11 applies uniformly.
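
The dispatch is a plain table lookup; a self-contained sketch with stub lambdas standing in for the real _template_explain_* functions:

```python
import logging

logger = logging.getLogger(__name__)

# Stubs standing in for _template_explain_bbb / _template_explain_eeg / _template_explain_mri.
_TEMPLATES = {
    "bbb": lambda p: f"BBB rationale for {p}",
    "eeg": lambda p: f"EEG rationale for {p}",
    "mri": lambda p: f"MRI rationale for {p}",
}


def explain(payload: dict, modality: str = "bbb") -> dict:
    template = _TEMPLATES.get(modality)
    if template is None:  # unknown modality: degrade to BBB with a warning, never raise
        logger.warning("unknown modality %r, degrading to bbb template", modality)
        template = _TEMPLATES["bbb"]
    return {"rationale": template(payload), "source": "template", "model": None}
```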

The API exposes three matching endpoints — POST /explain/{bbb,eeg,mri} — each on the explain_router (/explain prefix). Streamlit surfaces the BBB version in the AI Assistant tab and the EEG/MRI versions as inline expanders inside their respective pipeline tabs.

13. Experiments Surface (Day 8)

GET /experiments/runs returns up to 50 most recent MLflow runs across the bbb/eeg/mri experiments, flattened into a list of MLflowRunSummary (run_id, experiment_name, start_time, status, metrics, params). POST /experiments/diff {run_id_a, run_id_b} returns a side-by-side metric+param diff (RunDiffRow).
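
Under the hood this is essentially one mlflow.search_runs call across the three experiments; a sketch (experiment-name resolution for each run is omitted):

```python
import mlflow

runs = mlflow.search_runs(
    experiment_names=["bbb_pipeline", "eeg_pipeline", "mri_pipeline"],
    order_by=["attributes.start_time DESC"],
    max_results=50,
    output_format="list",  # list of mlflow.entities.Run instead of a DataFrame
)
summaries = [
    {"run_id": r.info.run_id, "status": r.info.status,
     "start_time": r.info.start_time,
     "metrics": r.data.metrics, "params": r.data.params}
    for r in runs
]
```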

When NEUROBRIDGE_DISABLE_MLFLOW=1, both endpoints return empty responses without raising — useful for deployments where there is no writable mlruns/ tree or the tracking server is unavailable. Unknown run ids → 404.

The Streamlit "Experiments" tab is the user-facing surface. Cached in session state with an explicit Refresh button.

14. Deploy Surface (Day 8)

Dockerfile.hf is the Hugging Face Spaces image. Single container, two processes (FastAPI :8000 + Streamlit :7860) launched via supervisord.conf. Build-time RUN python -m src.models.bbb_model bakes the BBB model artifact into the image so the first /predict/bbb call is instant on cold start. Build-time RAG ingest creates data/processed/faiss_index/.

docker-entrypoint.sh is the runtime guard for local Docker/Compose demos: when a mounted ./data volume hides image-built artifacts, it seeds fixture raw data, rebuilds missing BBB features/model artifacts, and rebuilds the FAISS index before starting supervisord. It does not bake NEUROBRIDGE_DISABLE_MLFLOW=1 into the image; operators may set that env at runtime if their tracking service is unavailable.

Default environment: DEPLOY_ENV=hf_spaces. The LLM kill-switch is not set — deployed Spaces use the real OpenRouter free-tier chain (§11) when OPENROUTER_API_KEY is configured in the Space's Secrets panel. Set NEUROBRIDGE_DISABLE_LLM=1 only when you want to force the deterministic template path for a fully-reproducible demo.

The README's YAML front-matter declares the Space metadata (SDK=docker, port=7860, app_file=src/frontend/app.py).

15. Orchestrator Agent Surface

src/agents/orchestrator.py exposes a single-agent function-calling loop over the openai SDK (no LangChain / framework dep). The API enables the guarded workflow mode: if the LLM skips or mis-shapes a required tool call, deterministic routing in src/agents/routing.py falls back to exactly one pipeline tool, then exactly one retrieval tool, then final synthesis. The agent holds 4 tools, defined in src/agents/tools.py:

  • run_bbb_pipeline(smiles, top_k) — wraps POST /predict/bbb
  • run_eeg_pipeline(input_path) — wraps POST /pipeline/eeg
  • run_mri_pipeline(input_dir, sites_csv=None) — wraps POST /pipeline/mri and defaults sites_csv to <input_dir>/sites.csv
  • retrieve_context(query, k) — wraps src/rag/retrieve.py
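
Each registry entry is an OpenAI function-calling schema; the first tool looks roughly like this (the description strings are illustrative):

```python
RUN_BBB_PIPELINE_TOOL = {
    "type": "function",
    "function": {
        "name": "run_bbb_pipeline",
        "description": "Predict BBB permeability for a molecule given as SMILES.",
        "parameters": {
            "type": "object",
            "properties": {
                "smiles": {"type": "string", "description": "Input molecule as SMILES"},
                "top_k": {"type": "integer", "description": "Number of SHAP features to return"},
            },
            "required": ["smiles"],
        },
    },
}
```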

The system prompt (src/agents/prompts.py:ORCHESTRATOR_SYSTEM_PROMPT) describes the workflow: pick exactly one pipeline → run it → formulate a focused retrieval query → call retrieve_context → synthesize a 3-5 sentence response that cites at least one chunk. The API-side workflow guard enforces that order in code; the prompt is guidance, not the only control plane. Language of the final response is mirrored from the user's question.

POST /agent/run is the public surface. It accepts user_input, optional user_question, and optional MRI sites_csv. Default model is google/gemini-2.0-flash-exp:free on OpenRouter (function-calling support verified). Override via NEUROBRIDGE_AGENT_MODEL env var. Returns 503 when OPENROUTER_API_KEY is unset.

Diagnostics: GET /diag/agent returns key presence, configured model, RAG index status (chunk count), and the registered tool names.

16. RAG Surface

src/rag/ is the retrieval layer. Stack: fastembed (BAAI/bge-small-en-v1.5, 384-dim, ONNX, no torch dep) for embeddings + faiss-cpu (IndexFlatIP after L2-norm = cosine) for vector search.
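
The core of the index build and lookup is only a handful of lines; a sketch with chunking and on-disk persistence omitted (the sample chunks are illustrative):

```python
import faiss
import numpy as np
from fastembed import TextEmbedding

chunks = ["Lipinski's rule of five ...", "ComBat removes site effects ..."]
embedder = TextEmbedding("BAAI/bge-small-en-v1.5")
vecs = np.array(list(embedder.embed(chunks)), dtype="float32")

faiss.normalize_L2(vecs)                   # L2-normalize so inner product == cosine
index = faiss.IndexFlatIP(vecs.shape[1])   # 384-dim flat index
index.add(vecs)

query = np.array(list(embedder.embed(["What removes scanner site effects?"])),
                 dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)       # top-2 chunk ids + cosine scores
```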

Knowledge base lives at data/knowledge_base/ (gitignored; user-supplied .md / .txt / .pdf). Build the FAISS index with:

python -m src.rag.ingest [<input_dir> [<output_dir>]]

Defaults: input=data/knowledge_base/, output=data/processed/faiss_index/. The Dockerfile runs this at build time so deployed Spaces start with a populated index. docker-entrypoint.sh also rebuilds the index at startup when a mounted data/ volume hides the image-built artifacts. Empty KB → empty index → retrieve_context returns 0 chunks; the agent surfaces this and answers from the pipeline result alone.

tests/fixtures/kb_sample/ ships 3 seed markdown files (Lipinski, ComBat, MNE+ICA) — these double as test fixtures and as the demo seed if no user-supplied PDFs are added.