docs: add AGENTS.md with vision, layout, standards, data readiness rules
AGENTS.md
ADDED
@@ -0,0 +1,87 @@
# AGENTS.md — NeuroBridge Enterprise Pipeline

> Read this file at the start of every session. It is the contract every agent
> (human or LLM) operates under in this repository.

## 1. Project Vision

**NeuroBridge Enterprise** is a B2B SaaS platform that solves three structural
problems in real-world clinical/biomedical ML pipelines:

1. **Data Drift** between hospitals and acquisition sites (multi-center MRI).
2. **Missing Modalities** (a patient may have MRI but no EEG, or vice versa).
3. **Artifacts** in raw biosignals (eye blinks, line noise, motion in EEG).

The platform exposes three production pipelines behind a single FastAPI surface:

| Modality | Pipeline | Core Technique |
|---|---|---|
| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
| Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
| Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |

All experiment runs are tracked in **MLflow**. All services ship as **Docker** images.

## 2. Directory Layout (load-bearing — do not violate)

```
.
├── AGENTS.md              # This file
├── requirements.txt
├── pytest.ini
├── data/
│   ├── raw/               # Untouched source data. NEVER train on this directly.
│   └── processed/         # Pipeline output. Model-ready. Versioned outputs.
├── src/
│   ├── api/               # FastAPI routers, request/response schemas
│   ├── pipelines/         # One file per modality. Pure functions + a `run_pipeline()` entry.
│   └── core/              # Cross-cutting utilities: logging, config, MLflow helpers
└── tests/
    ├── core/
    ├── pipelines/
    └── fixtures/          # Tiny synthetic data files used by tests (NOT a Python package — no __init__.py)
```

**Rules:**
- New modality → new file under `src/pipelines/`. No mixing modalities in one file.
- Anything imported by 2+ pipelines → `src/core/`.
- Never read from or write to paths outside `data/`. The `data/` boundary is the storage contract.
- `tests/fixtures/` holds CSV / numpy / DICOM blobs — do **not** add an `__init__.py` there.
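
One way to enforce the `data/` boundary rule in code is a small guard that resolves a path and rejects it if it escapes the data root. The sketch below is illustrative only: the helper name `resolve_within` is hypothetical and not part of this repository.

```python
from pathlib import Path


def resolve_within(data_root: Path, candidate: Path) -> Path:
    """Return the resolved candidate path, or raise if it escapes data_root."""
    root = data_root.resolve()
    # Joining first lets relative candidates be interpreted under the root;
    # an absolute candidate replaces the root and is then checked as-is.
    resolved = (root / candidate).resolve()
    # Path.is_relative_to exists from Python 3.9, within the 3.10+ floor.
    if not resolved.is_relative_to(root):
        raise ValueError(f"{candidate} resolves outside {root}")
    return resolved
```

A pipeline could call `resolve_within(Path("data"), output_path)` before writing, so that a stray `../` segment fails loudly instead of writing outside the storage contract.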

## 3. Coding Standards

- **Python 3.10+.** Use `from __future__ import annotations` when needed for forward refs.
- **Type hints are mandatory** on every public function/method (parameters and return).
- **Modular structure.** One responsibility per function. If a function exceeds ~40 lines or 3 levels of nesting, split it.
- **TDD is the default workflow.** Write the failing test first, watch it fail, then implement. Tests live in `tests/` mirroring `src/`.
- **Logging is mandatory** for every pipeline. Use `src.core.logger.get_logger(__name__)`. No `print()` in `src/`.
- **Docstrings** on every public function — one-line summary + Args/Returns when non-trivial.
- **No hard-coded paths in business logic.** Pass paths as arguments to `run_pipeline(input_path, output_path)`.
- **Format & lint:** keep imports sorted; prefer `pathlib.Path` over `os.path`.
- **Commits are small and frequent.** Each green test → commit.
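
This file names `src.core.logger.get_logger(__name__)` but does not show its implementation. The following is a minimal sketch of what such a helper might look like using only the standard `logging` module; the format string, the default level, and the choice of `stderr` are assumptions, not the repository's actual settings.

```python
import logging
import sys

# Assumed log format; the real project may use a different one.
_FORMAT = "%(asctime)s %(levelname)s %(name)s: %(message)s"


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger with exactly one stream handler attached.

    Args:
        name: Usually ``__name__`` of the calling module.
        level: Minimum level emitted by this logger.

    Returns:
        The configured ``logging.Logger`` instance.
    """
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter(_FORMAT))
        logger.addHandler(handler)
        logger.propagate = False  # keep messages from printing twice
    logger.setLevel(level)
    return logger
```

The function also doubles as an example of the standards above: full type hints, a docstring with Args/Returns, and no `print()`.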

## 4. Data Readiness Principles

> **The Golden Rule: never train a model directly on raw data. Raw data must always pass through a pipeline first.**

Every modality pipeline MUST guarantee, before writing to `data/processed/`:

1. **Schema validity** — required columns present, expected dtypes.
2. **Domain validity** — invalid records (e.g. unparseable SMILES, NaN-only EEG epochs, corrupted DICOMs) are **logged with their identifier and dropped**, never silently coerced.
3. **Determinism** — given the same `data/raw/` input, the pipeline produces byte-identical `data/processed/` output. No wall-clock timestamps in the output, and no randomness without an explicit seed.
4. **Traceability** — log row count in, row count out, and percentage dropped at INFO level.
5. **Idempotence** — re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.
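
The five guarantees above can be sketched for a toy tabular input as follows. Everything specific here is an assumption made for illustration: the `id`/`value` schema, the helper names `validate_rows` and `write_atomic`, raising on a schema failure, and the plain `logging.getLogger` call standing in for `get_logger(__name__)`.

```python
from __future__ import annotations

import csv
import logging
import math
import os
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)  # stand-in for get_logger(__name__)

REQUIRED_COLUMNS = ("id", "value")  # hypothetical schema for illustration


def validate_rows(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Check the schema, drop invalid rows with a logged warning, sort by id."""
    kept: list[dict[str, str]] = []
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:  # schema validity: fail loudly rather than guess
            raise ValueError(f"missing required columns: {missing}")
        try:
            valid = math.isfinite(float(row["value"]))
        except (TypeError, ValueError):
            valid = False
        if valid:
            kept.append(row)
        else:  # domain validity: log the identifier, then drop
            logger.warning("dropping record id=%s: invalid value %r",
                           row["id"], row["value"])
    dropped_pct = 100.0 * (len(rows) - len(kept)) / len(rows) if rows else 0.0
    # Traceability: rows in, rows out, percentage dropped.
    logger.info("rows in=%d out=%d dropped=%.1f%%",
                len(rows), len(kept), dropped_pct)
    return sorted(kept, key=lambda row: row["id"])  # fixed order for determinism


def write_atomic(rows: list[dict[str, str]], output_path: Path) -> None:
    """Write rows as CSV to a temp file, then swap it into place."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(
        "w", dir=output_path.parent, delete=False, newline="", encoding="utf-8"
    ) as tmp:
        writer = csv.DictWriter(tmp, fieldnames=REQUIRED_COLUMNS,
                                extrasaction="ignore", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
    # Idempotence: replace the previous output wholesale, never append.
    os.replace(tmp.name, output_path)
```

Writing to a temporary file in the destination directory and then calling `os.replace` means a reader never sees a half-written file, and the fixed sort order and line terminator keep repeated runs byte-identical.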

A model training script may read from `data/processed/` only. If a
training script references `data/raw/` directly, that is a bug, and the
raw-data handling must be refactored into a pipeline.
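
A rule like this can be checked mechanically. The sketch below is one possible heuristic, not an existing tool in this repository: the function name `find_raw_references` is hypothetical, and a plain substring scan will miss paths that are assembled dynamically.

```python
from __future__ import annotations

from pathlib import Path


def find_raw_references(
    source_files: list[Path], marker: str = "data/raw"
) -> list[tuple[Path, int]]:
    """Return (file, line_number) pairs where the marker string appears."""
    hits: list[tuple[Path, int]] = []
    for path in source_files:
        text = path.read_text(encoding="utf-8")
        for number, line in enumerate(text.splitlines(), start=1):
            if marker in line:
                hits.append((path, number))
    return hits
```

A test could pass in the training scripts and assert that the returned list is empty, which turns the policy into a failing check rather than a review comment.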

## 5. How to Add a New Pipeline (checklist)

1. Add `tests/pipelines/test_<name>_pipeline.py` with the failing tests first.
2. Create `src/pipelines/<name>_pipeline.py` exposing `run_pipeline(input_path: Path, output_path: Path) -> None`.
3. Use `get_logger(__name__)` for all status output.
4. Validate inputs and drop invalid rows with a logged warning.
5. Write deterministic output to `output_path`.
6. Document any new dependency in `requirements.txt` (pinned).
7. Add a one-line entry to this file's pipeline table.
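
Steps 1 to 5 of the checklist might look like this for a made-up `demo` modality that simply cleans a text file. The modality, the test names, and the line-based format are all hypothetical. In the repository the tests and the implementation would live in separate files, and `logging.getLogger` would be replaced by `get_logger(__name__)`.

```python
from __future__ import annotations

import logging
from pathlib import Path

# Step 1: tests, written first (tests/pipelines/test_demo_pipeline.py).
# `tmp_path` is pytest's built-in temporary-directory fixture.


def test_drops_blank_lines(tmp_path: Path) -> None:
    raw, out = tmp_path / "raw.txt", tmp_path / "processed.txt"
    raw.write_text("b\n\n a \n", encoding="utf-8")
    run_pipeline(raw, out)
    assert out.read_text(encoding="utf-8") == "a\nb\n"


def test_rerun_is_byte_identical(tmp_path: Path) -> None:
    raw, out = tmp_path / "raw.txt", tmp_path / "processed.txt"
    raw.write_text("x\ny\n", encoding="utf-8")
    run_pipeline(raw, out)
    first = out.read_bytes()
    run_pipeline(raw, out)
    assert out.read_bytes() == first


# Steps 2 to 5: implementation (src/pipelines/demo_pipeline.py).

logger = logging.getLogger(__name__)  # stand-in for get_logger(__name__)


def run_pipeline(input_path: Path, output_path: Path) -> None:
    """Write the non-blank, stripped lines of input_path in sorted order."""
    lines = input_path.read_text(encoding="utf-8").splitlines()
    kept = sorted(line.strip() for line in lines if line.strip())
    dropped = len(lines) - len(kept)
    if dropped:
        logger.warning("dropped %d blank line(s) from %s", dropped, input_path)
    logger.info("rows in=%d out=%d", len(lines), len(kept))
    # write_text truncates the target, so a re-run overwrites, never appends.
    output_path.write_text("".join(f"{line}\n" for line in kept), encoding="utf-8")
```

With only the two test functions in place, `pytest` fails with a `NameError` on `run_pipeline`, which is the "watch it fail" step; adding the implementation turns both tests green.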