mekosotto Claude Sonnet 4.6 committed
Commit 1a15285 · 1 parent: ff35cee

refactor: pin single-threaded determinism env; close Day-2 doc/typo gaps


- README: float32→float64, "will produce"→"produces" (I1)
- README: add eeg_pipeline.py to repository layout tree (I2)
- README: add Day-2 plan and EEG test file to Where to Look (M1)
- bbb_pipeline + eeg_pipeline: import os/pyarrow at top, set OMP/OPENBLAS/MKL=1 and pa thread counts at module level after logger (I3a)
- AGENTS.md §4: document Determinism environment paragraph (I3b)
- AGENTS.md §1: mark mri_pipeline.py as (planned, Day 3) (I4)
- pytest.ini: add markers block with slow: marker (M2)
- Day-1 plan: no float32 references found — no-op (M3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AGENTS.md CHANGED
@@ -16,7 +16,7 @@ The platform exposes three production pipelines behind a single FastAPI surface:
 
 | Modality | Pipeline | Core Technique |
 |---|---|---|
-| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
+| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` *(planned, Day 3)* | ComBat Harmonization for site-level domain shift |
 | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
 | Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
 
@@ -72,6 +72,13 @@ Every modality pipeline MUST guarantee, before writing to `data/processed/`:
 4. **Traceability** — log row count in, row count out, and percentage dropped at INFO level.
 5. **Idempotence** — re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.
 
+**Determinism environment**: byte-identical output requires deterministic
+floating-point reductions. Each pipeline module sets `OMP_NUM_THREADS=1`,
+`OPENBLAS_NUM_THREADS=1`, `MKL_NUM_THREADS=1`, and pins pyarrow to
+single-threaded mode at import time. CI runners and developer machines do
+not need to set these manually — the pipeline modules handle it — but
+overriding them in the environment will break Determinism rule 3.
+
 A model training script is allowed to import from `data/processed/` only. If a
 training script references `data/raw/` directly, that is a bug and must be
 refactored into a pipeline.
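The "deterministic floating-point reductions" premise behind the new paragraph rests on a standard fact: float addition is not associative, so a thread pool that regroups a summation can change the bytes on disk. A minimal stdlib-only illustration (not project code):

```python
import math

# Floating-point addition is order-sensitive: regrouping the same three
# terms produces two different doubles.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
assert left != right

# Naive left-to-right accumulation drifts; compensated summation does not.
xs = [0.1] * 10
assert sum(xs) != 1.0
assert math.fsum(xs) == 1.0
```

This is exactly the reordering a multi-threaded BLAS or pyarrow reduction can introduce, which is why the pipelines pin everything to one thread.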
README.md CHANGED
@@ -65,7 +65,8 @@ Result lives at `data/processed/eeg_features.parquet`.
 ├── src/
 │   ├── core/logger.py        # Shared structured logger (mandatory in every pipeline)
 │   ├── pipelines/
-│   │   └── bbb_pipeline.py   # Day-1 pipeline (4 public funcs + CLI entry)
+│   │   ├── bbb_pipeline.py   # Day-1 pipeline (4 public funcs + CLI entry)
+│   │   └── eeg_pipeline.py   # Day-2 pipeline (6 public funcs + CLI entry)
 │   └── api/                  # FastAPI surface (placeholder until Day 4+)
 └── tests/
     ├── core/, pipelines/     # Mirror src/ structure
@@ -103,7 +104,7 @@ The pipeline is seeded (`random_state=97`) and produces byte-identical Parquet o
 Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
 compression. This preserves dtypes (`uint8` fingerprint columns stay `uint8` instead of
 widening to `int64` as CSV would do) and yields ~10× smaller files than CSV — material
-for the `float32` EEG features Day 2 will produce. See AGENTS.md §6.
+for the `float64` EEG features Day 2 produces. See AGENTS.md §6.
 
 ## Testing & TDD
 
@@ -124,5 +125,7 @@ finishes in under 2 seconds on a 2024 laptop.
 
 - **Project rules (mandatory reading for any agent):** [`AGENTS.md`](AGENTS.md)
 - **Day-1 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md)
+- **Day-2 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-04-29-neurobridge-day2-eeg-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day2-eeg-pipeline.md)
 - **Logger contract:** [`src/core/logger.py`](src/core/logger.py) + [`tests/core/test_logger.py`](tests/core/test_logger.py)
 - **BBB pipeline:** [`src/pipelines/bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) + [`tests/pipelines/test_bbb_pipeline.py`](tests/pipelines/test_bbb_pipeline.py)
+- **EEG pipeline:** [`src/pipelines/eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) + [`tests/pipelines/test_eeg_pipeline.py`](tests/pipelines/test_eeg_pipeline.py)
pytest.ini CHANGED
@@ -2,3 +2,5 @@
 testpaths = tests
 pythonpath = .
 addopts = -v --tb=short
+markers =
+    slow: marks tests as slow (deselect with '-m "not slow"')
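Registering the marker in `pytest.ini` silences pytest's unknown-marker warnings; applying it is then one decorator per test. A hypothetical test module (the test name is illustrative, not from this repo):

```python
import pytest

@pytest.mark.slow
def test_full_eeg_pipeline_end_to_end():
    """Long-running integration path; skipped on fast CI runs."""
    ...

# Fast runs deselect it with:  pytest -m "not slow"
# Slow-only runs select it:    pytest -m slow
```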
src/pipelines/bbb_pipeline.py CHANGED
@@ -11,10 +11,12 @@ traceability (row count in / out / dropped), and idempotent output.
 from __future__ import annotations
 
 import math
+import os
 from pathlib import Path
 
 import numpy as np
 import pandas as pd
+import pyarrow as pa
 from rdkit import Chem, RDLogger
 from rdkit.Chem import AllChem
 from rdkit.DataStructs import ConvertToNumpyArray
@@ -23,6 +25,15 @@ from src.core.logger import get_logger
 
 logger = get_logger(__name__)
 
+# Pin BLAS / OpenMP / pyarrow to single-threaded mode so byte-determinism
+# (AGENTS.md §4 rule 3) holds across hardware. Without this, multi-threaded
+# floating-point reductions can reorder and produce non-bit-identical output.
+os.environ.setdefault("OMP_NUM_THREADS", "1")
+os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
+os.environ.setdefault("MKL_NUM_THREADS", "1")
+pa.set_cpu_count(1)
+pa.set_io_thread_count(1)
+
 # Suppress RDKit's noisy C++-level warning stream; we surface our own
 # structured warnings via the project logger when a SMILES fails to parse.
 #
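One hedged caveat about the hunk above: OpenBLAS and MKL typically read `*_NUM_THREADS` when their shared library first loads, which happens at `import numpy` — earlier in this module than the `setdefault` calls. Whether the pin takes effect can therefore depend on import order elsewhere in the process. A defensive pattern (a sketch, not this repo's code) sets the environment before any numeric import:

```python
# Pin thread-count env vars *before* the first numeric import, so the BLAS
# runtime initializes its thread pool with these values, not its defaults.
import os

for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ.setdefault(var, "1")

# Only now pull in libraries that load BLAS under the hood, e.g.:
# import numpy as np  # noqa: E402  <- first numeric import goes here
```

`setdefault` (rather than direct assignment) deliberately lets an explicit environment override win, which matches the AGENTS.md note that overriding these values breaks rule 3 knowingly.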
src/pipelines/eeg_pipeline.py CHANGED
@@ -12,11 +12,13 @@ a logged WARNING), determinism (seeded ICA + sklearn RNG), traceability
 """
 from __future__ import annotations
 
+import os
 from pathlib import Path
 
 import mne
 import numpy as np
 import pandas as pd
+import pyarrow as pa
 from mne.preprocessing import ICA
 from scipy import signal as scipy_signal
 from scipy import stats as scipy_stats
@@ -25,6 +27,15 @@ from src.core.logger import get_logger
 
 logger = get_logger(__name__)
 
+# Pin BLAS / OpenMP / pyarrow to single-threaded mode so byte-determinism
+# (AGENTS.md §4 rule 3) holds across hardware. Without this, multi-threaded
+# floating-point reductions can reorder and produce non-bit-identical output.
+os.environ.setdefault("OMP_NUM_THREADS", "1")
+os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
+os.environ.setdefault("MKL_NUM_THREADS", "1")
+pa.set_cpu_count(1)
+pa.set_io_thread_count(1)
+
 # Pearson-correlation threshold for EOG-component rejection in ICA.
 # Real-world EOG components typically score 0.8-0.95 against the EOG channel;
 # 0.9 is a conservative floor that avoids false positives at the cost of
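Determinism rule 3 ("byte-identical output") is mechanically checkable by hashing the artifact from two runs. A stdlib-only sketch with a stand-in file (the path and the two-run driver are assumptions, not this repo's tooling):

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hex digest of a file's exact bytes — the right granularity for rule 3."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "eeg_features.parquet"  # stand-in for data/processed/
    digests = []
    for _ in range(2):  # two "runs" writing the same bytes
        out.write_bytes(b"parquet-bytes-stand-in")
        digests.append(sha256_of(out))
    assert digests[0] == digests[1]  # byte-identical output => equal digests
```

In CI this becomes: run the pipeline twice into separate directories, hash both outputs, and fail the job if the digests differ.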