mekosotto Claude Sonnet 4.6 committed
Commit 1a15285 · 1 parent: ff35cee

refactor: pin single-threaded determinism env; close Day-2 doc/typo gaps


- README: float32→float64, "will produce"→"produces" (I1)
- README: add eeg_pipeline.py to repository layout tree (I2)
- README: add Day-2 plan and EEG test file to Where to Look (M1)
- bbb_pipeline + eeg_pipeline: import os/pyarrow at top, set OMP/OPENBLAS/MKL=1 and pa thread counts at module level after logger (I3a)
- AGENTS.md §4: document Determinism environment paragraph (I3b)
- AGENTS.md §1: mark mri_pipeline.py as (planned, Day 3) (I4)
- pytest.ini: add markers block with slow: marker (M2)
- Day-1 plan: no float32 references found — no-op (M3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AGENTS.md CHANGED
@@ -16,7 +16,7 @@ The platform exposes three production pipelines behind a single FastAPI surface:
 
 | Modality | Pipeline | Core Technique |
 |---|---|---|
-| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
+| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` *(planned, Day 3)* | ComBat Harmonization for site-level domain shift |
 | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
 | Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
 
@@ -72,6 +72,13 @@ Every modality pipeline MUST guarantee, before writing to `data/processed/`:
 4. **Traceability** — log row count in, row count out, and percentage dropped at INFO level.
 5. **Idempotence** — re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.
 
+**Determinism environment**: byte-identical output requires deterministic
+floating-point reductions. Each pipeline module sets `OMP_NUM_THREADS=1`,
+`OPENBLAS_NUM_THREADS=1`, `MKL_NUM_THREADS=1`, and pins pyarrow to
+single-threaded mode at import time. CI runners and developer machines do
+not need to set these manually — the pipeline modules handle it — but
+overriding them in the environment will break Determinism rule 3.
+
 A model training script is allowed to import from `data/processed/` only. If a
 training script references `data/raw/` directly, that is a bug and must be
 refactored into a pipeline.
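The "deterministic floating-point reductions" premise behind the new paragraph rests on a standard fact: float addition is not associative, so a thread pool that regroups a summation can change the bytes on disk. A minimal stdlib-only illustration (not project code):

```python
import math

# Floating-point addition is order-sensitive: regrouping the same three
# terms produces two different doubles.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
assert left != right

# Naive left-to-right accumulation drifts; compensated summation does not.
xs = [0.1] * 10
assert sum(xs) != 1.0
assert math.fsum(xs) == 1.0
```

This is exactly the reordering a multi-threaded BLAS or pyarrow reduction can introduce, which is why the pipelines pin everything to one thread.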
README.md CHANGED
@@ -65,7 +65,8 @@ Result lives at `data/processed/eeg_features.parquet`.
 ├── src/
 │   ├── core/logger.py        # Shared structured logger (mandatory in every pipeline)
 │   ├── pipelines/
-│   │   └── bbb_pipeline.py   # Day-1 pipeline (4 public funcs + CLI entry)
+│   │   ├── bbb_pipeline.py   # Day-1 pipeline (4 public funcs + CLI entry)
+│   │   └── eeg_pipeline.py   # Day-2 pipeline (6 public funcs + CLI entry)
 │   └── api/                  # FastAPI surface (placeholder until Day 4+)
 └── tests/
     ├── core/, pipelines/     # Mirror src/ structure
@@ -103,7 +104,7 @@ The pipeline is seeded (`random_state=97`) and produces byte-identical Parquet o
 Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
 compression. This preserves dtypes (`uint8` fingerprint columns stay `uint8` instead of
 widening to `int64` as CSV would do) and yields ~10× smaller files than CSV — material
-for the `float32` EEG features Day 2 will produce. See AGENTS.md §6.
+for the `float64` EEG features Day 2 produces. See AGENTS.md §6.
 
 ## Testing & TDD
 
@@ -124,5 +125,7 @@ finishes in under 2 seconds on a 2024 laptop.
 
 - **Project rules (mandatory reading for any agent):** [`AGENTS.md`](AGENTS.md)
 - **Day-1 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md)
+- **Day-2 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-04-29-neurobridge-day2-eeg-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day2-eeg-pipeline.md)
 - **Logger contract:** [`src/core/logger.py`](src/core/logger.py) + [`tests/core/test_logger.py`](tests/core/test_logger.py)
 - **BBB pipeline:** [`src/pipelines/bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) + [`tests/pipelines/test_bbb_pipeline.py`](tests/pipelines/test_bbb_pipeline.py)
+- **EEG pipeline:** [`src/pipelines/eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) + [`tests/pipelines/test_eeg_pipeline.py`](tests/pipelines/test_eeg_pipeline.py)
pytest.ini CHANGED
@@ -2,3 +2,5 @@
 testpaths = tests
 pythonpath = .
 addopts = -v --tb=short
+markers =
+    slow: marks tests as slow (deselect with '-m "not slow"')
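Registering the marker in `pytest.ini` silences pytest's unknown-marker warnings; applying it is then one decorator per test. A hypothetical test module (the test name is illustrative, not from this repo):

```python
import pytest

@pytest.mark.slow
def test_full_eeg_pipeline_end_to_end():
    """Long-running integration path; skipped on fast CI runs."""
    ...

# Fast runs deselect it with:  pytest -m "not slow"
# Slow-only runs select it:    pytest -m slow
```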
src/pipelines/bbb_pipeline.py CHANGED
@@ -11,10 +11,12 @@ traceability (row count in / out / dropped), and idempotent output.
 from __future__ import annotations
 
 import math
+import os
 from pathlib import Path
 
 import numpy as np
 import pandas as pd
+import pyarrow as pa
 from rdkit import Chem, RDLogger
 from rdkit.Chem import AllChem
 from rdkit.DataStructs import ConvertToNumpyArray
@@ -23,6 +25,15 @@ from src.core.logger import get_logger
 
 logger = get_logger(__name__)
 
+# Pin BLAS / OpenMP / pyarrow to single-threaded mode so byte-determinism
+# (AGENTS.md §4 rule 3) holds across hardware. Without this, multi-threaded
+# floating-point reductions can reorder and produce non-bit-identical output.
+os.environ.setdefault("OMP_NUM_THREADS", "1")
+os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
+os.environ.setdefault("MKL_NUM_THREADS", "1")
+pa.set_cpu_count(1)
+pa.set_io_thread_count(1)
+
 # Suppress RDKit's noisy C++-level warning stream; we surface our own
 # structured warnings via the project logger when a SMILES fails to parse.
 #
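One hedged caveat about the hunk above: OpenBLAS and MKL typically read `*_NUM_THREADS` when their shared library first loads, which happens at `import numpy` — earlier in this module than the `setdefault` calls. Whether the pin takes effect can therefore depend on import order elsewhere in the process. A defensive pattern (a sketch, not this repo's code) sets the environment before any numeric import:

```python
# Pin thread-count env vars *before* the first numeric import, so the BLAS
# runtime initializes its thread pool with these values, not its defaults.
import os

for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ.setdefault(var, "1")

# Only now pull in libraries that load BLAS under the hood, e.g.:
# import numpy as np  # noqa: E402  <- first numeric import goes here
```

`setdefault` (rather than direct assignment) deliberately lets an explicit environment override win, which matches the AGENTS.md note that overriding these values breaks rule 3 knowingly.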
src/pipelines/eeg_pipeline.py CHANGED
@@ -12,11 +12,13 @@ a logged WARNING), determinism (seeded ICA + sklearn RNG), traceability
 """
 from __future__ import annotations
 
+import os
 from pathlib import Path
 
 import mne
 import numpy as np
 import pandas as pd
+import pyarrow as pa
 from mne.preprocessing import ICA
 from scipy import signal as scipy_signal
 from scipy import stats as scipy_stats
@@ -25,6 +27,15 @@ from src.core.logger import get_logger
 
 logger = get_logger(__name__)
 
+# Pin BLAS / OpenMP / pyarrow to single-threaded mode so byte-determinism
+# (AGENTS.md §4 rule 3) holds across hardware. Without this, multi-threaded
+# floating-point reductions can reorder and produce non-bit-identical output.
+os.environ.setdefault("OMP_NUM_THREADS", "1")
+os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
+os.environ.setdefault("MKL_NUM_THREADS", "1")
+pa.set_cpu_count(1)
+pa.set_io_thread_count(1)
+
 # Pearson-correlation threshold for EOG-component rejection in ICA.
 # Real-world EOG components typically score 0.8-0.95 against the EOG channel;
 # 0.9 is a conservative floor that avoids false positives at the cost of
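Determinism rule 3 ("byte-identical output") is mechanically checkable by hashing the artifact from two runs. A stdlib-only sketch with a stand-in file (the path and the two-run driver are assumptions, not this repo's tooling):

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hex digest of a file's exact bytes — the right granularity for rule 3."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "eeg_features.parquet"  # stand-in for data/processed/
    digests = []
    for _ in range(2):  # two "runs" writing the same bytes
        out.write_bytes(b"parquet-bytes-stand-in")
        digests.append(sha256_of(out))
    assert digests[0] == digests[1]  # byte-identical output => equal digests
```

In CI this becomes: run the pipeline twice into separate directories, hash both outputs, and fail the job if the digests differ.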