GRL SNR-Filtering Reproducibility Package
This package contains the code, trained model checkpoints, and derived result summaries used for the manuscript:
Signal-to-Noise Filtering Creates Deployment Bias in Seismic Deep Learning
Training and test datasets are intentionally not included. Download or access them from the data sources listed below, then pass their local paths to the scripts.
Package Contents
grl_publish/
  code/
    scripts/                  # Phase-picking, dispersion, aggregation, and utility scripts
    odata/                    # Continuous-pick filtering and REAL association helper scripts
    models/                   # BRNN/PNSN model definitions
    dispnet.v2.3.py           # Dispersion model definition
  checkpoints/
    base/pnsn.v3.pt           # Base PNSN checkpoint used for transfer learning
    phase_picker/seed*/       # Fine-tuned phase-picker checkpoints for three seeds
    dispersion/seed*/         # DispNet v2.3 checkpoints for three seeds
  results/
    phase_picker/seed*/       # Per-seed phase-picking summaries and training logs
    dispersion/seed*/         # Per-seed dispersion summaries and training logs
    multiseed/                # Three-seed aggregate tables
    continuous_association/   # Final SNR-vs-confidence association summaries
    bootstrap/                # Paired bootstrap inputs and confidence-interval tables
    manuscript_figures/       # Final figures used by the manuscript
  DATASETS.bib                # Dataset citations
  requirements.txt            # Minimal Python package list
  CHECKSUMS.sha256            # SHA-256 checksums for package files
No waveform HDF5 files, continuous pick JSONL files, station metadata, or other large data products are included.
Verify file integrity with:
shasum -a 256 -c CHECKSUMS.sha256
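Where shasum is not available, the same check can be sketched in Python with the standard library. This is a minimal sketch, not part of the package: it assumes CHECKSUMS.sha256 uses the usual shasum layout of one "<hash>  <relative path>" entry per line, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root="."):
    """Return the listed paths that are missing or whose digest differs.

    Assumes each non-empty line is "<hash>  <path>" or "<hash> *<path>",
    the layout written by shasum; this layout is an assumption here.
    """
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        name = name.lstrip(" *")  # drop the separator and optional binary marker
        target = Path(root) / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

Calling verify_manifest("CHECKSUMS.sha256") from the package root returns an empty list when every listed file is present and matches.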
Data Sources
Use these data products to reproduce the analyses.
Continuous Waveform And Association Data
- Dataset: SeismicX-Cont (Revision 96367f8)
- URL: https://huggingface.co/datasets/cangyeone/SeismicX-Cont
- DOI: 10.57967/hf/9006
- Used for the two-day continuous association diagnostic and associated annotations.
Ambient-Noise Dispersion Data
- Dataset: SeisDispFusion-NCF (Revision afcd805)
- URL: https://huggingface.co/datasets/cangyeone/SeisDispFusion-NCF
- DOI: 10.57967/hf/9114
- Used for the NCF dispersion-estimation SNR filtering experiment.
CREDIT-X1local
- Article: CREDIT-X1local: A reference dataset for machine learning seismology from ChinArray in Southwest China
- DOI: 10.1016/j.eqs.2024.01.018
- URL: https://www.equsci.org.cn/en/article/doi/10.1016/j.eqs.2024.01.018
- Used for the phase-picking SNR transfer experiment.
BibTeX entries are provided in DATASETS.bib.
Recommended Citation
Please cite the GRL manuscript and the archived reproducibility package when reusing project-controlled materials from this archive:
@misc{yu2026seismicsnrfilteringbias,
author = {Yu, Ziye},
title = {seismic-snr-filtering-bias},
year = {2026},
url = {https://huggingface.co/cangyeone/seismic-snr-filtering-bias},
doi = {10.57967/hf/9115},
publisher = {Hugging Face},
note = {CC BY 4.0}
}
Data sources used by the analyses should also be cited:
- SeismicX-Cont, DOI 10.57967/hf/9006
- SeisDispFusion-NCF, DOI 10.57967/hf/9114
- CREDIT-X1local / Li et al. (2024), DOI 10.1016/j.eqs.2024.01.018
License
Unless otherwise noted, the project-controlled code, derived outputs, summary tables, figure assets, model checkpoints included in this archive, and reproducibility notes are released under the Creative Commons Attribution 4.0 International license (CC BY 4.0).
Under CC BY 4.0, reuse is permitted with appropriate credit, a link to the license, and indication of whether changes were made.
License: https://creativecommons.org/licenses/by/4.0/
SPDX-License-Identifier: CC-BY-4.0
Raw datasets that are not redistributed in this archive should be obtained from and cited through their original repositories or publications. This archive does not relicense third-party raw datasets or upstream waveform products.
Environment
Create an environment with Python 3.10 or newer, then install the core packages:
pip install -r requirements.txt
The phase-picking experiment can use Apple MPS if available. The dispersion script was run on CPU for the reported three-seed results.
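A device choice along these lines can be sketched as follows. This is an illustrative helper, not code from the package: it assumes PyTorch is the framework installed by requirements.txt, and pick_device is a hypothetical name. How the phase-picking script itself selects a device is not documented here.

```python
def pick_device(prefer_mps=True):
    """Return "mps" when PyTorch reports an Apple MPS backend, else "cpu"."""
    try:
        import torch  # assumed to come from requirements.txt
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to select
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if prefer_mps and mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

Passing prefer_mps=False forces CPU, which mirrors the CPU setting used for the reported dispersion runs.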
Reproduce Phase-Picking Training
Expected local inputs:
- CREDIT-X1local waveform file, for example /path/to/credit-x1.h5
- CREDIT split keys, for example /path/to/creditkeys.npz
- Base checkpoint included here: checkpoints/base/pnsn.v3.pt
Run the three seeds:
for seed in 20260609 20260610 20260611; do
  python code/scripts/snr_transfer_experiment.py \
    --h5 /path/to/credit-x1.h5 \
    --keys /path/to/creditkeys.npz \
    --base-ckpt checkpoints/base/pnsn.v3.pt \
    --out-dir results/phase_picker/seed${seed}_rerun \
    --seed ${seed}
done
The manuscript run used:
- --train-steps 2000
- --train-batch 16
- --eval-samples 10000
- train records per condition: 60,837
- conditions: full matched, SNR>5 dB matched, SNR>10 dB matched
The published trained checkpoints are already included under checkpoints/phase_picker/seed*/.
Reproduce Dispersion Training
Expected local input:
- SeisDispFusion-NCF HDF5, for example /path/to/ncf_disp_dataset_with_disp_image.h5
Run the three seeds:
for seed in 20260609 20260610 20260611; do
  python code/scripts/disp_snr_transfer_experiment.py \
    --h5 /path/to/ncf_disp_dataset_with_disp_image.h5 \
    --out-dir results/dispersion/seed${seed}_rerun \
    --seed ${seed} \
    --epochs 5 \
    --batch-size 256 \
    --device cpu
done
The manuscript run used:
- train samples per condition: 11,033
- test samples: 8,292
- conditions: full matched, SNR>3.04 dB matched, SNR>6.77 dB matched
- learning rate: 2e-4
- optimizer: AdamW
The published trained checkpoints are already included under checkpoints/dispersion/seed*/.
Aggregate The Three Seeds
After rerunning both tasks, aggregate the per-seed summaries:
python code/scripts/grl_aggregate_multiseed.py \
  --seeds 20260609 20260610 20260611 \
  --out-dir results/multiseed_rerun
The manuscript aggregate outputs are included in results/multiseed/.
Reported training-task results:
| Task | Condition | Main Metric |
|---|---|---|
| Phase picking | Full matched | Mean F1 = 0.744 +/- 0.005 |
| Phase picking | SNR>5 dB matched | Mean F1 = 0.735 +/- 0.006 |
| Phase picking | SNR>10 dB matched | Mean F1 = 0.699 +/- 0.007 |
| Dispersion | Full matched | MAE = 0.0462 +/- 0.0008 km/s |
| Dispersion | SNR>3.04 dB matched | MAE = 0.0479 +/- 0.0014 km/s |
| Dispersion | SNR>6.77 dB matched | MAE = 0.0510 +/- 0.0004 km/s |
The +/- values are sample standard deviations across three seeds. Paired bootstrap tables for the deterministic seed-20260609 evaluation are included in results/bootstrap/.
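The "mean +/- sample standard deviation" convention can be checked with a few lines of Python. This is a sketch only: the per-seed values below are hypothetical placeholders, not numbers from the package, and summarize is an illustrative helper rather than a function in grl_aggregate_multiseed.py.

```python
from statistics import mean, stdev


def summarize(values, digits=3):
    """Format per-seed values as mean +/- sample standard deviation (n - 1)."""
    return f"{mean(values):.{digits}f} +/- {stdev(values):.{digits}f}"


# Hypothetical per-seed scores for one condition (three seeds).
per_seed_scores = [0.70, 0.71, 0.72]
print(summarize(per_seed_scores))
```

statistics.stdev divides by n - 1, so with three seeds the spread is computed with two degrees of freedom.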
Sample-level paired bootstrap checks:
| Task | Comparison | Difference vs full/confidence | 95% CI |
|---|---|---|---|
| Phase picking | SNR>5 dB mean F1 | -0.0076 | [-0.0097, -0.0055] |
| Phase picking | SNR>10 dB mean F1 | -0.0470 | [-0.0498, -0.0441] |
| Dispersion | SNR>3.04 dB MAE | +0.00315 km/s | [+0.00276, +0.00353] |
| Dispersion | SNR>6.77 dB MAE | +0.00479 km/s | [+0.00428, +0.00530] |
These bootstrap intervals use fixed seed-20260609 trained checkpoints and shared deterministic test samples. They do not replace the three-seed summaries.
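A generic percentile paired bootstrap over shared test samples can be sketched as below. This is an assumption-laden illustration, not the manuscript's implementation: it treats the statistic as a mean of per-sample values (natural for absolute errors; an F1 comparison would instead recompute F1 from resampled counts), and the function name, resample count, and seed are hypothetical.

```python
import random


def paired_bootstrap_ci(metric_a, metric_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference a - b.

    metric_a and metric_b hold per-sample values for the same test samples,
    in the same order, so each resample keeps the pairing intact.
    """
    if len(metric_a) != len(metric_b):
        raise ValueError("paired inputs must have the same length")
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        resample = rng.choices(diffs, k=n)  # draw n pairs with replacement
        boot_means.append(sum(resample) / n)
    boot_means.sort()
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lower, upper
```

Because both models are scored on the same samples, resampling the differences removes the shared per-sample difficulty from the interval width.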
Continuous Association Diagnostic
Code for filtering continuous picks and running REAL association is in code/odata/.
The manuscript diagnostic used:
- days: 2019-07-06 and 2021-11-13
- pick file: PNSN v3 5120-sample continuous picker output from SeismicX-Cont-derived processing
- SNR conversion: 10*log10(SNR ratio)
- SNR threshold: 4.25 dB
- retained picks: 576,875
- confidence baseline: top 576,875 picks by phase_prob, with no extra filters
- REAL -R: 0.4/25/0.05/3/5
- REAL -S: 4/2/3/2/1.0/0.1/1.0
- no non-maximum suppression
- no association-stage probability threshold
- true-positive event: origin-time error <= 5 s and epicentral error <= 30 km
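The two pick selections listed above can be sketched as follows. This is a minimal illustration rather than the code in code/odata/: picks are assumed to be dictionaries, the snr_ratio field name is hypothetical, and phase_prob is the field named in the list.

```python
import math


def snr_db(snr_ratio):
    """Convert an SNR ratio to decibels using 10*log10(ratio)."""
    return 10.0 * math.log10(snr_ratio)


def snr_filter(picks, threshold_db=4.25):
    """Keep picks whose SNR in dB meets the threshold."""
    return [
        p for p in picks
        if p["snr_ratio"] > 0 and snr_db(p["snr_ratio"]) >= threshold_db
    ]


def confidence_baseline(picks, n_keep):
    """Keep the n_keep picks with the highest phase_prob, with no other filter."""
    return sorted(picks, key=lambda p: p["phase_prob"], reverse=True)[:n_keep]


# Hypothetical picks; the baseline keeps as many picks as the SNR filter does.
picks = [
    {"snr_ratio": 1.0, "phase_prob": 0.90},
    {"snr_ratio": 2.7, "phase_prob": 0.20},
    {"snr_ratio": 10.0, "phase_prob": 0.80},
    {"snr_ratio": 2.0, "phase_prob": 0.95},
]
by_snr = snr_filter(picks)
by_confidence = confidence_baseline(picks, n_keep=len(by_snr))
```

Matching the retained count keeps the comparison about which picks are kept, not how many.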
Final summaries are included under results/continuous_association/.
Reported association results:
| Filter | Retained picks | Associated TP events | Reference events | Event recall |
|---|---|---|---|---|
| SNR >= 4.25 dB | 576,875 | 1,301 | 2,340 | 0.556 |
| Top phase probability | 576,875 | 1,561 | 2,340 | 0.667 |
A paired bootstrap over the 2,340 reference catalog events gives an SNR-minus-confidence event-recall difference of -0.111 with 95% CI [-0.127, -0.094].
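The true-positive criterion and the recall arithmetic can be sketched as below. This is an illustration under stated assumptions, not the package's matching code: events are dictionaries with hypothetical time (seconds), lat, and lon fields, distance uses a spherical haversine formula, and a reference event counts as recovered if any associated event satisfies both tolerances (the manuscript's matching may be stricter, for example one-to-one).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def epicentral_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two epicenters in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_true_positive(ref, det, max_dt_s=5.0, max_dist_km=30.0):
    """Apply both tolerances: origin-time error and epicentral error."""
    close_in_time = abs(det["time"] - ref["time"]) <= max_dt_s
    close_in_space = epicentral_km(ref["lat"], ref["lon"], det["lat"], det["lon"]) <= max_dist_km
    return close_in_time and close_in_space


def event_recall(reference, detections):
    """Fraction of reference events matched by at least one detection."""
    hits = sum(any(is_true_positive(r, d) for d in detections) for r in reference)
    return hits / len(reference)


# Recall values in the table follow from the reported counts.
print(round(1301 / 2340, 3), round(1561 / 2340, 3), round((1301 - 1561) / 2340, 3))
```

The printed difference is the point estimate of the SNR-minus-confidence recall gap; the interval itself comes from resampling the reference events.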
Regenerate Manuscript Figures
The final manuscript figures are included in results/manuscript_figures/.
To regenerate Figure 2 and Figure 3 from the saved summaries, run from the original project layout or adapt the paths in:
python code/odata/make_grl_figures.py
That script expects the Overleaf manuscript directory and parent-project output layout used during analysis. For a standalone archive, the plotted summary CSV/JSON files in results/ are the stable reproduction artifacts.
Notes On Reproducibility Boundaries
- This package includes code, checkpoints, per-seed summaries, training logs, and manuscript-facing figures.
- It does not include the large data files needed to rerun training or association.
- It does not include paired per-sample prediction outputs for bootstrap confidence intervals.
- Subset-specific reference dispersion curves affect DispNet training parameterization only; all dispersion models are evaluated against the same unfiltered test samples, phase-velocity labels, valid masks, and period grids.