GRL SNR-Filtering Reproducibility Package

This package contains the code, trained model checkpoints, and derived result summaries used for the manuscript:

Signal-to-Noise Filtering Creates Deployment Bias in Seismic Deep Learning

Training and test datasets are intentionally not included. Download or access them from the data sources listed below, then pass their local paths to the scripts.

Package Contents

grl_publish/
  code/
    scripts/                  # Phase-picking, dispersion, aggregation, and utility scripts
    odata/                    # Continuous-pick filtering and REAL association helper scripts
    models/                   # BRNN/PNSN model definitions
    dispnet.v2.3.py           # Dispersion model definition
  checkpoints/
    base/pnsn.v3.pt           # Base PNSN checkpoint used for transfer learning
    phase_picker/seed*/       # Fine-tuned phase-picker checkpoints for three seeds
    dispersion/seed*/         # DispNet v2.3 checkpoints for three seeds
  results/
    phase_picker/seed*/       # Per-seed phase-picking summaries and training logs
    dispersion/seed*/         # Per-seed dispersion summaries and training logs
    multiseed/                # Three-seed aggregate tables
    continuous_association/   # Final SNR-vs-confidence association summaries
    bootstrap/                # Paired bootstrap inputs and confidence-interval tables
    manuscript_figures/       # Final figures used by the manuscript
  DATASETS.bib                # Dataset citations
  requirements.txt            # Minimal Python package list
  CHECKSUMS.sha256            # SHA-256 checksums for package files

No waveform HDF5 files, continuous pick JSONL files, station metadata, or other large data products are included.

Verify file integrity with:

shasum -a 256 -c CHECKSUMS.sha256
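
On systems without shasum, GNU coreutils offers the equivalent `sha256sum -c CHECKSUMS.sha256`. Where neither tool is available, the same check can be sketched in Python. This is a minimal sketch that assumes the manifest uses the usual `<hex digest>  <relative path>` line format; the function names are illustrative and are not part of this package.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> list[str]:
    """Return the listed paths that are missing or whose digest differs."""
    failures = []
    root = manifest.parent
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        # Assumed line format: "<hex digest>  <relative path>"
        expected, name = line.split(maxsplit=1)
        name = name.removeprefix("*")  # a leading '*' marks binary mode
        target = root / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures


# Usage, run from the package root:
#   bad = verify_manifest(Path("CHECKSUMS.sha256"))
#   print("all files OK" if not bad else f"mismatched or missing: {bad}")
```

An empty result means every listed file is present and matches its recorded digest.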

Data Sources

Use these data products to reproduce the analyses.

Continuous Waveform And Association Data

Dataset: SeismicX-Cont (Revision 96367f8)
URL: https://huggingface.co/datasets/cangyeone/SeismicX-Cont
DOI: 10.57967/hf/9006
Used for the two-day continuous association diagnostic and associated annotations.

Ambient-Noise Dispersion Data

Dataset: SeisDispFusion-NCF (Revision afcd805)
URL: https://huggingface.co/datasets/cangyeone/SeisDispFusion-NCF
DOI: 10.57967/hf/9114
Used for the NCF dispersion-estimation SNR filtering experiment.

CREDIT-X1local

Article: CREDIT-X1local: A reference dataset for machine learning seismology from ChinArray in Southwest China
DOI: 10.1016/j.eqs.2024.01.018
URL: https://www.equsci.org.cn/en/article/doi/10.1016/j.eqs.2024.01.018
Used for the phase-picking SNR transfer experiment.

BibTeX entries are provided in DATASETS.bib.

Recommended Citation

Please cite the GRL manuscript and the archived reproducibility package when reusing project-controlled materials from this archive:

@misc{yu2026seismicsnrfilteringbias,
  author = {Yu, Ziye},
  title = {seismic-snr-filtering-bias},
  year = {2026},
  url = {https://huggingface.co/cangyeone/seismic-snr-filtering-bias},
  doi = {10.57967/hf/9115},
  publisher = {Hugging Face},
  note = {CC BY 4.0}
}

Data sources used by the analyses should also be cited:

SeismicX-Cont, DOI 10.57967/hf/9006
SeisDispFusion-NCF, DOI 10.57967/hf/9114
CREDIT-X1local / Li et al. (2024), DOI 10.1016/j.eqs.2024.01.018

License

Unless otherwise noted, the project-controlled materials in this archive (code, model checkpoints, derived outputs, summary tables, figure assets, and reproducibility notes) are released under the Creative Commons Attribution 4.0 International license (CC BY 4.0).

Under CC BY 4.0, reuse is permitted with appropriate credit, a link to the license, and indication of whether changes were made.

License: https://creativecommons.org/licenses/by/4.0/

SPDX-License-Identifier: CC-BY-4.0

Raw datasets that are not redistributed in this archive should be obtained from and cited through their original repositories or publications. This archive does not relicense third-party raw datasets or upstream waveform products.

Environment

Create an environment with Python 3.10 or newer, then install the core packages:

pip install -r requirements.txt

The phase-picking experiment can use Apple MPS if available. The dispersion script was run on CPU for the reported three-seed results.

Reproduce Phase-Picking Training

Expected local inputs:

CREDIT-X1local waveform file, for example /path/to/credit-x1.h5
CREDIT split keys, for example /path/to/creditkeys.npz
Base checkpoint included here: checkpoints/base/pnsn.v3.pt

Run the three seeds:

for seed in 20260609 20260610 20260611; do
  python code/scripts/snr_transfer_experiment.py \
    --h5 /path/to/credit-x1.h5 \
    --keys /path/to/creditkeys.npz \
    --base-ckpt checkpoints/base/pnsn.v3.pt \
    --out-dir results/phase_picker/seed${seed}_rerun \
    --seed ${seed}
done

The manuscript run used:

--train-steps 2000
--train-batch 16
--eval-samples 10000
train records per condition: 60,837
conditions: full matched, SNR>5 dB matched, SNR>10 dB matched
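
The identical per-condition record count suggests that each training condition is drawn to a common size, so that differences between conditions reflect SNR filtering and not training-set size. The sketch below illustrates that idea only. It assumes "matched" means size-matched random subsampling, and the function name, the `snr_db` field, and the toy records are hypothetical; code/scripts/snr_transfer_experiment.py may build its conditions differently.

```python
import random


def matched_subsample(records, size, seed):
    """Draw `size` records without replacement using a seeded generator."""
    if size > len(records):
        raise ValueError("requested more records than are available")
    return random.Random(seed).sample(records, size)


# Toy records with a hypothetical per-record SNR in dB.
records = [{"key": i, "snr_db": float(i % 20)} for i in range(1000)]

# Candidate pools for the three conditions named above.
pools = {
    "full": records,
    "snr_gt_5": [r for r in records if r["snr_db"] > 5.0],
    "snr_gt_10": [r for r in records if r["snr_db"] > 10.0],
}

# Match every condition to the size of the smallest pool.
common_size = min(len(pool) for pool in pools.values())
conditions = {
    name: matched_subsample(pool, common_size, seed=20260609)
    for name, pool in pools.items()
}
```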

The published trained checkpoints are already included under checkpoints/phase_picker/seed*/.

Reproduce Dispersion Training

Expected local input:

SeisDispFusion-NCF HDF5, for example /path/to/ncf_disp_dataset_with_disp_image.h5

Run the three seeds:

for seed in 20260609 20260610 20260611; do
  python code/scripts/disp_snr_transfer_experiment.py \
    --h5 /path/to/ncf_disp_dataset_with_disp_image.h5 \
    --out-dir results/dispersion/seed${seed}_rerun \
    --seed ${seed} \
    --epochs 5 \
    --batch-size 256 \
    --device cpu
done

The manuscript run used:

train samples per condition: 11,033
test samples: 8,292
conditions: full matched, SNR>3.04 dB matched, SNR>6.77 dB matched
learning rate: 2e-4
optimizer: AdamW

The published trained checkpoints are already included under checkpoints/dispersion/seed*/.

Aggregate The Three Seeds

After rerunning both tasks, aggregate the per-seed summaries:

python code/scripts/grl_aggregate_multiseed.py \
  --seeds 20260609 20260610 20260611 \
  --out-dir results/multiseed_rerun

The manuscript aggregate outputs are included in results/multiseed/.

Reported training-task results:

Task	Condition	Main Metric
Phase picking	Full matched	Mean F1 = `0.744 +/- 0.005`
Phase picking	SNR>5 dB matched	Mean F1 = `0.735 +/- 0.006`
Phase picking	SNR>10 dB matched	Mean F1 = `0.699 +/- 0.007`
Dispersion	Full matched	MAE = `0.0462 +/- 0.0008 km/s`
Dispersion	SNR>3.04 dB matched	MAE = `0.0479 +/- 0.0014 km/s`
Dispersion	SNR>6.77 dB matched	MAE = `0.0510 +/- 0.0004 km/s`

The +/- values are sample standard deviations across three seeds. Paired bootstrap tables for the deterministic seed-20260609 evaluation are included in results/bootstrap/.
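
For reference, a mean with a sample standard deviation (n - 1 denominator) over three seeds can be computed as sketched below. The per-seed values are placeholders chosen for illustration and are not taken from the package; the actual aggregation is done by code/scripts/grl_aggregate_multiseed.py.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) for per-seed metric values."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std, divides by n - 1
    return mean, std


# Placeholder per-seed metric values (hypothetical, for illustration only).
per_seed_metric = {20260609: 0.70, 20260610: 0.72, 20260611: 0.74}

mean, std = summarize(list(per_seed_metric.values()))
print(f"{mean:.3f} +/- {std:.3f}")
```

With only three seeds the sample standard deviation is a coarse spread estimate, which is one reason the sample-level bootstrap checks are reported separately.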

Sample-level paired bootstrap checks:

Task	Comparison	Difference vs full matched	95% CI
Phase picking	SNR>5 dB mean F1	`-0.0076`	`[-0.0097, -0.0055]`
Phase picking	SNR>10 dB mean F1	`-0.0470`	`[-0.0498, -0.0441]`
Dispersion	SNR>3.04 dB MAE	`+0.00315 km/s`	`[+0.00276, +0.00353]`
Dispersion	SNR>6.77 dB MAE	`+0.00479 km/s`	`[+0.00428, +0.00530]`

These bootstrap intervals use fixed seed-20260609 trained checkpoints and shared deterministic test samples. They do not replace the three-seed summaries.
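
A paired percentile bootstrap of this kind can be sketched as follows. The sketch assumes a per-sample metric whose mean is the reported statistic and that both models are scored on the same test samples in the same order. The function name, the resample count, and the toy inputs are illustrative and do not describe the exact procedure used for the manuscript tables.

```python
import random


def paired_bootstrap_ci(baseline, filtered, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(filtered - baseline), resampling test samples in pairs."""
    if len(baseline) != len(filtered):
        raise ValueError("both score lists must cover the same test samples")
    rng = random.Random(seed)
    # Differencing first keeps each sample's two scores together when resampling.
    diffs = [f - b for b, f in zip(baseline, filtered)]
    n = len(diffs)
    point = sum(diffs) / n
    boot_means = []
    for _ in range(n_boot):
        resample = rng.choices(diffs, k=n)  # draw n samples with replacement
        boot_means.append(sum(resample) / n)
    boot_means.sort()
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, (lower, upper)


# Toy per-sample scores (hypothetical): the filtered model is slightly worse.
toy_rng = random.Random(1)
baseline_scores = [toy_rng.uniform(0.6, 0.9) for _ in range(500)]
filtered_scores = [s - toy_rng.uniform(0.0, 0.05) for s in baseline_scores]

point, (lower, upper) = paired_bootstrap_ci(baseline_scores, filtered_scores)
```

Because the same test samples enter both sides of each difference, sample-to-sample difficulty cancels, which is why the paired intervals are much narrower than the across-seed spread.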

Continuous Association Diagnostic

Code for filtering continuous picks and running REAL association is in code/odata/.

The manuscript diagnostic used:

days: 2019-07-06 and 2021-11-13
pick file: PNSN v3 5120-sample continuous picker output from SeismicX-Cont-derived processing
SNR conversion: 10*log10(SNR ratio)
SNR threshold: 4.25 dB
retained picks: 576,875
confidence baseline: top 576,875 picks by phase_prob, with no extra filters
REAL -R: 0.4/25/0.05/3/5
REAL -S: 4/2/3/2/1.0/0.1/1.0
no non-maximum suppression
no association-stage probability threshold
true-positive event: origin-time error <= 5 s and epicentral error <= 30 km
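
The two pick-selection rules above can be sketched as follows: an SNR filter at 4.25 dB after converting with 10*log10, and a confidence baseline that keeps the same number of picks ranked by phase_prob. This is a minimal sketch. The `snr_ratio` field name, the function names, and the toy picks are assumptions; the helpers in code/odata/ define the actual pick format.

```python
import math


def snr_db(snr_ratio):
    """Convert an SNR ratio to decibels using 10*log10(ratio)."""
    return 10.0 * math.log10(snr_ratio)


def snr_filter(picks, threshold_db=4.25):
    """Keep picks whose SNR in dB meets the threshold."""
    return [
        p for p in picks
        if p["snr_ratio"] > 0 and snr_db(p["snr_ratio"]) >= threshold_db
    ]


def top_confidence(picks, count):
    """Keep the `count` picks with the highest phase probability."""
    return sorted(picks, key=lambda p: p["phase_prob"], reverse=True)[:count]


# Toy picks (hypothetical values).
picks = [
    {"id": "a", "snr_ratio": 1.5, "phase_prob": 0.95},   # about 1.76 dB
    {"id": "b", "snr_ratio": 3.0, "phase_prob": 0.40},   # about 4.77 dB
    {"id": "c", "snr_ratio": 10.0, "phase_prob": 0.80},  # 10 dB
    {"id": "d", "snr_ratio": 2.0, "phase_prob": 0.90},   # about 3.01 dB
]

snr_kept = snr_filter(picks)
# Match the baseline to the SNR-filtered count so both feed the same
# number of picks to the associator.
conf_kept = top_confidence(picks, len(snr_kept))
```

In this toy case the two rules select disjoint picks, which illustrates how a low-SNR but high-confidence pick is dropped by the SNR filter and kept by the confidence baseline.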

Final summaries are included under results/continuous_association/.

Reported association results:

Filter	Retained picks	Associated TP events	Reference events	Event recall
SNR >= 4.25 dB	`576,875`	`1,301`	`2,340`	`0.556`
Top phase probability	`576,875`	`1,561`	`2,340`	`0.667`

A paired bootstrap over the 2,340 reference catalog events gives an SNR-minus-confidence event-recall difference of -0.111 with 95% CI [-0.127, -0.094].
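
The recall values and their difference follow directly from the counts in the table. The short sketch below reproduces that arithmetic and expresses the true-positive criterion as a predicate; the function names are illustrative and are not taken from the package.

```python
def is_true_positive(origin_time_error_s, epicentral_error_km):
    """Apply the stated criterion: <= 5 s origin-time error and <= 30 km epicentral error."""
    return abs(origin_time_error_s) <= 5.0 and epicentral_error_km <= 30.0


def event_recall(true_positive_events, reference_events):
    """Fraction of reference catalog events recovered by association."""
    return true_positive_events / reference_events


snr_recall = event_recall(1301, 2340)
conf_recall = event_recall(1561, 2340)
difference = snr_recall - conf_recall

print(f"{snr_recall:.3f} {conf_recall:.3f} {difference:+.3f}")
# prints: 0.556 0.667 -0.111
```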

Regenerate Manuscript Figures

The final manuscript figures are included in results/manuscript_figures/.

To regenerate Figures 2 and 3 from the saved summaries, run the script below from the original project layout, or adapt the paths inside it first:

python code/odata/make_grl_figures.py

That script expects the Overleaf manuscript directory and parent-project output layout used during analysis. For a standalone archive, the plotted summary CSV/JSON files in results/ are the stable reproduction artifacts.

Notes On Reproducibility Boundaries

This package includes code, checkpoints, per-seed summaries, training logs, and manuscript-facing figures.
It does not include the large data files needed to rerun training or association.
It does not include paired per-sample prediction outputs for bootstrap confidence intervals.
Subset-specific reference dispersion curves affect DispNet training parameterization only; all dispersion models are evaluated against the same unfiltered test samples, phase-velocity labels, valid masks, and period grids.
