GRL SNR-Filtering Reproducibility Package

This package contains the code, trained model checkpoints, and derived result summaries used for the manuscript:

Signal-to-Noise Filtering Creates Deployment Bias in Seismic Deep Learning

Training and test datasets are intentionally not included. Download or access them from the data sources listed below, then pass their local paths to the scripts.

Package Contents

grl_publish/
  code/
    scripts/                  # Phase-picking, dispersion, aggregation, and utility scripts
    odata/                    # Continuous-pick filtering and REAL association helper scripts
    models/                   # BRNN/PNSN model definitions
    dispnet.v2.3.py           # Dispersion model definition
  checkpoints/
    base/pnsn.v3.pt           # Base PNSN checkpoint used for transfer learning
    phase_picker/seed*/       # Fine-tuned phase-picker checkpoints for three seeds
    dispersion/seed*/         # DispNet v2.3 checkpoints for three seeds
  results/
    phase_picker/seed*/       # Per-seed phase-picking summaries and training logs
    dispersion/seed*/         # Per-seed dispersion summaries and training logs
    multiseed/                # Three-seed aggregate tables
    continuous_association/   # Final SNR-vs-confidence association summaries
    bootstrap/                # Paired bootstrap inputs and confidence-interval tables
    manuscript_figures/       # Final figures used by the manuscript
  DATASETS.bib                # Dataset citations
  requirements.txt            # Minimal Python package list
  CHECKSUMS.sha256            # SHA-256 checksums for package files

No waveform HDF5 files, continuous pick JSONL files, station metadata, or other large data products are included.

Verify file integrity with:

shasum -a 256 -c CHECKSUMS.sha256
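
On systems without shasum, the same check can be sketched in Python. This is a minimal sketch, assuming CHECKSUMS.sha256 uses the usual shasum layout (hex digest, whitespace, relative path) and that paths are relative to the directory holding the manifest; the function names are illustrative and are not part of the package.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> list[str]:
    """Return the listed paths that are missing or whose digest differs."""
    failures = []
    root = manifest.parent  # assumption: paths are relative to the manifest
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # shasum marks binary mode with a leading '*'
        target = root / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

Calling verify_manifest(Path("CHECKSUMS.sha256")) from the package root returns an empty list when every listed file matches.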

Data Sources

Use these data products to reproduce the analyses.

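Continuous Waveform And Association Data

  • SeismicX-Cont, DOI 10.57967/hf/9006

Ambient-Noise Dispersion Data

  • SeisDispFusion-NCF, DOI 10.57967/hf/9114

CREDIT-X1local

  • Phase-picking waveforms: Li et al. (2024), DOI 10.1016/j.eqs.2024.01.018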

BibTeX entries are provided in DATASETS.bib.

Recommended Citation

Please cite the GRL manuscript and the archived reproducibility package when reusing project-controlled materials from this archive:

@misc{yu2026seismicsnrfilteringbias,
  author = {Yu, Ziye},
  title = {seismic-snr-filtering-bias},
  year = {2026},
  url = {https://huggingface.co/cangyeone/seismic-snr-filtering-bias},
  doi = {10.57967/hf/9115},
  publisher = {Hugging Face},
  note = {CC BY 4.0}
}

Data sources used by the analyses should also be cited:

  • SeismicX-Cont, DOI 10.57967/hf/9006
  • SeisDispFusion-NCF, DOI 10.57967/hf/9114
  • CREDIT-X1local / Li et al. (2024), DOI 10.1016/j.eqs.2024.01.018

License

Unless otherwise noted, the project-controlled materials included in this archive (code, model checkpoints, derived outputs, summary tables, figure assets, and reproducibility notes) are released under the Creative Commons Attribution 4.0 International license (CC BY 4.0).

Under CC BY 4.0, reuse is permitted with appropriate credit, a link to the license, and indication of whether changes were made.

License: https://creativecommons.org/licenses/by/4.0/

SPDX-License-Identifier: CC-BY-4.0

Raw datasets that are not redistributed in this archive should be obtained from and cited through their original repositories or publications. This archive does not relicense third-party raw datasets or upstream waveform products.

Environment

Create an environment with Python 3.10 or newer, then install the core packages:

pip install -r requirements.txt

The phase-picking experiment can use Apple MPS if available. The dispersion script was run on CPU for the reported three-seed results.
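
A device choice of this kind can be sketched as below. This is an illustration only: it assumes PyTorch is among the installed packages, and pick_device is a hypothetical helper, not a function from the package scripts.

```python
def pick_device(prefer_mps: bool = True) -> str:
    """Return "mps" when PyTorch reports Apple MPS support, else "cpu"."""
    try:
        import torch  # assumption: installed via requirements.txt
    except ImportError:
        return "cpu"
    # torch.backends.mps exists in PyTorch 1.12 and newer; guard for older builds.
    mps = getattr(torch.backends, "mps", None)
    if prefer_mps and mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

Passing prefer_mps=False forces CPU, which mirrors how the dispersion results were produced.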

Reproduce Phase-Picking Training

Expected local inputs:

  • CREDIT-X1local waveform file, for example /path/to/credit-x1.h5
  • CREDIT split keys, for example /path/to/creditkeys.npz
  • Base checkpoint included here: checkpoints/base/pnsn.v3.pt

Run the three seeds:

for seed in 20260609 20260610 20260611; do
  python code/scripts/snr_transfer_experiment.py \
    --h5 /path/to/credit-x1.h5 \
    --keys /path/to/creditkeys.npz \
    --base-ckpt checkpoints/base/pnsn.v3.pt \
    --out-dir results/phase_picker/seed${seed}_rerun \
    --seed ${seed}
done

The manuscript run used:

  • --train-steps 2000
  • --train-batch 16
  • --eval-samples 10000
  • train records per condition: 60,837
  • conditions: full matched, SNR>5 dB matched, SNR>10 dB matched
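
To pin the manuscript settings explicitly rather than rely on script defaults, the command for each seed can be assembled as in the sketch below. The flags and values are the ones listed above; the helper name phase_picker_command is hypothetical, and whether the script's defaults already match these values is not stated here.

```python
import sys

SEEDS = (20260609, 20260610, 20260611)


def phase_picker_command(seed: int, h5_path: str, keys_path: str) -> list[str]:
    """Build the argv list for one phase-picking run with manuscript settings."""
    return [
        sys.executable, "code/scripts/snr_transfer_experiment.py",
        "--h5", h5_path,
        "--keys", keys_path,
        "--base-ckpt", "checkpoints/base/pnsn.v3.pt",
        "--out-dir", f"results/phase_picker/seed{seed}_rerun",
        "--seed", str(seed),
        "--train-steps", "2000",
        "--train-batch", "16",
        "--eval-samples", "10000",
    ]


commands = [
    phase_picker_command(seed, "/path/to/credit-x1.h5", "/path/to/creditkeys.npz")
    for seed in SEEDS
]
```

Each entry in commands can then be passed to subprocess.run(cmd, check=True) from the package root.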

The published trained checkpoints are already included under checkpoints/phase_picker/seed*/.

Reproduce Dispersion Training

Expected local input:

  • SeisDispFusion-NCF HDF5, for example /path/to/ncf_disp_dataset_with_disp_image.h5

Run the three seeds:

for seed in 20260609 20260610 20260611; do
  python code/scripts/disp_snr_transfer_experiment.py \
    --h5 /path/to/ncf_disp_dataset_with_disp_image.h5 \
    --out-dir results/dispersion/seed${seed}_rerun \
    --seed ${seed} \
    --epochs 5 \
    --batch-size 256 \
    --device cpu
done

The manuscript run used:

  • train samples per condition: 11,033
  • test samples: 8,292
  • conditions: full matched, SNR>3.04 dB matched, SNR>6.77 dB matched
  • learning rate: 2e-4
  • optimizer: AdamW

The published trained checkpoints are already included under checkpoints/dispersion/seed*/.

Aggregate The Three Seeds

After rerunning both tasks, aggregate the per-seed summaries:

python code/scripts/grl_aggregate_multiseed.py \
  --seeds 20260609 20260610 20260611 \
  --out-dir results/multiseed_rerun

The manuscript aggregate outputs are included in results/multiseed/.

Reported training-task results:

Task           Condition            Main metric
-------------  -------------------  -----------------------------
Phase picking  Full matched         Mean F1 = 0.744 +/- 0.005
Phase picking  SNR>5 dB matched     Mean F1 = 0.735 +/- 0.006
Phase picking  SNR>10 dB matched    Mean F1 = 0.699 +/- 0.007
Dispersion     Full matched         MAE = 0.0462 +/- 0.0008 km/s
Dispersion     SNR>3.04 dB matched  MAE = 0.0479 +/- 0.0014 km/s
Dispersion     SNR>6.77 dB matched  MAE = 0.0510 +/- 0.0004 km/s

The +/- values are sample standard deviations across three seeds. Paired bootstrap tables for the deterministic seed-20260609 evaluation are included in results/bootstrap/.
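
The sample standard deviation divides by n - 1 (here 2) rather than n. The sketch below shows the calculation on made-up per-seed values; these numbers are purely illustrative and are not taken from the per-seed summaries.

```python
from statistics import mean, pstdev, stdev

# Hypothetical per-seed metric values, for illustration only.
per_seed_values = [0.70, 0.72, 0.74]

seed_mean = mean(per_seed_values)
sample_std = stdev(per_seed_values)       # divides by n - 1 (used for +/-)
population_std = pstdev(per_seed_values)  # divides by n (smaller, not used)

summary = f"{seed_mean:.3f} +/- {sample_std:.3f}"
print(summary)
```

With only three seeds the two estimators differ noticeably, so it matters which one a reproduced table uses.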

Sample-level paired bootstrap checks:

Task           Comparison         Difference vs full/confidence  95% CI
-------------  -----------------  -----------------------------  --------------------
Phase picking  SNR>5 dB mean F1   -0.0076                        [-0.0097, -0.0055]
Phase picking  SNR>10 dB mean F1  -0.0470                        [-0.0498, -0.0441]
Dispersion     SNR>3.04 dB MAE    +0.00315 km/s                  [+0.00276, +0.00353]
Dispersion     SNR>6.77 dB MAE    +0.00479 km/s                  [+0.00428, +0.00530]

These bootstrap intervals use fixed seed-20260609 trained checkpoints and shared deterministic test samples. They do not replace the three-seed summaries.
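
For readers unfamiliar with the technique, a paired bootstrap on shared test samples can be sketched as follows. This is a generic percentile bootstrap under stated assumptions: the number of resamples, the percentile method, and the function name are illustrative choices and are not confirmed to match the scripts that produced the tables.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(scores_a - scores_b) over paired samples.

    scores_a[i] and scores_b[i] must refer to the same test sample, so each
    resample draws whole pairs (that is, per-sample differences).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired inputs must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    boot_means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return fmean(diffs), (lower, upper)
```

Resampling differences rather than the two score lists independently is what makes the interval paired: per-sample difficulty cancels out of each difference.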

Continuous Association Diagnostic

Code for filtering continuous picks and running REAL association is in code/odata/.

The manuscript diagnostic used:

  • days: 2019-07-06 and 2021-11-13
  • pick file: PNSN v3 5120-sample continuous picker output from SeismicX-Cont-derived processing
  • SNR conversion: 10*log10(SNR ratio)
  • SNR threshold: 4.25 dB
  • retained picks: 576,875
  • confidence baseline: top 576,875 picks by phase_prob, with no extra filters
  • REAL -R: 0.4/25/0.05/3/5
  • REAL -S: 4/2/3/2/1.0/0.1/1.0
  • no non-maximum suppression
  • no association-stage probability threshold
  • true-positive event: origin-time error <= 5 s and epicentral error <= 30 km
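
The two pick-selection rules in the list above can be sketched as below. This is a minimal sketch, assuming each pick is a dict; the key "snr" (a linear ratio) is a hypothetical field name, while phase_prob is the field named above.

```python
import math


def snr_db(snr_ratio: float) -> float:
    """Convert a linear SNR ratio to decibels as 10*log10(ratio)."""
    return 10.0 * math.log10(snr_ratio)


def filter_by_snr(picks, threshold_db=4.25):
    """Keep picks whose SNR in dB meets the threshold."""
    return [
        p for p in picks
        if p["snr"] > 0 and snr_db(p["snr"]) >= threshold_db
    ]


def top_by_confidence(picks, n_keep):
    """Keep the n_keep picks with the highest phase_prob, with no other filter."""
    return sorted(picks, key=lambda p: p["phase_prob"], reverse=True)[:n_keep]


# Toy picks for illustration only; "snr" is an assumed key name.
picks = [
    {"snr": 10.0, "phase_prob": 0.2},
    {"snr": 2.0, "phase_prob": 0.9},
    {"snr": 3.0, "phase_prob": 0.5},
]
snr_kept = filter_by_snr(picks)
conf_kept = top_by_confidence(picks, len(snr_kept))
```

Matching n_keep to the number of SNR-retained picks is what makes the two filters comparable at equal pick budget.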

Final summaries are included under results/continuous_association/.

Reported association results:

Filter                 Retained picks  Associated TP events  Reference events  Event recall
---------------------  --------------  --------------------  ----------------  ------------
SNR >= 4.25 dB         576,875         1,301                 2,340             0.556
Top phase probability  576,875         1,561                 2,340             0.667

A paired bootstrap over the 2,340 reference catalog events gives an SNR-minus-confidence event-recall difference of -0.111 with 95% CI [-0.127, -0.094].
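
The true-positive criterion and the recall arithmetic can be checked with a short sketch. The haversine distance, the spherical Earth radius, and the dict keys (time in seconds, lat and lon in degrees) are assumptions made for illustration; how the package code computes epicentral error is not stated here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth


def epicentral_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two epicenters in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_true_positive(det, ref, max_dt_s=5.0, max_dist_km=30.0):
    """Origin-time error <= 5 s and epicentral error <= 30 km."""
    dt = abs(det["time"] - ref["time"])
    dist = epicentral_distance_km(det["lat"], det["lon"], ref["lat"], ref["lon"])
    return dt <= max_dt_s and dist <= max_dist_km


def event_recall(n_true_positive: int, n_reference: int) -> float:
    """Fraction of reference events matched by an associated event."""
    return n_true_positive / n_reference


recall_snr = event_recall(1301, 2340)
recall_conf = event_recall(1561, 2340)
recall_diff = recall_snr - recall_conf
```

Rounded to three decimals, these give 0.556, 0.667, and -0.111, matching the reported values.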

Regenerate Manuscript Figures

The final manuscript figures are included in results/manuscript_figures/.

To regenerate Figure 2 and Figure 3 from the saved summaries, run the following command from the original project layout, or adapt the paths in the script first:

python code/odata/make_grl_figures.py

That script expects the Overleaf manuscript directory and parent-project output layout used during analysis. For a standalone archive, the plotted summary CSV/JSON files in results/ are the stable reproduction artifacts.

Notes On Reproducibility Boundaries

  • This package includes code, checkpoints, per-seed summaries, training logs, and manuscript-facing figures.
  • It does not include the large data files needed to rerun training or association.
  • It does not include paired per-sample prediction outputs for bootstrap confidence intervals.
  • Subset-specific reference dispersion curves affect DispNet training parameterization only; all dispersion models are evaluated against the same unfiltered test samples, phase-velocity labels, valid masks, and period grids.
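
As an illustration of scoring against shared labels and valid masks, a masked mean absolute error can be sketched as below. This is one plausible reading, not the package's confirmed metric code: the function name, the nested-list layout (one row per sample, one entry per period), and the pooling over all valid entries are assumptions.

```python
def masked_mae(pred, label, valid):
    """Mean absolute error over entries whose valid flag is truthy.

    pred, label, and valid are equally shaped nested sequences:
    one row per test sample, one entry per period on the shared grid.
    """
    total = 0.0
    count = 0
    for p_row, y_row, m_row in zip(pred, label, valid):
        for p, y, m in zip(p_row, y_row, m_row):
            if m:
                total += abs(p - y)
                count += 1
    if count == 0:
        raise ValueError("no valid entries to score")
    return total / count


# Toy phase velocities in km/s, for illustration only.
example = masked_mae(
    pred=[[3.0, 3.5, 9.9]],
    label=[[3.1, 3.5, 0.0]],
    valid=[[1, 1, 0]],  # the third period is masked out
)
```

Because the labels, masks, and period grid are the same for every model, differences in this score reflect the models rather than the evaluation set.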