protodotdesign
/

Soundboard

@@ -1,157 +1,157 @@
-        ---
-        license: other
-        license_name: nscl-a2sb-and-polyform-nc
-        license_link: ./LICENSE.NSCL-A2SB
-        tags:
-          - audio
-          - audio-restoration
-          - schrodinger-bridge
-          - diffusion
-          - festival-audio
-          - non-commercial
-        library_name: pytorch
-        pipeline_tag: audio-to-audio
-        ---
-        # Soundboard
-        Schrödinger Bridge denoiser fine-tuned for festival audio restoration —
-        recovers a soundboard-style mix from heavily-corrupted audience recordings
-        (room reverb + audience-mic blend + lossy codec artifacts).
-        Fine-tuned from NVIDIA's
-        [A2SB](https://huggingface.co/nvidia/audio_to_audio_schrodinger_bridge)
-        (`/root/locutius/training/checkpoints/model.pt` split) on a synthetic-corruption training pipeline driven
-        by **profile-based augmentation** — corruption parameters are calibrated
-        from real (clean, festival-recording) pairs and sampled at training time
-        from the recovered distribution. See [Locutius](https://github.com/protodotdesign/locutius)
-        for the full corruption chain, profiling, and training scaffold.
-        ## Quick facts
-        | | |
-        |---|---|
-        | Architecture | AttnUNetF (565.5M params) |
-        | Audio format | 44.1 kHz, 2-channel, 32-bit float |
-        | Segment length | 130560 samples (2.96 s) |
-        | STFT | n_fft=2048, hop=512, window=hann |
-        | Representation | 3-channel `[mag^0.25, cos(phase), sin(phase)]` |
-        | Trained at step | 150,000 |
-        | Base checkpoint | NVIDIA A2SB `/root/locutius/training/checkpoints/model.pt` |
-        | Checkpoint size | 2.1 GB |
-        | Diffusion | Schrödinger Bridge, β_max=1.0 |
-        ## Usage
-        Load with the [Locutius](https://github.com/protodotdesign/locutius)
-        training package:
-        ```python
-        import torch
-        from huggingface_hub import hf_hub_download
-        from locutius_train.config import TrainConfig
-        from locutius_train.network import AttnUNetF, SinusoidalTemporalEmbedding
-        from locutius_train.diffusion import Diffusion
-        from locutius_train.representation import WaveformToInput, InputToWaveform
-        from locutius_train.restore import restore_spectrogram
-        ckpt_path = hf_hub_download(repo_id="protodotdesign/Soundboard", filename="model.pt")
-        sd = torch.load(ckpt_path, map_location="cuda", weights_only=False)
-        cfg = TrainConfig()
-        model = AttnUNetF(
-            n_updown_levels=cfg.model.n_updown_levels,
-            in_channels=cfg.model.in_channels,
-            hidden_channels=list(cfg.model.hidden_channels),
-            out_channels=cfg.model.out_channels,
-            emb_channels=cfg.diffusion.n_timestep_channels,
-            band_embedding_dim=cfg.model.band_embedding_dim,
-            n_attn_heads=cfg.model.n_attn_heads,
-            attention_levels=list(cfg.model.attention_levels),
-            use_attn_input_norm=cfg.model.use_attn_input_norm,
-            num_res_blocks=cfg.model.num_res_blocks,
-        ).to("cuda").eval()
-        model.load_state_dict(sd["model"])
-        ```
-        See `restore.py` in the Locutius repo for a complete CLI that takes a
-        clean source, applies the calibrated festival-corruption profile, and
-        runs the reverse Schrödinger Bridge to produce a restored output.
-        ## Calibrated corruption profile
-        This model was trained against a single calibrated profile recovered
-        from a real (studio FLAC, festival M4A) pair via per-kick local
-        Wiener deconvolution. The profile is bundled in `profile.json`:
-        ```json
-        {
-  "name": "edc_festival",
-  "ir_path": "../impulses/EchoThief/Brutalism/San Diego Supercomputer Center Outdoor Patio California.wav",
-  "delay_ms_range": [
-    15.0,
-    25.0
-  ],
-  "studio_gain_range": [
-    0.6,
-    0.7
-  ],
-  "room_gain_range": [
-    0.55,
-    0.65
-  ]
 }
-        ```
-        Each training-step corruption draws fresh values from these ranges,
-        so the model has been exposed to ~50,000 distinct delay/blend
-        combinations within the same venue character.
-        ## Training data
-        Trained on a focused subset of electronic music FLACs. **No festival
-        recordings or other licensed audio were stored or distributed** —
-        only the studio source material was used; festival-corrupted versions
-        were synthesized on-the-fly from the calibrated profile during each
-        training step.
-        ## Limitations
-        - **Single profile**: trained against one calibrated venue (`edc_festival`).
-          Performance on festival recordings from very different venues / mix
-          chains will degrade.
-        - **Electronic music bias**: training set was EDM-heavy. Restoration
-          quality on rock, classical, or vocal-led material may be uneven.
-        - **No crowd-noise model**: the calibrated profile didn't include
-          additive crowd-noise (no real crowd recordings were available
-          during calibration). Recordings with heavy crowd vocals may have
-          residual artifacts.
-        - **Non-commercial use only** — see the license below.
-        ## License
-        Dual non-commercial license:
-        - [NVIDIA Source Code License for A2SB](LICENSE.NSCL-A2SB) (the upstream
-          license inherited from the A2SB base checkpoint)
-        - [PolyForm Noncommercial 1.0.0](LICENSE.PolyForm-NC) (additional terms
-          on top, source-availability + patent retaliation)
-        You must comply with **both** licenses. Use is restricted to research
-        and evaluation only — no commercial use is permitted. See
-        [LICENSING.md](https://github.com/protodotdesign/locutius/blob/main/LICENSING.md)
-        for the full plain-English breakdown.
-        ## Citation
-        If you use this model in research, please cite the upstream A2SB paper
-        and reference this fine-tune:
-        ```bibtex
-        @misc{soundboard,
-          title={Soundboard: festival audio restoration via profile-calibrated Schrödinger Bridge fine-tuning},
-          author={Locutius},
-          year={2026},
-          howpublished={\url{https://huggingface.co/protodotdesign/Soundboard}},
-        }
-        ```

+---
+license: other
+license_name: nscl-a2sb-and-polyform-nc
+license_link: https://raw.githubusercontent.com/NVIDIA/diffusion-audio-restoration/refs/heads/main/LICENSE
+tags:
+  - audio
+  - audio-restoration
+  - schrodinger-bridge
+  - diffusion
+  - festival-audio
+  - non-commercial
+library_name: pytorch
+pipeline_tag: audio-to-audio
+---
+# Soundboard
+Schrödinger Bridge denoiser fine-tuned to restore live music recordings: it
+recovers a soundboard-style mix from heavily corrupted audience recordings
+(room reverb + audience-mic blend + lossy codec artifacts).
+Fine-tuned from NVIDIA's
+[A2SB](https://huggingface.co/nvidia/audio_to_audio_schrodinger_bridge)
+(`twosplit_0.5_1.0` split) on a synthetic-corruption training pipeline driven
+by **profile-based augmentation** — corruption parameters are calibrated
+from real (clean, festival-recording) pairs and sampled at training time
+from the recovered distribution. See [Locutius](https://github.com/protodotdesign/locutius)
+for the full corruption chain, profiling, and training scaffold.
+## Quick facts
+| | |
+|---|---|
+| Architecture | AttnUNetF (565.5M params) |
+| Audio format | 44.1 kHz, 2-channel, 32-bit float |
+| Segment length | 130560 samples (2.96 s) |
+| STFT | n_fft=2048, hop=512, window=hann |
+| Representation | 3-channel `[mag^0.25, cos(phase), sin(phase)]` |
+| Trained at step | 50,000 |
+| Base checkpoint | NVIDIA A2SB `twosplit_0.5_1.0` |
+| Checkpoint size | 2.1 GB |
+| Diffusion | Schrödinger Bridge, β_max=1.0 |
+## Usage
+Load with the [Locutius](https://github.com/protodotdesign/locutius)
+training package:
+```python
+import torch
+from huggingface_hub import hf_hub_download
+from locutius_train.config import TrainConfig
+from locutius_train.network import AttnUNetF, SinusoidalTemporalEmbedding
+from locutius_train.diffusion import Diffusion
+from locutius_train.representation import WaveformToInput, InputToWaveform
+from locutius_train.restore import restore_spectrogram
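+# Download the fine-tuned checkpoint from the Hub (cached after the first call).
+ckpt_path = hf_hub_download(repo_id="protodotdesign/Soundboard", filename="model.pt")
+# Fall back to CPU when no GPU is available.
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
+sd = torch.load(ckpt_path, map_location=device, weights_only=False)
+# Rebuild the network from the default training config.
+cfg = TrainConfig()
+model = AttnUNetF(
+    n_updown_levels=cfg.model.n_updown_levels,
+    in_channels=cfg.model.in_channels,
+    hidden_channels=list(cfg.model.hidden_channels),
+    out_channels=cfg.model.out_channels,
+    emb_channels=cfg.diffusion.n_timestep_channels,
+    band_embedding_dim=cfg.model.band_embedding_dim,
+    n_attn_heads=cfg.model.n_attn_heads,
+    attention_levels=list(cfg.model.attention_levels),
+    use_attn_input_norm=cfg.model.use_attn_input_norm,
+    num_res_blocks=cfg.model.num_res_blocks,
+).to(device).eval()
+# The weights are stored under the "model" key of the checkpoint.
+model.load_state_dict(sd["model"])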
+```
+See `restore.py` in the Locutius repo for a complete CLI that takes a
+clean source, applies the calibrated festival-corruption profile, and
+runs the reverse Schrödinger Bridge to produce a restored output.
+## Calibrated corruption profile
+This model was trained against a single calibrated profile recovered
+from a real (studio FLAC, festival M4A) pair via per-kick local
+Wiener deconvolution. The profile is bundled in `profile.json`:
+```json
+{
+  "name": "edc_festival",
+  "ir_path": "../impulses/EchoThief/Brutalism/San Diego Supercomputer Center Outdoor Patio California.wav",
+  "delay_ms_range": [
+    15.0,
+    25.0
+  ],
+  "studio_gain_range": [
+    0.6,
+    0.7
+  ],
+  "room_gain_range": [
+    0.55,
+    0.65
+  ]
 }
+```
+Each training step draws fresh values from these ranges, so across the
+50,000 training steps the model saw roughly 50,000 distinct delay/blend
+combinations, all within the same venue character.
+## Training data
+Trained on a focused subset of electronic music FLACs. **No festival
+recordings or other licensed audio were stored or distributed** —
+only the studio source material was used; festival-corrupted versions
+were synthesized on-the-fly from the calibrated profile during each
+training step.
+## Limitations
+- **Single profile**: trained against one calibrated venue (`edc_festival`).
+  Performance will degrade on festival recordings from venues or mix
+  chains that differ substantially from it.
+- **Electronic music bias**: training set was EDM-heavy. Restoration
+  quality on rock, classical, or vocal-led material may be uneven.
+- **No crowd-noise model**: the calibrated profile did not include
+  additive crowd noise (no real crowd recordings were available
+  during calibration). Recordings with heavy crowd vocals may retain
+  residual artifacts.
+- **Non-commercial use only** — see the license below.
+## License
+Dual non-commercial license:
+- [NVIDIA Source Code License for A2SB](LICENSE.NSCL-A2SB) (the upstream
+  license inherited from the A2SB base checkpoint)
+- [PolyForm Noncommercial 1.0.0](LICENSE.PolyForm-NC) (additional terms
+  layered on top: source availability and patent retaliation)
+You must comply with **both** licenses. Use is restricted to research
+and evaluation only — no commercial use is permitted. See
+[LICENSING.md](https://github.com/protodotdesign/locutius/blob/main/LICENSING.md)
+for the full plain-English breakdown.
+## Citation
+If you use this model in research, please cite the upstream A2SB paper
+and reference this fine-tune:
+```bibtex
+@misc{soundboard,
+  title={Soundboard: festival audio restoration via profile-calibrated Schrödinger Bridge fine-tuning},
+  author={Locutius},
+  year={2026},
+  howpublished={\url{https://huggingface.co/protodotdesign/Soundboard}},
+}
+```
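
The "Calibrated corruption profile" section in the diff above says each training step draws fresh values from the ranges in `profile.json`. A minimal sketch of that per-step draw is below. It is an illustration and not the Locutius implementation: the function name `sample_corruption_params` is hypothetical, uniform sampling within each range is an assumption (the card only says values are drawn "from these ranges"), and the `ir_path` field is omitted because no audio is loaded here.

```python
import json
import random

# Ranges copied from the profile.json shown in the model card.
# "ir_path" is left out because this sketch does not load the impulse response.
PROFILE_JSON = """
{
  "name": "edc_festival",
  "delay_ms_range": [15.0, 25.0],
  "studio_gain_range": [0.6, 0.7],
  "room_gain_range": [0.55, 0.65]
}
"""


def sample_corruption_params(profile, rng):
    """Draw one set of corruption parameters for a single training step.

    ASSUMPTION: each parameter is sampled uniformly and independently
    within its range; the actual sampling distribution is not documented.
    """
    return {
        "delay_ms": rng.uniform(*profile["delay_ms_range"]),
        "studio_gain": rng.uniform(*profile["studio_gain_range"]),
        "room_gain": rng.uniform(*profile["room_gain_range"]),
    }


profile = json.loads(PROFILE_JSON)
rng = random.Random(0)  # seeded so the draw is reproducible
params = sample_corruption_params(profile, rng)
print(params)
```

Passing an explicit `random.Random` instance keeps the draws reproducible per seed, which can help when comparing training runs under the same profile.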