Commit b2dc23c by protodotdesign (verified), 1 parent: 6c1cfa3

Fix Readme

  1. README.md +156 -156
README.md CHANGED
@@ -1,157 +1,157 @@
1
- ---
2
- license: other
3
- license_name: nscl-a2sb-and-polyform-nc
4
- license_link: ./LICENSE.NSCL-A2SB
5
- tags:
6
- - audio
7
- - audio-restoration
8
- - schrodinger-bridge
9
- - diffusion
10
- - festival-audio
11
- - non-commercial
12
- library_name: pytorch
13
- pipeline_tag: audio-to-audio
14
- ---
15
-
16
- # Soundboard
17
-
18
- Schrödinger Bridge denoiser fine-tuned for festival audio restoration —
19
- recovers a soundboard-style mix from heavily-corrupted audience recordings
20
- (room reverb + audience-mic blend + lossy codec artifacts).
21
-
22
- Fine-tuned from NVIDIA's
23
- [A2SB](https://huggingface.co/nvidia/audio_to_audio_schrodinger_bridge)
24
- (`/root/locutius/training/checkpoints/model.pt` split) on a synthetic-corruption training pipeline driven
25
- by **profile-based augmentation** — corruption parameters are calibrated
26
- from real (clean, festival-recording) pairs and sampled at training time
27
- from the recovered distribution. See [Locutius](https://github.com/protodotdesign/locutius)
28
- for the full corruption chain, profiling, and training scaffold.
29
-
30
- ## Quick facts
31
-
32
- | | |
33
- |---|---|
34
- | Architecture | AttnUNetF (565.5M params) |
35
- | Audio format | 44.1 kHz, 2-channel, 32-bit float |
36
- | Segment length | 130560 samples (2.96 s) |
37
- | STFT | n_fft=2048, hop=512, window=hann |
38
- | Representation | 3-channel `[mag^0.25, cos(phase), sin(phase)]` |
39
- | Trained at step | 150,000 |
40
- | Base checkpoint | NVIDIA A2SB `/root/locutius/training/checkpoints/model.pt` |
41
- | Checkpoint size | 2.1 GB |
42
- | Diffusion | Schrödinger Bridge, β_max=1.0 |
43
-
44
- ## Usage
45
-
46
- Load with the [Locutius](https://github.com/protodotdesign/locutius)
47
- training package:
48
-
49
- ```python
50
- import torch
51
- from huggingface_hub import hf_hub_download
52
- from locutius_train.config import TrainConfig
53
- from locutius_train.network import AttnUNetF, SinusoidalTemporalEmbedding
54
- from locutius_train.diffusion import Diffusion
55
- from locutius_train.representation import WaveformToInput, InputToWaveform
56
- from locutius_train.restore import restore_spectrogram
57
-
58
- ckpt_path = hf_hub_download(repo_id="protodotdesign/Soundboard", filename="model.pt")
59
- sd = torch.load(ckpt_path, map_location="cuda", weights_only=False)
60
-
61
- cfg = TrainConfig()
62
- model = AttnUNetF(
63
- n_updown_levels=cfg.model.n_updown_levels,
64
- in_channels=cfg.model.in_channels,
65
- hidden_channels=list(cfg.model.hidden_channels),
66
- out_channels=cfg.model.out_channels,
67
- emb_channels=cfg.diffusion.n_timestep_channels,
68
- band_embedding_dim=cfg.model.band_embedding_dim,
69
- n_attn_heads=cfg.model.n_attn_heads,
70
- attention_levels=list(cfg.model.attention_levels),
71
- use_attn_input_norm=cfg.model.use_attn_input_norm,
72
- num_res_blocks=cfg.model.num_res_blocks,
73
- ).to("cuda").eval()
74
- model.load_state_dict(sd["model"])
75
- ```
76
-
77
- See `restore.py` in the Locutius repo for a complete CLI that takes a
78
- clean source, applies the calibrated festival-corruption profile, and
79
- runs the reverse Schrödinger Bridge to produce a restored output.
80
-
81
- ## Calibrated corruption profile
82
-
83
- This model was trained against a single calibrated profile recovered
84
- from a real (studio FLAC, festival M4A) pair via per-kick local
85
- Wiener deconvolution. The profile is bundled in `profile.json`:
86
-
87
- ```json
88
- {
89
- "name": "edc_festival",
90
- "ir_path": "../impulses/EchoThief/Brutalism/San Diego Supercomputer Center Outdoor Patio California.wav",
91
- "delay_ms_range": [
92
- 15.0,
93
- 25.0
94
- ],
95
- "studio_gain_range": [
96
- 0.6,
97
- 0.7
98
- ],
99
- "room_gain_range": [
100
- 0.55,
101
- 0.65
102
- ]
103
  }
104
- ```
105
-
106
- Each training-step corruption draws fresh values from these ranges,
107
- so the model has been exposed to ~50,000 distinct delay/blend
108
- combinations within the same venue character.
109
-
110
- ## Training data
111
-
112
- Trained on a focused subset of electronic music FLACs. **No festival
113
- recordings or other licensed audio were stored or distributed** —
114
- only the studio source material was used; festival-corrupted versions
115
- were synthesized on-the-fly from the calibrated profile during each
116
- training step.
117
-
118
- ## Limitations
119
-
120
- - **Single profile**: trained against one calibrated venue (`edc_festival`).
121
- Performance on festival recordings from very different venues / mix
122
- chains will degrade.
123
- - **Electronic music bias**: training set was EDM-heavy. Restoration
124
- quality on rock, classical, or vocal-led material may be uneven.
125
- - **No crowd-noise model**: the calibrated profile didn't include
126
- additive crowd-noise (no real crowd recordings were available
127
- during calibration). Recordings with heavy crowd vocals may have
128
- residual artifacts.
129
- - **Non-commercial use only** — see the license below.
130
-
131
- ## License
132
-
133
- Dual non-commercial license:
134
-
135
- - [NVIDIA Source Code License for A2SB](LICENSE.NSCL-A2SB) (the upstream
136
- license inherited from the A2SB base checkpoint)
137
- - [PolyForm Noncommercial 1.0.0](LICENSE.PolyForm-NC) (additional terms
138
- on top, source-availability + patent retaliation)
139
-
140
- You must comply with **both** licenses. Use is restricted to research
141
- and evaluation only — no commercial use is permitted. See
142
- [LICENSING.md](https://github.com/protodotdesign/locutius/blob/main/LICENSING.md)
143
- for the full plain-English breakdown.
144
-
145
- ## Citation
146
-
147
- If you use this model in research, please cite the upstream A2SB paper
148
- and reference this fine-tune:
149
-
150
- ```bibtex
151
- @misc{soundboard,
152
- title={Soundboard: festival audio restoration via profile-calibrated Schrödinger Bridge fine-tuning},
153
- author={Locutius},
154
- year={2026},
155
- howpublished={\url{https://huggingface.co/protodotdesign/Soundboard}},
156
- }
157
- ```
 
1
+ ---
2
+ license: other
3
+ license_name: nscl-a2sb-and-polyform-nc
4
+ license_link: https://raw.githubusercontent.com/NVIDIA/diffusion-audio-restoration/refs/heads/main/LICENSE
5
+ tags:
6
+ - audio
7
+ - audio-restoration
8
+ - schrodinger-bridge
9
+ - diffusion
10
+ - festival-audio
11
+ - non-commercial
12
+ library_name: pytorch
13
+ pipeline_tag: audio-to-audio
14
+ ---
15
+
16
+ # Soundboard
17
+
18
+ Schrödinger Bridge denoiser fine-tuned for musical recording audio restoration —
19
+ recovers a soundboard-style mix from heavily-corrupted audience recordings
20
+ (room reverb + audience-mic blend + lossy codec artifacts).
21
+
22
+ Fine-tuned from NVIDIA's
23
+ [A2SB](https://huggingface.co/nvidia/audio_to_audio_schrodinger_bridge)
24
+ (`twosplit_0.5_1.0` split) on a synthetic-corruption training pipeline driven
25
+ by **profile-based augmentation** — corruption parameters are calibrated
26
+ from real (clean, festival-recording) pairs and sampled at training time
27
+ from the recovered distribution. See [Locutius](https://github.com/protodotdesign/locutius)
28
+ for the full corruption chain, profiling, and training scaffold.
29
+
30
+ ## Quick facts
31
+
32
+ | | |
33
+ |---|---|
34
+ | Architecture | AttnUNetF (565.5M params) |
35
+ | Audio format | 44.1 kHz, 2-channel, 32-bit float |
36
+ | Segment length | 130560 samples (2.96 s) |
37
+ | STFT | n_fft=2048, hop=512, window=hann |
38
+ | Representation | 3-channel `[mag^0.25, cos(phase), sin(phase)]` |
39
+ | Trained at step | 50,000 |
40
+ | Base checkpoint | NVIDIA A2SB `twosplit_0.5_1.0` |
41
+ | Checkpoint size | 2.1 GB |
42
+ | Diffusion | Schrödinger Bridge, β_max=1.0 |
43
+
44
+ ## Usage
45
+
46
+ Load with the [Locutius](https://github.com/protodotdesign/locutius)
47
+ training package:
48
+
49
+ ```python
50
+ import torch
51
+ from huggingface_hub import hf_hub_download
52
+ from locutius_train.config import TrainConfig
53
+ from locutius_train.network import AttnUNetF, SinusoidalTemporalEmbedding
54
+ from locutius_train.diffusion import Diffusion
55
+ from locutius_train.representation import WaveformToInput, InputToWaveform
56
+ from locutius_train.restore import restore_spectrogram
57
+
58
+ ckpt_path = hf_hub_download(repo_id="protodotdesign/Soundboard", filename="model.pt")
59
+ sd = torch.load(ckpt_path, map_location="cuda", weights_only=False)
60
+
61
+ cfg = TrainConfig()
62
+ model = AttnUNetF(
63
+ n_updown_levels=cfg.model.n_updown_levels,
64
+ in_channels=cfg.model.in_channels,
65
+ hidden_channels=list(cfg.model.hidden_channels),
66
+ out_channels=cfg.model.out_channels,
67
+ emb_channels=cfg.diffusion.n_timestep_channels,
68
+ band_embedding_dim=cfg.model.band_embedding_dim,
69
+ n_attn_heads=cfg.model.n_attn_heads,
70
+ attention_levels=list(cfg.model.attention_levels),
71
+ use_attn_input_norm=cfg.model.use_attn_input_norm,
72
+ num_res_blocks=cfg.model.num_res_blocks,
73
+ ).to("cuda").eval()
74
+ model.load_state_dict(sd["model"])
75
+ ```
76
+
77
+ See `restore.py` in the Locutius repo for a complete CLI that takes a
78
+ clean source, applies the calibrated festival-corruption profile, and
79
+ runs the reverse Schrödinger Bridge to produce a restored output.
80
+
81
+ ## Calibrated corruption profile
82
+
83
+ This model was trained against a single calibrated profile recovered
84
+ from a real (studio FLAC, festival M4A) pair via per-kick local
85
+ Wiener deconvolution. The profile is bundled in `profile.json`:
86
+
87
+ ```json
88
+ {
89
+ "name": "edc_festival",
90
+ "ir_path": "../impulses/EchoThief/Brutalism/San Diego Supercomputer Center Outdoor Patio California.wav",
91
+ "delay_ms_range": [
92
+ 15.0,
93
+ 25.0
94
+ ],
95
+ "studio_gain_range": [
96
+ 0.6,
97
+ 0.7
98
+ ],
99
+ "room_gain_range": [
100
+ 0.55,
101
+ 0.65
102
+ ]
103
  }
104
+ ```
105
+
106
+ Each training-step corruption draws fresh values from these ranges,
107
+ so the model has been exposed to ~50,000 distinct delay/blend
108
+ combinations within the same venue character.
109
+
110
+ ## Training data
111
+
112
+ Trained on a focused subset of electronic music FLACs. **No festival
113
+ recordings or other licensed audio were stored or distributed** —
114
+ only the studio source material was used; festival-corrupted versions
115
+ were synthesized on-the-fly from the calibrated profile during each
116
+ training step.
117
+
118
+ ## Limitations
119
+
120
+ - **Single profile**: trained against one calibrated venue (`edc_festival`).
121
+ Performance on festival recordings from very different venues / mix
122
+ chains will degrade.
123
+ - **Electronic music bias**: training set was EDM-heavy. Restoration
124
+ quality on rock, classical, or vocal-led material may be uneven.
125
+ - **No crowd-noise model**: the calibrated profile didn't include
126
+ additive crowd-noise (no real crowd recordings were available
127
+ during calibration). Recordings with heavy crowd vocals may have
128
+ residual artifacts.
129
+ - **Non-commercial use only** — see the license below.
130
+
131
+ ## License
132
+
133
+ Dual non-commercial license:
134
+
135
+ - [NVIDIA Source Code License for A2SB](LICENSE.NSCL-A2SB) (the upstream
136
+ license inherited from the A2SB base checkpoint)
137
+ - [PolyForm Noncommercial 1.0.0](LICENSE.PolyForm-NC) (additional terms
138
+ on top, source-availability + patent retaliation)
139
+
140
+ You must comply with **both** licenses. Use is restricted to research
141
+ and evaluation only — no commercial use is permitted. See
142
+ [LICENSING.md](https://github.com/protodotdesign/locutius/blob/main/LICENSING.md)
143
+ for the full plain-English breakdown.
144
+
145
+ ## Citation
146
+
147
+ If you use this model in research, please cite the upstream A2SB paper
148
+ and reference this fine-tune:
149
+
150
+ ```bibtex
151
+ @misc{soundboard,
152
+ title={Soundboard: festival audio restoration via profile-calibrated Schrödinger Bridge fine-tuning},
153
+ author={Locutius},
154
+ year={2026},
155
+ howpublished={\url{https://huggingface.co/protodotdesign/Soundboard}},
156
+ }
157
+ ```
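
Note on the "Calibrated corruption profile" section of the README above: it says each training step draws fresh values from the ranges in `profile.json`. The following is a minimal sketch of what such a draw could look like, not the Locutius implementation. The function name `sample_corruption_params` is hypothetical, uniform sampling is an assumption (the README does not state the distribution), and `ir_path` is omitted because the sketch does not load the impulse response.

```python
import json
import random

# Ranges copied from the profile.json shown in the README diff above
# (ir_path left out: this sketch does not apply the impulse response).
PROFILE_JSON = """
{
  "name": "edc_festival",
  "delay_ms_range": [15.0, 25.0],
  "studio_gain_range": [0.6, 0.7],
  "room_gain_range": [0.55, 0.65]
}
"""

SAMPLE_RATE_HZ = 44100  # audio format listed in the README's "Quick facts" table


def sample_corruption_params(profile, rng):
    """Draw one set of corruption parameters from the profile ranges.

    Hypothetical helper. Uniform sampling is an assumption; the README
    only states that values are drawn from these ranges.
    """
    return {
        "delay_ms": rng.uniform(*profile["delay_ms_range"]),
        "studio_gain": rng.uniform(*profile["studio_gain_range"]),
        "room_gain": rng.uniform(*profile["room_gain_range"]),
    }


profile = json.loads(PROFILE_JSON)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
params = sample_corruption_params(profile, rng)

# Convert the sampled delay to whole samples at 44.1 kHz.
delay_samples = round(params["delay_ms"] * SAMPLE_RATE_HZ / 1000)
print(params, delay_samples)
```

Calling `sample_corruption_params` once per training step with a shared `rng` would give a different delay and blend for every step while staying inside the calibrated ranges, which matches the README's description of many distinct combinations within one venue character.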