Diamond — Speech Restoration

Diamond restores degraded speech to near-studio quality. It takes low-quality input — codec artifacts, band-limited telephony audio, lossy compression — and reconstructs a clean 44.1 kHz waveform.

Status — research prototype, work in progress. Diamond is an early-stage prototype, not a finished model. The published checkpoint comes from an in-progress pilot training run that has not converged. Results are preliminary and expected to improve as training continues. The goal of this pilot is to validate the architecture and the augmentation pipeline, not to deliver production-quality restoration.

Files in this repo

File	Role
`step_00100000.safetensors`	Model weights (encoder + decoder).
`step_00100000.json`	Architecture config sidecar — must sit next to the weights.

Model overview


Name	Diamond
Task	Speech-to-speech restoration (audio-to-audio)
Input	Degraded speech, any sample rate (`.mp3` / `.wav` / `.flac`)
Output	Restored speech, 44.1 kHz mono `.wav`
Trainable parameters	~98.5M (encoder ~38.7M + decoder ~59.8M); frozen DAC ~74M separate
Precision	bf16-mixed
Status	Pilot — research prototype, training in progress

Architecture

degraded audio
→ resample to 24 kHz
→ log-mel spectrogram (80 bins, fmax 8 kHz, hop 280 → ~86 Hz frame rate)
→ Transformer encoder (12 layers, d=512, no downsampling, RoPE + RMSNorm + GELU)
→ Transformer decoder (12 layers, d=512, causal self-attn + cross-attn), predicts DAC tokens with a MusicGen-style delayed pattern
→ DAC tokens (9 RVQ codebooks @ ~86 Hz)
→ DAC decoder (frozen)
→ restored waveform @ 44.1 kHz

Encoder keeps full temporal resolution (no stride-2 downsampling) so that fine detail — transients, sibilants, onsets — survives into the decoder.
Decoder predicts 9 codebooks per frame in a single forward pass using a delayed-pattern shift, preserving the residual-vector-quantization causality of the DAC codec.
Codec: Descript Audio Codec 44.1 kHz, 9 codebooks, used frozen as both the training target encoder and the final waveform decoder.
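
The delayed-pattern shift can be illustrated with a minimal sketch. It assumes the simplest MusicGen-style layout, in which codebook k is shifted right by k frames and the gaps are filled with a padding id. The function names and the `PAD` value are hypothetical and are not taken from the Diamond code.

```python
PAD = -1  # hypothetical padding id; a real model would use a reserved token


def apply_delay(codes, pad=PAD):
    """Shift codebook k right by k frames.

    codes: list of K rows, each a list of T token ids.
    Returns K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed):
    """Invert apply_delay and recover the original K x T token grid."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Toy grid: 3 codebooks, 4 frames (Diamond uses 9 codebooks).
codes = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
delayed = apply_delay(codes)
restored = undo_delay(delayed)
```

In this layout, decoding step t carries codebook k for frame t - k, so each finer codebook of a frame is emitted one step after the coarser one it refines.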

Encoder mel frame rate (~86 Hz) is aligned 1:1 with the DAC token frame rate, so cross-attention does not have to bridge a time-resolution mismatch.
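
As a quick check of the frame-rate claim, the arithmetic can be written out. The mel hop of 280 samples at 24 kHz comes from the pipeline above. The DAC hop of 512 samples at 44.1 kHz is an assumption based on the public 44.1 kHz DAC configuration and is not stated in this card.

```python
MEL_SAMPLE_RATE = 24_000  # Hz, from the pipeline description
MEL_HOP = 280             # samples, from the pipeline description
DAC_SAMPLE_RATE = 44_100  # Hz
DAC_HOP = 512             # samples; assumption, not stated in this card

mel_rate = MEL_SAMPLE_RATE / MEL_HOP
dac_rate = DAC_SAMPLE_RATE / DAC_HOP
relative_gap = abs(mel_rate - dac_rate) / dac_rate

print(f"mel: {mel_rate:.2f} Hz, DAC: {dac_rate:.2f} Hz, gap: {relative_gap:.2%}")
```

Under these assumptions both rates round to ~86 Hz and differ by well under 1%. How the pipeline reconciles any residual length difference between the two sequences is not specified here.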

Training data

Studio source: LibriTTS-R (clean studio-quality English speech).
Degradation: a DSP augmentor (EmoliaAugmentorV4) simulates realistic degradation — codec compression, band-limiting, dynamic-range and EQ changes — by retrieving per-sample parameters from a calibrated palette matched to a real degraded-speech distribution. Noise injection and noise transfer are disabled: degradation is codec + EQ only.
Training pairs are synchronized by construction: the same studio clip is the clean target and the source of the degraded input.
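
A minimal sketch of what "synchronized by construction" means: the degraded input is derived from the clean clip itself, so both sides of the pair have the same length and alignment. The `toy_degrade` function is a hypothetical stand-in (a crude moving-average low-pass) and is not EmoliaAugmentorV4.

```python
def toy_degrade(samples, window=5):
    """Hypothetical stand-in for the augmentor: a moving-average low-pass."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out


def make_pair(clean):
    """Return (degraded_input, clean_target), both derived from one studio clip."""
    return toy_degrade(clean), clean


clean_clip = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
degraded, target = make_pair(clean_clip)
```

Because the degraded signal is computed sample by sample from the target, no separate alignment step is needed in this toy setup.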

How to run inference

Use the S2S-inference-diamond repository. Two commands from a fresh clone:

git clone https://github.com/KaniTTS-research-team/S2S-inference-diamond.git
cd S2S-inference-diamond

make setup                                       # uv + ffmpeg + venv + deps + DAC + this checkpoint
make infer INPUT=degraded.mp3                    # → degraded_restored.wav

make setup pulls step_00100000.safetensors and step_00100000.json from this repo into ckpt/, so no manual download is needed. For a folder of files, run make batch-infer INPUT=in_dir/ OUTPUT=out_dir/. For DNSMOS scoring, run make eval.

Windows / PowerShell instructions and CUDA-build notes are in the inference
repo's README.

How to continue training

Use the S2S-train-diamond repository. Two commands to train from scratch:

git clone https://github.com/KaniTTS-research-team/S2S-train-diamond.git
cd S2S-train-diamond

make setup                       # uv + ffmpeg + venv + deps + DAC + augmentor
make train CONFIG=pilot          # auto-tokenizes the studio corpus, then trains

To fine-tune from this published checkpoint instead of training from scratch, download the .safetensors file and pass it to make resume:

mkdir -p ckpt
uv run python -c "from huggingface_hub import hf_hub_download; \
hf_hub_download('kyrgyz-ai/diamond-s2s', 'step_00100000.safetensors', local_dir='ckpt/')"
make resume CHECKPOINT=ckpt/step_00100000.safetensors

The .safetensors file contains weights only (no optimizer or scheduler state), so this is a warm start, not a bit-exact resume: Adam moments and the LR schedule start fresh. Expect a brief loss bump during the first 1–2k steps while Adam re-accumulates moments. A strict resume without that bump requires a .pt checkpoint with full state, which is not currently published.
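
Since the config sidecar is described as needing to sit next to the weights, a download helper might fetch both files into the same directory. This is a sketch: the repo id and file names are the ones used in this card, `fetch_checkpoint` is a hypothetical helper, and whether make resume actually reads the JSON sidecar is an assumption.

```python
from pathlib import Path

REPO_ID = "kyrgyz-ai/diamond-s2s"
CHECKPOINT_FILES = ("step_00100000.safetensors", "step_00100000.json")


def fetch_checkpoint(local_dir="ckpt"):
    """Download the weights and the config sidecar side by side.

    Hypothetical helper; returns the local paths of the downloaded files.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    return [
        Path(hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir))
        for name in CHECKPOINT_FILES
    ]
```

Calling `fetch_checkpoint()` before make resume would leave both files under ckpt/ with matching base names.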

Intended use

Restoring lossy / codec-degraded / band-limited recordings of English speech.
Pre-processing speech before downstream tasks where audio quality matters.

Limitations

English-only training data; other languages are out of distribution.
Pilot / prototype stage — training has not converged; the checkpoint is a snapshot of an in-progress experiment, not a finished model.
The model performs bandwidth extension (reconstructing content above the input band): high-frequency detail is generated, not recovered, and may not match the original recording.
No noise removal — the training augmentor's noise injection is disabled (codec + EQ only), so the model is not trained to clean noisy input. Use a dedicated denoiser upstream if needed.
Not designed for non-speech audio (music, environmental sound).
Greedy decoding only — no beam search or sampling.
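
Greedy decoding here means taking the highest-scoring token at each step. A minimal sketch, with hypothetical function names and toy logits rather than the model's real outputs:

```python
def greedy_pick(logits):
    """Return the index of the largest logit (ties go to the lowest index)."""
    return max(range(len(logits)), key=logits.__getitem__)


def greedy_decode_step(logits_per_codebook):
    """Pick one token id per codebook for a single decoding step."""
    return [greedy_pick(logits) for logits in logits_per_codebook]


# Toy example: 3 codebooks with a vocabulary of 3 tokens each.
step_logits = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5], [0.2, 0.2, 0.9]]
tokens = greedy_decode_step(step_logits)
```

The choice is deterministic for fixed weights and numerics, so there is no sampling temperature or beam width to tune.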

License

Apache-2.0. The Descript Audio Codec is used under its own license.
