Sidon — Call-Centre / Telephony Speech Restoration

Restore narrowband, codec'd, noisy call-centre / telephony speech (e.g. 8 kHz G.711/GSM phone audio) to clean 48 kHz. Two stages, both trained for the telephony domain (Malaysian/Singaporean multilingual clean teachers):

input (8-16 kHz telephony) --16k--> [FE: 24-layer w2v-BERT 2.0 + LoRA] --features[T,1024]-->
      [DAC decoder, 188M] --> 48 kHz clean waveform

The FE LoRA adapter is merged into the base weights at load time, so inference needs no peft — just transformers + descript-audio-codec.
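
To make the merge concrete, here is a minimal sketch on a toy linear layer. It assumes the standard LoRA formulation W' = W + (alpha / r) * B @ A; the shapes and random values are illustrative only and are not taken from the checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 4, 2, 16          # toy sizes, not the real model's
W = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((r, d_in))           # LoRA down-projection (lora_A)
B = rng.standard_normal((d_out, r))          # LoRA up-projection (lora_B)
scale = alpha / r

x = rng.standard_normal(d_in)
y_adapter = W @ x + scale * (B @ (A @ x))    # what an unmerged adapter computes
W_merged = W + scale * (B @ A)               # fold the low-rank update into W
y_merged = W_merged @ x                      # plain linear layer, no adapter code

print(np.allclose(y_adapter, y_merged))      # True
```

Once the update is folded in, the layer is an ordinary linear layer again, which is why a runtime without peft can load it.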

Quick start — infer from the HF checkpoint

pip install torch torchaudio "transformers>=4.56" "descript-audio-codec>=1.0.0" soundfile "huggingface_hub[cli]"

# pull the CLI + the two slim checkpoints from the Hub
hf auth login   # private repo: log in first (or export HF_TOKEN=hf_...)
hf download Scicom-intl/sidon-callcentre \
    infer_callcentre.py fe_callcentre/fe_adapter_full.pt \
    decoder_callcentre/decoder_only.pt --local-dir sidon-callcentre

cd sidon-callcentre && python infer_callcentre.py \
    --input your_call.wav --out-dir out \
    --fe-adapter fe_callcentre/fe_adapter_full.pt \
    --decoder decoder_callcentre/decoder_only.pt \
    --chunk 0 --device cuda   # --chunk 0 = NO chunking (default single pass); --device cpu if no GPU
# -> out/your_call_restored48k.wav (clean 48 kHz)  +  out/your_call_orig48k.wav (A/B)

Prefer Python (load weights from the Hub with hf_hub_download)? See Python below.

Status: the decoder is still training (~step 30k of 100k) and these checkpoints are refreshed periodically — quality keeps improving. It already restores real 8 kHz call-centre audio well.

Files

Use the current-run checkpoints under fe_callcentre/ and decoder_callcentre/:

path	role	size
`fe_callcentre/fe_adapter_full.pt`	FE adapter (inference) — 144 tensors: 96 LoRA + 48 trained `output_dense` biases	~63 MB
`decoder_callcentre/decoder_only.pt`	decoder (inference) — 188M DAC decoder	~0.75 GB
`fe_callcentre/last.pt`, `decoder_callcentre/last.pt`	raw checkpoints (resume training)	~2.5 / 2.8 GB
`infer_callcentre.py`	inference CLI (below)	—

For inference you only need the two slim files + infer_callcentre.py. (Root-level fe_adapter_full.pt / decoder_only.pt are from an earlier run and are superseded.)
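
The tensor count and size quoted for the FE adapter can be sanity-checked with a little arithmetic. The sketch below assumes two output_dense projections per layer (one per feed-forward module), a hidden size of 1024, a feed-forward width of 4096 and float32 storage; the rank r=64 is the one listed under Model details. These are assumptions about the base architecture, not values read from the checkpoint.

```python
layers = 24                 # encoder layers in w2v-BERT 2.0
modules_per_layer = 2       # assumed: one output_dense per feed-forward module
hidden, ffn = 1024, 4096    # assumed hidden size and feed-forward width
r = 64                      # LoRA rank

modules = layers * modules_per_layer
lora_tensors = modules * 2                  # lora_A and lora_B per module
bias_tensors = modules                      # one trained bias per module
params = modules * (r * ffn + hidden * r + hidden)
size_mb = params * 4 / 1e6                  # assumed float32 (4 bytes per value)

print(lora_tensors, bias_tensors, lora_tensors + bias_tensors)   # 96 48 144
print(params, round(size_mb, 1))                                 # 15777792 63.1
```

Under these assumptions the numbers line up with the 144 tensors, ~63 MB and ~16M trainable parameters quoted in this card.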

End-to-end example (straight from HuggingFace)

pip install torch torchaudio "transformers>=4.56" "descript-audio-codec>=1.0.0" soundfile "huggingface_hub[cli]"

# pull the CLI + the two slim checkpoints, straight from this repo
hf auth login   # private repo: log in first (or export HF_TOKEN=hf_...)
hf download Scicom-intl/sidon-callcentre \
    infer_callcentre.py \
    fe_callcentre/fe_adapter_full.pt \
    decoder_callcentre/decoder_only.pt \
    --local-dir sidon-callcentre
cd sidon-callcentre

# restore your audio end-to-end
python infer_callcentre.py \
    --input your_call.wav \
    --out-dir out \
    --fe-adapter fe_callcentre/fe_adapter_full.pt \
    --decoder   decoder_callcentre/decoder_only.pt \
    --chunk 0 --device cuda   # --chunk 0 = NO chunking (single straight pass, default); --device cpu if no GPU

Outputs:

out/your_call_restored48k.wav — the restored clean 48 kHz speech.
out/your_call_orig48k.wav — the input, naively upsampled to 48 kHz (no model), for an A/B listen.

--input accepts a file or a directory (.wav/.flac/.mp3/.ogg/.opus/.m4a). Stereo (e.g. agent/customer on separate channels) is restored per channel and recombined.
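
Per-channel restoration can be pictured as below. This is a minimal sketch, not the CLI's actual code: restore_per_channel and fake_restore are hypothetical names, and fake_restore only repeats samples to mimic the 16 kHz to 48 kHz length change.

```python
import numpy as np

def restore_per_channel(x, restore_mono):
    """Apply a mono restoration function to each channel of [samples, channels] audio."""
    if x.ndim == 1:                      # already mono
        return restore_mono(x)
    outs = [restore_mono(x[:, c]) for c in range(x.shape[1])]
    n = min(len(o) for o in outs)        # guard against small length differences
    return np.stack([o[:n] for o in outs], axis=1)

# Stand-in for the model: 3x sample repetition, no actual restoration.
fake_restore = lambda ch: np.repeat(ch, 3)

stereo = np.zeros((16000, 2), dtype=np.float32)
stereo[:, 0] = 0.5                       # e.g. the agent channel
stereo[:, 1] = -0.5                      # e.g. the customer channel
out = restore_per_channel(stereo, fake_restore)
print(out.shape)                         # (48000, 2)
```

Processing channels independently keeps the two speakers from bleeding into each other, which a mono downmix would not.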

Inference is a single straight pass (--chunk 0, the default): w2v-BERT 2.0 uses relative/rotary position embeddings and the DAC decoder is fully convolutional, so a full pass is length-invariant and cleanest. --chunk <seconds> enables crossfaded windowing purely as a memory fallback for very long audio (self-attention is O(T^2)); it is spectrally near-identical (log-mel corr ≈ 0.98) but adds seams, so prefer the default single pass unless you hit OOM.
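
Crossfaded windowing of this kind can be sketched as a weighted overlap-add. The function below is an illustration under assumptions, not the CLI's implementation: chunked_apply, win, overlap and up are hypothetical names, sizes are in samples rather than seconds, and fn stands in for the model.

```python
import numpy as np

def chunked_apply(x, fn, win, overlap, up=1):
    """Run fn on overlapping windows of x and crossfade the outputs.

    fn maps n input samples to n * up output samples. Requires overlap < win.
    """
    hop = win - overlap
    out = np.zeros(len(x) * up)
    wsum = np.zeros(len(x) * up)
    for start in range(0, len(x), hop):
        y = fn(x[start:start + win])
        # weight: linear ramps at both edges, flat in the middle
        w = np.ones(len(y))
        ramp = min(overlap * up, len(y) // 2)
        if ramp > 0:
            fade = np.linspace(0.0, 1.0, ramp + 2)[1:-1]   # strictly between 0 and 1
            w[:ramp] = fade
            w[-ramp:] = fade[::-1]
        o = start * up
        out[o:o + len(y)] += w * y
        wsum[o:o + len(y)] += w
        if start + win >= len(x):        # this window reached the end
            break
    return out / np.maximum(wsum, 1e-8)  # normalise so the weights sum to 1

x = np.sin(np.linspace(0, 20, 250))
y = chunked_apply(x, lambda s: s, win=100, overlap=20)
print(np.allclose(y, x))                 # True
```

With an identity fn the windows agree exactly in the overlap, so the result equals the input. A real model sees different context in each window, so the overlapping outputs differ slightly, and the crossfade can only smooth that disagreement rather than remove it.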

Python (pull weights from the Hub)

import numpy as np, soundfile as sf, torch, torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel
import dac

REPO, SSL, FE_SR, SR_OUT = "Scicom-intl/sidon-callcentre", "facebook/w2v-bert-2.0", 16000, 48000
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ck = torch.load(hf_hub_download(REPO, "fe_callcentre/fe_adapter_full.pt"), map_location="cpu")
ad, scale = ck["adapter"], ck["lora_alpha"] / ck["r"]
fe = Wav2Vec2BertModel.from_pretrained(SSL, num_hidden_layers=ck.get("layers", 24), layerdrop=0.0)
sd = fe.state_dict()                                     # merge LoRA -> base (no peft needed)
for p in sorted({k[:-len(".lora_A.default.weight")] for k in ad if k.endswith(".lora_A.default.weight")}):
    sd[p+".weight"] = sd[p+".weight"].float() + scale * (ad[p+".lora_B.default.weight"].float() @ ad[p+".lora_A.default.weight"].float())
    if p+".base_layer.bias" in ad: sd[p+".bias"] = ad[p+".base_layer.bias"].to(sd[p+".bias"].dtype)
fe.load_state_dict(sd); fe.to(dev).eval()

dck = torch.load(hf_hub_download(REPO, "decoder_callcentre/decoder_only.pt"), map_location="cpu")
dec = dac.model.dac.Decoder(input_channel=1024, channels=dck.get("dec_channels", 3072), rates=[8,5,4,3,2])
dec.load_state_dict(dck["decoder"]); dec.to(dev).eval()
proc = AutoFeatureExtractor.from_pretrained(SSL)

@torch.no_grad()
def restore(path, out="restored48k.wav"):                # single straight pass
    x, sr = sf.read(path, dtype="float32")
    if x.ndim > 1: x = x.mean(1)                         # downmix to mono (the CLI restores per channel instead)
    if sr != FE_SR:                                      # the FE expects 16 kHz input
        x = torchaudio.functional.resample(torch.from_numpy(x)[None], sr, FE_SR)[0].numpy()
    x = x / (np.abs(x).max() + 1e-9) * 0.95              # peak-normalise the input
    feats = {k: v.to(dev) for k, v in proc(x, sampling_rate=FE_SR, return_tensors="pt").items()}
    h = fe(**feats).last_hidden_state                    # features [1, T, 1024]
    y = dec(h.transpose(1, 2)).squeeze().float().cpu().numpy()   # [1, 1024, T] -> 48 kHz waveform
    sf.write(out, y / (np.abs(y).max() + 1e-9) * 0.97, SR_OUT)   # peak-normalise the output
    print("wrote", out)

restore("your_call.wav")   # <-- your own telephony/call-centre audio

Model details

FE: full 24-layer facebook/w2v-bert-2.0 + fresh LoRA (r=64, alpha=16, dropout=0.1, bias="lora_only", target_modules=["output_dense"]), trained by MSE distillation of a degraded signal's features toward a frozen teacher on the clean signal (~16M trainable).
Decoder: dac.model.dac.Decoder(input_channel=1024, channels=3072, rates=[8,5,4,3,2]) (188M, 50 fps x 960 = 48 kHz), trained with DAC multi-resolution mel + GAN (loss = 15*mel + 2*adv + 1*feat).
Degradation (train-time): telephone HP -> narrowband ceiling (8/11/12/16k) -> GSM / G.711-mu-law -> 16-40 kbps MP3 -> line noise + VoIP dropouts.
Teachers (clean 48 kHz): EARS + Expresso (studio) + DNSMOS-filtered multilingual HF datasets + DNSMOS-filtered Malaysian/Singaporean podcast & movie (Scicom-intl/sidon-callcentre-podcast).
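
To give a feel for the train-time degradation, here is a heavily simplified sketch covering only three of the listed steps: a narrowband ceiling, mu-law companding and additive line noise. It is not the training pipeline. mu_law_roundtrip and degrade are hypothetical names, the companding uses the continuous mu-law formula rather than the segmented G.711 tables, the decimation is a crude box filter, and the high-pass, GSM, MP3 and dropout stages are omitted.

```python
import numpy as np

def mu_law_roundtrip(x, mu=255, bits=8):
    """Compress, quantize to `bits`, and expand (continuous mu-law approximation)."""
    x = np.clip(x, -1.0, 1.0)
    comp = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compress to [-1, 1]
    levels = 2 ** bits
    q = np.round((comp + 1) / 2 * (levels - 1))                   # integer codes 0..levels-1
    comp_q = q / (levels - 1) * 2 - 1                             # back to [-1, 1]
    return np.sign(comp_q) * np.expm1(np.abs(comp_q) * np.log1p(mu)) / mu

def degrade(x, sr=48000, tel_sr=8000, noise_db=-40.0, seed=0):
    """Box-filter decimation to tel_sr, mu-law round trip, then additive noise."""
    step = sr // tel_sr
    n = len(x) // step * step
    nb = x[:n].reshape(-1, step).mean(axis=1)        # average each block of `step` samples
    nb = mu_law_roundtrip(nb)
    rng = np.random.default_rng(seed)
    nb = nb + 10 ** (noise_db / 20) * rng.standard_normal(len(nb))
    return np.clip(nb, -1.0, 1.0)

t = np.arange(48000) / 48000
clean = 0.5 * np.sin(2 * np.pi * 440 * t)            # one second of a 440 Hz tone
noisy = degrade(clean)
print(noisy.shape)                                   # (8000,)
```

In a distillation setup like the one described for the FE, pairs such as (noisy, clean) would supply the degraded input and the clean teacher target.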

License / intended use

cc-by-nc-4.0 — research / non-commercial. Built on facebook/w2v-bert-2.0 and Descript Audio Codec.
