Sidon: Call-Centre / Telephony Speech Restoration

Restore narrowband, codec-compressed, noisy call-centre / telephony speech (e.g. 8 kHz G.711/GSM phone audio) to clean 48 kHz. Two stages, both trained for the telephony domain (Malaysian/Singaporean + multilingual clean teachers):
input (8-16 kHz telephony) --16k--> [FE: 24-layer w2v-BERT 2.0 + LoRA] --features[T,1024]-->
      [DAC decoder, 188M] --> 48 kHz clean waveform

The FE LoRA adapter is merged into the base weights at load time, so inference needs no peft, just transformers + descript-audio-codec.
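The merge mentioned above follows the usual LoRA identity W' = W + (alpha / r) * B @ A. As a minimal sketch, the toy check below confirms that the merged weight gives the same output as the base layer plus the adapter path. The matrices are tiny and hypothetical (a rank-1 adapter for readability); only the scale alpha / r = 16 / 64 is taken from the hyperparameters listed under Model details.

```python
# Toy check: merging LoRA into the base weight preserves the layer output.
# All matrices are made-up illustration values, stored as nested lists.

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

r, alpha = 64, 16          # values from Model details
scale = alpha / r          # 0.25

W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]   # base weight, shape (out=2, in=3)
B = [[2.0], [4.0]]                         # lora_B, shape (out, rank); toy rank 1
A = [[1.0, -2.0, 0.5]]                     # lora_A, shape (rank, in)
x = [1.0, 2.0, 3.0]

# Unmerged path: base output plus the scaled low-rank update.
unmerged = [b + scale * l for b, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged path: fold the update into the weight once, then do a single matmul.
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
merged = matvec(W_merged, x)

print(unmerged, merged)    # both are [6.25, -3.0]
```

Because the update is folded into the weight once at load time, the runtime graph is the plain base model, which is why peft is not needed for inference.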

Quick start: infer from the HF checkpoint

pip install torch torchaudio "transformers>=4.56" "descript-audio-codec>=1.0.0" soundfile "huggingface_hub[cli]"

# pull the CLI + the two slim checkpoints from the Hub
hf auth login   # private repo: log in first (or export HF_TOKEN=hf_...)
hf download Scicom-intl/sidon-callcentre \
    infer_callcentre.py fe_callcentre/fe_adapter_full.pt \
    decoder_callcentre/decoder_only.pt --local-dir sidon-callcentre

cd sidon-callcentre && python infer_callcentre.py \
    --input your_call.wav --out-dir out \
    --fe-adapter fe_callcentre/fe_adapter_full.pt \
    --decoder decoder_callcentre/decoder_only.pt \
    --chunk 0 --device cuda   # --chunk 0 = NO chunking (default single pass); --device cpu if no GPU
# -> out/your_call_restored48k.wav (clean 48 kHz)  +  out/your_call_orig48k.wav (A/B)

Prefer Python (load weights from the Hub with hf_hub_download)? See Python below.

Status: the decoder is still training (~step 30k of 100k) and these checkpoints are refreshed periodically, so quality keeps improving. It already restores real 8 kHz call-centre audio well.

Files

Use the current-run checkpoints under fe_callcentre/ and decoder_callcentre/:

path | role | size
fe_callcentre/fe_adapter_full.pt | FE adapter (inference): 144 tensors = 96 LoRA + 48 trained output_dense biases | ~63 MB
decoder_callcentre/decoder_only.pt | decoder (inference): 188M DAC decoder | ~0.75 GB
fe_callcentre/last.pt, decoder_callcentre/last.pt | raw checkpoints (resume training) | ~2.5 / 2.8 GB
infer_callcentre.py | inference CLI (below) | -

For inference you only need the two slim files + infer_callcentre.py. (Root-level fe_adapter_full.pt / decoder_only.pt are from an earlier run and are superseded.)
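As a sanity check on the adapter row above, the tensor count and file size can be reproduced with a little arithmetic. This is a sketch under assumptions not stated in this card: a hidden size of 1024 and feed-forward size of 4096 for facebook/w2v-bert-2.0, two feed-forward modules (each with one output_dense) per layer, and float32 storage.

```python
# Rough reconstruction of "144 tensors, ~63 MB" for the FE adapter file.
HIDDEN, INTERMEDIATE = 1024, 4096   # assumption: w2v-BERT 2.0 dimensions
LAYERS, FFN_PER_LAYER = 24, 2       # 24 layers (from this card); 2 FFNs per layer is an assumption
RANK = 64                           # LoRA r, from Model details
BYTES_PER_PARAM = 4                 # assumption: weights stored as float32

modules = LAYERS * FFN_PER_LAYER            # number of adapted output_dense layers
lora_tensors = modules * 2                  # one lora_A and one lora_B per module
bias_tensors = modules                      # one trained bias per module

lora_params = modules * RANK * (INTERMEDIATE + HIDDEN)   # A is (r, 4096), B is (1024, r)
bias_params = modules * HIDDEN
total_params = lora_params + bias_params
size_mb = total_params * BYTES_PER_PARAM / 1e6

print(lora_tensors + bias_tensors)   # 144
print(total_params)                  # 15777792
print(f"{size_mb:.1f} MB")           # 63.1 MB
```

Under these assumptions the numbers line up with the table and with the "~16M trainable" figure in Model details.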

End-to-end example (straight from HuggingFace)

pip install torch torchaudio "transformers>=4.56" "descript-audio-codec>=1.0.0" soundfile "huggingface_hub[cli]"

# pull the CLI + the two slim checkpoints, straight from this repo
hf auth login   # private repo: log in first (or export HF_TOKEN=hf_...)
hf download Scicom-intl/sidon-callcentre \
    infer_callcentre.py \
    fe_callcentre/fe_adapter_full.pt \
    decoder_callcentre/decoder_only.pt \
    --local-dir sidon-callcentre
cd sidon-callcentre

# restore your audio end-to-end
python infer_callcentre.py \
    --input your_call.wav \
    --out-dir out \
    --fe-adapter fe_callcentre/fe_adapter_full.pt \
    --decoder   decoder_callcentre/decoder_only.pt \
    --chunk 0 --device cuda   # --chunk 0 = NO chunking (single straight pass, default); --device cpu if no GPU

Outputs:

  • out/your_call_restored48k.wav: the restored clean 48 kHz speech.
  • out/your_call_orig48k.wav: the input, naively upsampled to 48 kHz (no model), for an A/B listen.

--input accepts a file or a directory (.wav/.flac/.mp3/.ogg/.opus/.m4a). Stereo (e.g. agent/customer on separate channels) is restored per channel and recombined.

Inference is a single straight pass (--chunk 0, the default): w2v-BERT 2.0 uses relative/rotary position embeddings and the DAC decoder is fully convolutional, so a full pass is length-invariant and cleanest. --chunk <seconds> enables crossfaded windowing purely as a memory fallback for very long audio (self-attention is O(T^2)); it is spectrally near-identical (log-mel corr ≈ 0.98) but adds seams, so prefer the default single pass unless you hit OOM.
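The CLI's own chunking code is not reproduced in this card. To illustrate what crossfaded windowing means, here is a minimal sketch of linear overlap blending. chunked_apply and its parameters are hypothetical names, and fn is assumed to return exactly as many samples as it receives; the real model emits 48 kHz from 16 kHz input, so output offsets would have to be scaled by 3.

```python
def chunked_apply(x, chunk, overlap, fn):
    """Apply fn to overlapping windows of x and blend the overlaps linearly.

    Hypothetical helper for illustration only. x is a list of samples,
    chunk and overlap are window sizes in samples, fn maps a list to a
    list of the same length.
    """
    assert 0 < overlap < chunk
    hop = chunk - overlap
    out = [0.0] * len(x)
    norm = [0.0] * len(x)
    start = 0
    while True:
        end = min(start + chunk, len(x))
        y = fn(x[start:end])
        n = end - start
        for i in range(n):
            w = 1.0
            if start > 0 and i < overlap:            # fade in, except on the first window
                w = (i + 1) / (overlap + 1)
            if end < len(x) and i >= n - overlap:    # fade out, except on the last window
                w = min(w, (n - i) / (overlap + 1))
            out[start + i] += w * y[i]
            norm[start + i] += w
        if end == len(x):
            break
        start += hop
    # Dividing by the accumulated weights keeps the overall gain at 1.
    return [o / m for o, m in zip(out, norm)]


# With an identity "model" the blended result reproduces the input.
signal = [float(i) for i in range(100)]
restored = chunked_apply(signal, chunk=32, overlap=8, fn=lambda c: list(c))
print(max(abs(a - b) for a, b in zip(signal, restored)) < 1e-9)   # True
```

With a real model, each window sees different context, so the blended regions are where the seams mentioned above can appear.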

Python (pull weights from the Hub)

import numpy as np, soundfile as sf, torch, torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel
import dac

REPO, SSL, FE_SR, SR_OUT = "Scicom-intl/sidon-callcentre", "facebook/w2v-bert-2.0", 16000, 48000
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ck = torch.load(hf_hub_download(REPO, "fe_callcentre/fe_adapter_full.pt"), map_location="cpu")
ad, scale = ck["adapter"], ck["lora_alpha"] / ck["r"]
fe = Wav2Vec2BertModel.from_pretrained(SSL, num_hidden_layers=ck.get("layers", 24), layerdrop=0.0)
sd = fe.state_dict()                                     # merge LoRA -> base (no peft needed)
for p in sorted({k[:-len(".lora_A.default.weight")] for k in ad if k.endswith(".lora_A.default.weight")}):
    sd[p+".weight"] = sd[p+".weight"].float() + scale * (ad[p+".lora_B.default.weight"].float() @ ad[p+".lora_A.default.weight"].float())
    if p+".base_layer.bias" in ad: sd[p+".bias"] = ad[p+".base_layer.bias"].to(sd[p+".bias"].dtype)
fe.load_state_dict(sd); fe.to(dev).eval()

dck = torch.load(hf_hub_download(REPO, "decoder_callcentre/decoder_only.pt"), map_location="cpu")
dec = dac.model.dac.Decoder(input_channel=1024, channels=dck.get("dec_channels", 3072), rates=[8,5,4,3,2])
dec.load_state_dict(dck["decoder"]); dec.to(dev).eval()
proc = AutoFeatureExtractor.from_pretrained(SSL)

@torch.no_grad()
def restore(path, out="restored48k.wav"):                # single straight pass
    x, sr = sf.read(path, dtype="float32"); x = x.mean(1) if x.ndim > 1 else x
    if sr != FE_SR: x = torchaudio.functional.resample(torch.from_numpy(x)[None], sr, FE_SR)[0].numpy()
    x = x / (np.abs(x).max() + 1e-9) * 0.95
    feats = {k: v.to(dev) for k, v in proc(x, sampling_rate=FE_SR, return_tensors="pt").items()}
    y = dec(fe(**feats).last_hidden_state.transpose(1, 2)).squeeze().float().cpu().numpy()
    sf.write(out, y / (np.abs(y).max() + 1e-9) * 0.97, SR_OUT); print("wrote", out)

restore("your_call.wav")   # <-- your own telephony/call-centre audio
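For orientation, the sizes that restore() moves through can be estimated from numbers given in this card: 16 kHz input, about 50 feature frames per second, and decoder rates [8,5,4,3,2]. The sketch below is approximate; approx_shapes is a hypothetical helper, and exact frame counts depend on the feature extractor's framing and padding.

```python
from math import prod

FE_SR, SR_OUT = 16_000, 48_000   # sample rates used in the script above
FRAME_RATE = 50                  # feature frames per second (see Model details)
RATES = [8, 5, 4, 3, 2]          # decoder upsampling rates (see Model details)

UPSAMPLE = prod(RATES)           # output samples produced per feature frame

def approx_shapes(seconds):
    """Return (input samples, feature frames, output samples) for a clip length."""
    n_in = int(seconds * FE_SR)
    frames = int(seconds * FRAME_RATE)
    n_out = frames * UPSAMPLE
    return n_in, frames, n_out

print(UPSAMPLE)             # 960
print(approx_shapes(10))    # (160000, 500, 480000)
```

So a 10-second call becomes roughly a [500, 1024] feature matrix before the decoder expands it to 480,000 samples.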

Model details

  • FE: full 24-layer facebook/w2v-bert-2.0 + fresh LoRA (r=64, alpha=16, dropout=0.1, bias="lora_only", target_modules=["output_dense"]), trained by MSE distillation of a degraded signal's features toward a frozen teacher on the clean signal (~16M trainable).
  • Decoder: dac.model.dac.Decoder(input_channel=1024, channels=3072, rates=[8,5,4,3,2]) (188M, 50 fps x 960 = 48 kHz), trained with DAC multi-resolution mel + GAN (loss = 15*mel + 2*adv + 1*feat).
  • Degradation (train-time): telephone HP -> narrowband ceiling (8/11/12/16k) -> GSM / G.711-mu-law -> 16-40 kbps MP3 -> line noise + VoIP dropouts.
  • Teachers (clean 48 kHz): EARS + Expresso (studio) + DNSMOS-filtered multilingual HF datasets + DNSMOS-filtered Malaysian/Singaporean podcast & movie (Scicom-intl/sidon-callcentre-podcast).
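The training degradation code is not part of this card. To give a feel for one stage of the chain listed above, here is a minimal sketch of mu-law companding, the non-linearity behind G.711 mu-law. It uses the continuous formula with uniform 8-bit quantisation rather than the bit-exact segmented G.711 encoder, and the function names are illustrative.

```python
import math

MU = 255  # mu-law constant used by G.711

def mulaw_compress(x, mu=MU):
    """Map a sample in [-1, 1] to the companded domain [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def mulaw_roundtrip(x, bits=8):
    """Compress, quantise to 2**bits uniform levels, then expand again."""
    levels = 2 ** bits
    y = mulaw_compress(x)
    code = round((y + 1) / 2 * (levels - 1))      # integer code in 0..levels-1
    y_hat = code / (levels - 1) * 2 - 1
    return mulaw_expand(y_hat)

for sample in (0.01, 0.1, 0.5, -0.9):
    print(sample, round(mulaw_roundtrip(sample), 4))
```

The round trip keeps quiet samples relatively accurate and spends fewer levels on loud ones, which is part of the characteristic telephone coloration the model is trained to undo.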

License / intended use

cc-by-nc-4.0: research / non-commercial use. Built on facebook/w2v-bert-2.0 and Descript Audio Codec.
