MILFER

MILFER is a standalone PyTorch audio-to-audio model for speech-preserving audio restoration. It takes an input audio file, extracts self-supervised (SSL) speech features, and reconstructs a 48 kHz waveform with the bundled neural decoder.

The bundled checkpoint is milfer_lora100h_step001000.

Highlights

  • Pure PyTorch inference, no TorchScript runtime required.
  • CUDA fp16 inference by default when a CUDA GPU is available.
  • Accepts common audio formats supported by torchaudio, including wav and mp3.
  • Emits a mono 48 kHz wav file.
  • Tuned to preserve more game/dialogue sound character than the base checkpoint.

Quick Start

python milfer.py input.wav output.wav

To select CUDA fp16 explicitly:

python milfer.py input.mp3 output.wav --device cuda --precision fp16

For repeated inference in the same Python process, compile the feature model and run one warm-up pass first:

python milfer.py input.mp3 output.wav --device cuda --precision fp16 --compile-feature

The helper script run.sh does the same thing using the local Python environment:

./run.sh input.wav output.wav --device cuda --precision fp16
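To restore a whole folder, the commands above can be wrapped in a small Python driver. The following is a sketch, not part of the package: build_command and restore_folder are hypothetical helper names, and the code assumes milfer.py sits in the current working directory and accepts exactly the flags shown in Quick Start. Each file starts a fresh process, so the model is reloaded every time.

```python
import subprocess
import sys
from pathlib import Path


def build_command(src, dst, device="cuda", precision="fp16"):
    """Return the argv list for one milfer.py run (flags taken from Quick Start)."""
    return [
        sys.executable, "milfer.py", str(src), str(dst),
        "--device", device, "--precision", precision,
    ]


def restore_folder(in_dir, out_dir, patterns=("*.wav", "*.mp3")):
    """Run milfer.py once per matching input file and return the output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pattern in patterns:
        for src in sorted(Path(in_dir).glob(pattern)):
            # Keep the full input name so a.wav and a.mp3 do not collide.
            dst = out_dir / f"{src.name}.restored.wav"
            subprocess.run(build_command(src, dst), check=True)
            written.append(dst)
    return written
```

A call such as restore_folder("clips", "restored") would process every wav and mp3 file in clips/ and stop at the first failing run, because check=True raises on a non-zero exit code.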

Files

milfer.py
run.sh
weights/
  decoder_state_dict.pt
  feature_predictor_config.json
  feature_predictor_state_dict.pt
  milfer_config.json
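Before the first run it can help to confirm that the package was unpacked completely. The sketch below only checks for the file names listed above; missing_files is a hypothetical helper, and it says nothing about whether the weights themselves are valid.

```python
from pathlib import Path

# File names copied from the listing above.
EXPECTED_FILES = [
    "milfer.py",
    "run.sh",
    "weights/decoder_state_dict.pt",
    "weights/feature_predictor_config.json",
    "weights/feature_predictor_state_dict.pt",
    "weights/milfer_config.json",
]


def missing_files(package_dir="."):
    """Return the expected files that are absent under package_dir."""
    root = Path(package_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling missing_files() from the package directory returns an empty list when everything is in place.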

Requirements

Tested with:

  • Python 3.10
  • PyTorch 2.6.0 + CUDA 12.4
  • torchaudio 2.6.0
  • transformers
  • soundfile
  • descript-audio-codec
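To compare a local environment against the tested versions, something like the following may be useful. It is a sketch that assumes the packages are installed under the PyPI distribution names shown in the dictionary; the two pinned versions come from the list above, and None marks packages for which the list gives no version. installed_versions and report are hypothetical helper names.

```python
from importlib import metadata

# Distribution names as assumed on PyPI; versions come from "Tested with".
TESTED = {
    "torch": "2.6.0",
    "torchaudio": "2.6.0",
    "transformers": None,
    "soundfile": None,
    "descript-audio-codec": None,
}


def installed_versions(packages=TESTED):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(packages=TESTED):
    """Return one status line per package, flagging version mismatches."""
    lines = []
    for name, have in installed_versions(packages).items():
        want = packages[name]
        if have is None:
            status = "not installed"
        elif want is not None and not have.startswith(want):
            status = f"{have} (tested with {want})"
        else:
            status = have
        lines.append(f"{name}: {status}")
    return lines
```

Printing "\n".join(report()) gives a quick overview. A prefix match is used so that local build tags such as a CUDA suffix on the torch version do not count as a mismatch.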

Clean Input Check

The table below measures how much MILFER changes already-clean clips. It is a sanity check, not a denoising benchmark.

Evaluation set: prompts_5kh, 250 mono wav clips, 44.1 kHz, 19.0 minutes total. Higher is better for STOI, eSTOI, PESQ-WB, and MOS predictors. Lower is better for LSD and clipped samples.

Subset           Files  STOI    eSTOI   PESQ-WB  LSD 16 kHz  Clipped Samples
all clips        250    0.9241  0.8719  2.1653   11.825 dB   0.0006%
duration >= 1 s  232    0.9288  0.8767  2.1917   11.683 dB   0.0005%

No-reference MOS predictors on the original and processed outputs:

Subset           Audio     UTMOS    DistillMOS  NISQA-TTS
all clips        original  2.9998   3.9392      3.6311
all clips        MILFER    2.9741   3.8080      3.7021
all clips        delta     -0.0258  -0.1313     +0.0710
duration >= 1 s  original  3.0120   3.9829      3.6603
duration >= 1 s  MILFER    2.9977   3.8554      3.7483
duration >= 1 s  delta     -0.0143  -0.1275     +0.0880

Very short clips can make intelligibility metrics unstable, so the filtered row excludes clips shorter than one second.
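The one-second filter can be reproduced with the standard library alone. This is a sketch of the idea rather than the evaluation script: wav_duration_seconds and split_by_duration are hypothetical names, and the wave module only reads integer PCM wav files.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM wav file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def split_by_duration(paths, min_seconds=1.0):
    """Return (kept, dropped): clips at least min_seconds long, and the rest."""
    kept, dropped = [], []
    for path in paths:
        if wav_duration_seconds(path) >= min_seconds:
            kept.append(path)
        else:
            dropped.append(path)
    return kept, dropped
```

The comparison is inclusive, matching the "duration >= 1 s" label used in the tables.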

Degraded-Input Evaluation

For a restoration-style benchmark, the clean prompts were synthetically degraded and then processed with MILFER. Each metric compares either the degraded input or the MILFER output against the original clean prompt. The tables use the duration >= 1 s subset: 232 clips, 18.8 minutes total.

Degradation profiles:

  • noisy_room: additive noise, room response, light band-limiting.
  • radio_clip: band-pass channel, saturation, quantization, hiss.
  • mixed_hard: noise, reverb, band-limiting, downsampling, saturation.
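To give a feel for what such a profile involves, here is a minimal pure-Python sketch of three of the listed operations: saturation, quantization, and hiss. It is not the pipeline used for the evaluation. All function names and every parameter value (drive, bit depth, SNR) are illustrative assumptions, hiss is approximated by white Gaussian noise, and the band-pass step is left out.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio."""
    signal_power = sum(x * x for x in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in samples]


def saturate(samples, drive):
    """Soft-clip with tanh; a larger drive compresses peaks more strongly."""
    return [math.tanh(drive * x) for x in samples]


def quantize(samples, bits):
    """Round to a uniform grid with step 1 / 2**(bits - 1), clamped to [-1, 1]."""
    levels = 2 ** (bits - 1)
    return [max(-1.0, min(1.0, round(x * levels) / levels)) for x in samples]


def radio_clip_like(samples, seed=0):
    """Chain saturation, quantization, and hiss (illustrative values only)."""
    rng = random.Random(seed)
    out = saturate(samples, drive=3.0)
    out = quantize(out, bits=8)
    return add_noise(out, snr_db=25.0, rng=rng)
```

Samples are assumed to be floats in the range [-1, 1]. Fixing the seed keeps the degraded clip reproducible, which matters when the same input is compared before and after restoration.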

Full-reference metrics:

Profile     Audio     STOI     eSTOI    PESQ-WB  LSD 16 kHz
noisy_room  degraded  0.8830   0.7128   1.2104   18.121 dB
noisy_room  MILFER    0.9020   0.8145   1.7757   13.174 dB
noisy_room  delta     +0.0190  +0.1017  +0.5653  -4.947 dB
mixed_hard  degraded  0.8617   0.7079   1.1851   23.143 dB
mixed_hard  MILFER    0.8948   0.8068   1.7321   13.904 dB
mixed_hard  delta     +0.0331  +0.0989  +0.5470  -9.239 dB
radio_clip  degraded  0.9185   0.8528   1.8765   19.167 dB
radio_clip  MILFER    0.9040   0.8397   1.9412   14.237 dB
radio_clip  delta     -0.0145  -0.0131  +0.0647  -4.930 dB
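Because LSD improves when it goes down while the other metrics improve when they go up, the sign of a delta alone does not say whether MILFER helped. The short sketch below recomputes the radio_clip deltas (MILFER minus degraded) and labels each one. The numbers are copied from the table; summarize is a hypothetical helper name.

```python
# Direction of improvement per metric, as stated in the Clean Input Check section.
HIGHER_IS_BETTER = {"STOI": True, "eSTOI": True, "PESQ-WB": True, "LSD": False}

# radio_clip rows copied from the table (LSD in dB).
degraded = {"STOI": 0.9185, "eSTOI": 0.8528, "PESQ-WB": 1.8765, "LSD": 19.167}
restored = {"STOI": 0.9040, "eSTOI": 0.8397, "PESQ-WB": 1.9412, "LSD": 14.237}


def summarize(before, after):
    """Return {metric: (delta, improved)} where delta = after - before."""
    summary = {}
    for name, higher in HIGHER_IS_BETTER.items():
        delta = after[name] - before[name]
        improved = delta > 0 if higher else delta < 0
        summary[name] = (round(delta, 4), improved)
    return summary


for name, (delta, improved) in summarize(degraded, restored).items():
    print(f"{name:8s} {delta:+.4f} {'improved' if improved else 'worse'}")
```

For radio_clip this flags STOI and eSTOI as slightly worse after processing, while PESQ-WB and LSD improve, so the profile is a mixed result rather than a uniform gain.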

No-reference MOS predictors:

Profile     Audio     UTMOS    DistillMOS  NISQA-TTS
noisy_room  degraded  1.4220   2.7625      1.8844
noisy_room  MILFER    3.0324   3.7896      3.7557
noisy_room  delta     +1.6104  +1.0270     +1.8713
mixed_hard  degraded  1.3709   2.4796      2.1738
mixed_hard  MILFER    2.9478   3.7044      3.7243
mixed_hard  delta     +1.5769  +1.2248     +1.5505
radio_clip  degraded  1.4901   2.9555      2.5081
radio_clip  MILFER    2.8414   3.7817      3.5840
radio_clip  delta     +1.3513  +0.8262     +1.0759

Notes

  • Input audio is mixed to mono and resampled to 16 kHz for feature extraction.
  • Output is written as mono 48 kHz PCM wav.
  • Very long files can be processed, but peak memory grows with input duration.
  • This is an experimental checkpoint. It can still change ambience, effects, music, and non-speech sounds.
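A quick way to confirm that an output file matches the format described in these notes is to read its header with the standard library. This is a sketch under one assumption: the file is integer PCM, which is what Python's wave module can open. describe_wav and looks_like_milfer_output are hypothetical helper names.

```python
import wave


def describe_wav(path):
    """Read channel count, sample rate, and duration from a PCM wav header."""
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "seconds": wav.getnframes() / wav.getframerate(),
        }


def looks_like_milfer_output(path):
    """True if the file is mono 48 kHz, as the notes above describe."""
    info = describe_wav(path)
    return info["channels"] == 1 and info["sample_rate"] == 48000
```

If wave.open raises wave.Error, the file is probably stored in a wav variant the module does not support, and a library such as soundfile would be the better tool for the check.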

License

License is not specified in this package. Set the final license field before publishing if you need redistributable model weights.
