MILFER
MILFER is a standalone PyTorch audio-to-audio model for speech-preserving audio restoration. It takes an input audio file, extracts SSL speech features, and reconstructs a 48 kHz waveform with the bundled neural decoder.
The bundled checkpoint is milfer_lora100h_step001000.
Highlights
- Pure PyTorch inference, no TorchScript runtime required.
- CUDA fp16 inference by default when a CUDA GPU is available.
- Accepts common audio formats supported by torchaudio, including wav and mp3.
- Emits a mono 48 kHz wav file.
- Tuned to preserve more game/dialogue sound character than the base checkpoint.
Quick Start
python milfer.py input.wav output.wav
To request CUDA fp16 explicitly:
python milfer.py input.mp3 output.wav --device cuda --precision fp16
For repeated inference in the same Python process, compile the feature model and run one warm-up pass first:
python milfer.py input.mp3 output.wav --device cuda --precision fp16 --compile-feature
The helper script run.sh wraps the same command and uses the local Python environment:
./run.sh input.wav output.wav --device cuda --precision fp16
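To restore a whole folder, the command line above can be driven from a small script. The following is a minimal sketch, not part of the package: `build_command` and `restore_folder` are hypothetical helper names, and it assumes `milfer.py` sits in the current working directory and accepts exactly the flags shown in Quick Start.

```python
import subprocess
import sys
from pathlib import Path


def build_command(src, dst, device="cuda", precision="fp16", script="milfer.py"):
    """Build the argument list for one milfer.py run (flags taken from Quick Start)."""
    return [sys.executable, script, str(src), str(dst),
            "--device", device, "--precision", precision]


def restore_folder(in_dir, out_dir, patterns=("*.wav", "*.mp3"), dry_run=False):
    """Run milfer.py once per matching file and return the commands that were built."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for pattern in patterns:
        for src in sorted(in_dir.glob(pattern)):
            dst = out_dir / (src.stem + ".wav")  # output is always written as wav
            cmd = build_command(src, dst)
            commands.append(cmd)
            if not dry_run:
                subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands
```

Each call starts a new Python process and reloads the weights, so this favors simplicity over throughput. Inputs that share a stem (for example `a.wav` and `a.mp3`) would overwrite each other's output in this sketch.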
Files
milfer.py
run.sh
weights/
decoder_state_dict.pt
feature_predictor_config.json
feature_predictor_state_dict.pt
milfer_config.json
Requirements
Tested with:
- Python 3.10
- PyTorch 2.6.0 + CUDA 12.4
- torchaudio 2.6.0
- transformers
- soundfile
- descript-audio-codec
Clean Input Check
The table below measures how much MILFER changes already-clean clips. It is a sanity check, not a denoising benchmark.
Evaluation set: prompts_5kh, 250 mono wav clips, 44.1 kHz, 19.0 minutes total.
Higher is better for STOI, eSTOI, PESQ-WB, and MOS predictors. Lower is better
for LSD and clipped samples.
| Subset | Files | STOI | eSTOI | PESQ-WB | LSD 16 kHz | Clipped Samples |
|---|---|---|---|---|---|---|
| all clips | 250 | 0.9241 | 0.8719 | 2.1653 | 11.825 dB | 0.0006% |
| duration >= 1 s | 232 | 0.9288 | 0.8767 | 2.1917 | 11.683 dB | 0.0005% |
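The README does not state how the Clipped Samples column was computed. One plausible reading, shown here only as an assumption, is the percentage of output samples whose magnitude reaches full scale. `clipped_percentage` and its `threshold` parameter are hypothetical names.

```python
def clipped_percentage(samples, threshold=1.0):
    """Percentage of samples whose magnitude is at or above `threshold`.

    Assumes floating-point samples normalized to [-1.0, 1.0]; the threshold
    used for the table above is not documented.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return 100.0 * clipped / len(samples)


# Two of the four samples reach or exceed full scale.
print(clipped_percentage([0.0, 0.5, -1.0, 1.2]))  # 50.0
```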
No-reference MOS predictors on the original and processed outputs (delta is MILFER minus original):
| Subset | Audio | UTMOS | DistillMOS | NISQA-TTS |
|---|---|---|---|---|
| all clips | original | 2.9998 | 3.9392 | 3.6311 |
| all clips | MILFER | 2.9741 | 3.8080 | 3.7021 |
| all clips | delta | -0.0258 | -0.1313 | +0.0710 |
| duration >= 1 s | original | 3.0120 | 3.9829 | 3.6603 |
| duration >= 1 s | MILFER | 2.9977 | 3.8554 | 3.7483 |
| duration >= 1 s | delta | -0.0143 | -0.1275 | +0.0880 |
Very short clips can make intelligibility metrics unstable, so the filtered row excludes clips shorter than one second.
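The duration filter can be reproduced with the standard library alone. This is a sketch under the assumption that the clips are PCM wav files (the `wave` module does not read mp3 or float wav); `wav_duration_seconds` and `split_by_duration` are hypothetical names.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM wav file in seconds, read from its header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def split_by_duration(paths, min_seconds=1.0):
    """Split paths into (kept, too_short) using a minimum duration in seconds."""
    kept, too_short = [], []
    for path in paths:
        if wav_duration_seconds(path) >= min_seconds:
            kept.append(path)
        else:
            too_short.append(path)
    return kept, too_short
```

A clip of exactly one second is kept, which matches the `duration >= 1 s` label used in the tables.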
Degraded-Input Evaluation
For a benchmark closer to the restoration use case, the clean prompts were synthetically degraded and
then processed with MILFER. Each metric compares either the degraded input or the
MILFER output against the original clean prompt. The tables below use the
duration >= 1 s subset: 232 clips, 18.8 minutes total.
Degradation profiles:
- noisy_room: additive noise, room response, light band-limiting.
- radio_clip: band-pass channel, saturation, quantization, hiss.
- mixed_hard: noise, reverb, band-limiting, downsampling, saturation.
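The exact degradation pipeline is not included in this package. As an illustration of just one ingredient, additive noise, the sketch below mixes white Gaussian noise into a signal at a chosen signal-to-noise ratio. The function name, the use of white noise, and the SNR parameter are all assumptions; plain Python lists are used only to keep the example dependency-free.

```python
import math
import random


def add_noise_at_snr(clean, snr_db, seed=0):
    """Return `clean` plus white Gaussian noise scaled to the requested SNR in dB.

    `clean` must be a non-empty, non-silent sequence of float samples.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose a gain so that signal_power / scaled_noise_power == 10 ** (snr_db / 10).
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [x + gain * n for x, n in zip(clean, noise)]
```

Room response, band-limiting, saturation, and quantization would be separate stages and are not covered here.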
Full-reference metrics:
| Profile | STOI | eSTOI | PESQ-WB | LSD 16 kHz |
|---|---|---|---|---|
| noisy_room degraded | 0.8830 | 0.7128 | 1.2104 | 18.121 dB |
| noisy_room MILFER | 0.9020 | 0.8145 | 1.7757 | 13.174 dB |
| noisy_room delta | +0.0190 | +0.1017 | +0.5653 | -4.947 dB |
| mixed_hard degraded | 0.8617 | 0.7079 | 1.1851 | 23.143 dB |
| mixed_hard MILFER | 0.8948 | 0.8068 | 1.7321 | 13.904 dB |
| mixed_hard delta | +0.0331 | +0.0989 | +0.5470 | -9.239 dB |
| radio_clip degraded | 0.9185 | 0.8528 | 1.8765 | 19.167 dB |
| radio_clip MILFER | 0.9040 | 0.8397 | 1.9412 | 14.237 dB |
| radio_clip delta | -0.0145 | -0.0131 | +0.0647 | -4.930 dB |
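Each delta row is the MILFER value minus the degraded value, so a positive delta is an improvement for STOI, eSTOI, and PESQ-WB, while a negative delta is an improvement for LSD. As a worked check using the noisy_room rows from the table (`delta_row` is a hypothetical helper):

```python
def delta_row(degraded, restored, digits=4):
    """Per-metric difference restored - degraded, rounded for display."""
    return {name: round(restored[name] - degraded[name], digits) for name in degraded}


noisy_room_degraded = {"STOI": 0.8830, "eSTOI": 0.7128, "PESQ-WB": 1.2104, "LSD": 18.121}
noisy_room_milfer = {"STOI": 0.9020, "eSTOI": 0.8145, "PESQ-WB": 1.7757, "LSD": 13.174}

delta = delta_row(noisy_room_degraded, noisy_room_milfer)
for name, value in delta.items():
    print(f"{name}: {value:+.4f}")
```

The printed values agree with the noisy_room delta row. Deltas recomputed from the rounded table entries can differ from a published delta in the last digit, presumably because the published deltas were computed before rounding.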
No-reference MOS predictors:
| Profile | UTMOS | DistillMOS | NISQA-TTS |
|---|---|---|---|
| noisy_room degraded | 1.4220 | 2.7625 | 1.8844 |
| noisy_room MILFER | 3.0324 | 3.7896 | 3.7557 |
| noisy_room delta | +1.6104 | +1.0270 | +1.8713 |
| mixed_hard degraded | 1.3709 | 2.4796 | 2.1738 |
| mixed_hard MILFER | 2.9478 | 3.7044 | 3.7243 |
| mixed_hard delta | +1.5769 | +1.2248 | +1.5505 |
| radio_clip degraded | 1.4901 | 2.9555 | 2.5081 |
| radio_clip MILFER | 2.8414 | 3.7817 | 3.5840 |
| radio_clip delta | +1.3513 | +0.8262 | +1.0759 |
Notes
- Input audio is mixed to mono and resampled to 16 kHz for feature extraction.
- Output is written as mono 48 kHz PCM wav.
- Very long files can be processed, but peak memory depends on input duration.
- This is an experimental checkpoint. It can still change ambience, effects, music, and non-speech sounds.
License
License is not specified in this package. Set the final license field before publishing if you need redistributable model weights.