MILFER

MILFER is a standalone PyTorch audio-to-audio model for speech-preserving audio restoration. It takes an input audio file, extracts self-supervised (SSL) speech features, and reconstructs a 48 kHz waveform with the bundled neural decoder.

The bundled checkpoint is milfer_lora100h_step001000.

Highlights

Pure PyTorch inference, no TorchScript runtime required.
CUDA fp16 inference by default when a CUDA GPU is available.
Accepts common audio formats supported by torchaudio, including wav and mp3.
Emits a mono 48 kHz wav file.
Tuned to preserve more game/dialogue sound character than the base checkpoint.
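
The default device and precision choice listed in the highlights could be sketched as follows. This is an illustration only: the function name pick_defaults is hypothetical, and the CPU fp32 fallback (including the "cpu" and "fp32" values) is an assumption, since this README only documents the CUDA fp16 case.

```python
def pick_defaults(cuda_available=None):
    """Return a (device, precision) pair mirroring the documented default."""
    if cuda_available is None:
        # Imported lazily so the helper can be called with an explicit flag
        # on machines where PyTorch is not installed.
        import torch
        cuda_available = torch.cuda.is_available()
    if cuda_available:
        # Documented behavior: fp16 on CUDA when a GPU is present.
        return "cuda", "fp16"
    # Assumption, not stated in this README: CPU fp32 otherwise.
    return "cpu", "fp32"
```

Calling pick_defaults() with no argument queries torch.cuda.is_available(); passing True or False skips the query.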

Quick Start

python milfer.py input.wav output.wav

To select CUDA fp16 explicitly:

python milfer.py input.mp3 output.wav --device cuda --precision fp16

For repeated inference in the same Python process, compile the feature model and run one warm-up pass first:

python milfer.py input.mp3 output.wav --device cuda --precision fp16 --compile-feature

The helper script does the same thing with the local Python environment:

./run.sh input.wav output.wav --device cuda --precision fp16
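
To restore a whole folder, the commands above can be wrapped in a small driver. The sketch below is one possible approach, not part of the package: build_command and restore_folder are hypothetical helper names, and only the flags shown in this section (--device, --precision) are used.

```python
import subprocess
import sys
from pathlib import Path


def build_command(src, dst, device="cuda", precision="fp16", script="milfer.py"):
    """Build the milfer.py command line for one input/output pair."""
    return [sys.executable, script, str(src), str(dst),
            "--device", device, "--precision", precision]


def restore_folder(in_dir, out_dir, **kwargs):
    """Run milfer.py once per .wav/.mp3 file found in in_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).iterdir()):
        if src.suffix.lower() not in {".wav", ".mp3"}:
            continue
        # Output is always wav; a.wav and a.mp3 would map to the same name.
        dst = out_dir / (src.stem + ".wav")
        subprocess.run(build_command(src, dst, **kwargs), check=True)
```

Because every call starts a new Python process, the model is loaded again for each file, so this form trades speed for simplicity.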

Files

milfer.py
run.sh
weights/
  decoder_state_dict.pt
  feature_predictor_config.json
  feature_predictor_state_dict.pt
  milfer_config.json

Requirements

Tested with:

Python 3.10
PyTorch 2.6.0 + CUDA 12.4
torchaudio 2.6.0
transformers
soundfile
descript-audio-codec

Clean Input Check

The table below measures how much MILFER changes already-clean clips. It is a sanity check, not a denoising benchmark.

Evaluation set: prompts_5kh, 250 mono wav clips, 44.1 kHz, 19.0 minutes total. Higher is better for STOI, eSTOI, PESQ-WB, and MOS predictors. Lower is better for LSD and clipped samples.

Subset	Files	STOI	eSTOI	PESQ-WB	LSD 16 kHz	Clipped Samples
all clips	250	0.9241	0.8719	2.1653	11.825 dB	0.0006%
duration >= 1 s	232	0.9288	0.8767	2.1917	11.683 dB	0.0005%

No-reference MOS predictors on the original and processed outputs:

Subset	Audio	UTMOS	DistillMOS	NISQA-TTS
all clips	original	2.9998	3.9392	3.6311
all clips	MILFER	2.9741	3.8080	3.7021
all clips	delta	-0.0258	-0.1313	+0.0710
duration >= 1 s	original	3.0120	3.9829	3.6603
duration >= 1 s	MILFER	2.9977	3.8554	3.7483
duration >= 1 s	delta	-0.0143	-0.1275	+0.0880

Very short clips can make intelligibility metrics unstable, so the duration >= 1 s rows exclude clips shorter than one second.
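
A duration filter of this kind can be reproduced with the standard library alone. The sketch below is an assumption about how such a filter might look, not the evaluation script itself; the helper names are hypothetical, and the wave module only reads PCM wav files.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM wav file, computed as frame count / sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def keep_clip(path, min_seconds=1.0):
    """True when the clip is at least min_seconds long."""
    return wav_duration_seconds(path) >= min_seconds
```

Applied to the 250-clip set, a filter like this would be expected to keep the 232 clips reported in the duration >= 1 s rows.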

Degraded-Input Evaluation

For a benchmark closer to an audio-cleanup use case, the clean prompts were synthetically degraded and then processed with MILFER. Each metric compares either the degraded input or the MILFER output against the original clean prompt. The tables use the duration >= 1 s subset: 232 clips, 18.8 minutes total.

Degradation profiles:

noisy_room: additive noise, room response, light band-limiting.
radio_clip: band-pass channel, saturation, quantization, hiss.
mixed_hard: noise, reverb, band-limiting, downsampling, saturation.
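
To give a feel for what such a profile does to a signal, here is a minimal sketch of a radio_clip-like chain covering saturation, quantization, and hiss. It is not the degradation code used for this evaluation: the function name, the parameter values, and the choice of tanh saturation are all assumptions, and the band-pass stage is left out for brevity.

```python
import math
import random


def radio_like(samples, drive=4.0, bits=8, hiss_std=0.005, seed=0):
    """Apply saturation, quantization, and hiss to samples in [-1, 1]."""
    rng = random.Random(seed)          # fixed seed keeps the hiss repeatable
    levels = 2 ** (bits - 1)           # quantization steps per polarity
    out = []
    for x in samples:
        # Soft saturation, normalized so that full scale stays at full scale.
        y = math.tanh(drive * x) / math.tanh(drive)
        # Uniform quantization to the chosen bit depth.
        y = round(y * levels) / levels
        # Additive Gaussian hiss.
        y += rng.gauss(0.0, hiss_std)
        # Clamp so the result remains a valid normalized sample.
        out.append(max(-1.0, min(1.0, y)))
    return out
```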

Full-reference metrics:

Profile	STOI	eSTOI	PESQ-WB	LSD 16 kHz
noisy_room degraded	0.8830	0.7128	1.2104	18.121 dB
noisy_room MILFER	0.9020	0.8145	1.7757	13.174 dB
noisy_room delta	+0.0190	+0.1017	+0.5653	-4.947 dB
mixed_hard degraded	0.8617	0.7079	1.1851	23.143 dB
mixed_hard MILFER	0.8948	0.8068	1.7321	13.904 dB
mixed_hard delta	+0.0331	+0.0989	+0.5470	-9.239 dB
radio_clip degraded	0.9185	0.8528	1.8765	19.167 dB
radio_clip MILFER	0.9040	0.8397	1.9412	14.237 dB
radio_clip delta	-0.0145	-0.0131	+0.0647	-4.930 dB
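
Each delta row is the MILFER value minus the degraded value for the same metric. The sketch below recomputes the noisy_room deltas from the table values; delta_row is a hypothetical helper. Where a delta recomputed from rounded table values differs from a reported delta in the last digit, the reported value was presumably computed before rounding.

```python
def delta_row(degraded, restored, digits=4):
    """Per-metric difference (restored - degraded), rounded for display."""
    return {name: round(restored[name] - degraded[name], digits)
            for name in degraded}


# Values copied from the noisy_room rows of the full-reference table.
noisy_room_degraded = {"STOI": 0.8830, "eSTOI": 0.7128, "PESQ-WB": 1.2104}
noisy_room_milfer = {"STOI": 0.9020, "eSTOI": 0.8145, "PESQ-WB": 1.7757}

noisy_room_delta = delta_row(noisy_room_degraded, noisy_room_milfer)
print(noisy_room_delta)
```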

No-reference MOS predictors:

Profile	UTMOS	DistillMOS	NISQA-TTS
noisy_room degraded	1.4220	2.7625	1.8844
noisy_room MILFER	3.0324	3.7896	3.7557
noisy_room delta	+1.6104	+1.0270	+1.8713
mixed_hard degraded	1.3709	2.4796	2.1738
mixed_hard MILFER	2.9478	3.7044	3.7243
mixed_hard delta	+1.5769	+1.2248	+1.5505
radio_clip degraded	1.4901	2.9555	2.5081
radio_clip MILFER	2.8414	3.7817	3.5840
radio_clip delta	+1.3513	+0.8262	+1.0759

Notes

Input audio is mixed to mono and resampled to 16 kHz for feature extraction.
Output is written as mono 48 kHz PCM wav.
Very long files can be processed, but peak memory depends on input duration.
This is an experimental checkpoint. It can still change ambience, effects, music, and non-speech sounds.

License

No license is specified in this package. Set the license field before publishing if the model weights need to be redistributable.
