Audio-to-Audio is a family of tasks in which the input is an audio clip and the output is one or more generated audio clips. Example tasks include speech enhancement and source separation.

Audio-to-Audio Model

About Audio-to-Audio

Use Cases

Speech Enhancement (Noise removal)

Speech enhancement is largely self-explanatory: it improves the quality of an audio recording by removing noise. Several libraries support this task, including SpeechBrain, Asteroid, and ESPnet. Here is a simple example using SpeechBrain:

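from speechbrain.pretrained import SpectralMaskEnhancement
# Assumption: same checkpoint as the API example in this section; the input file name is illustrative.
model = SpectralMaskEnhancement.from_hparams(source="speechbrain/mtl-mimic-voicebank")
enhanced = model.enhance_file("sample1.flac")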

Alternatively, you can use Inference Endpoints to solve this task:

import os
import requests

# Assumption: the access token is stored in the HF_TOKEN environment variable.
API_TOKEN = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/speechbrain/mtl-mimic-voicebank"

def query(filename):
    # Send the raw audio bytes to the hosted model and return the parsed JSON response.
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    return response.json()

data = query("sample1.flac")
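
The JSON returned by `query` still has to be turned back into audio. The sketch below assumes, without confirmation from the text above, that the response is a list of objects with `label`, `content-type`, and a base64-encoded `blob` field. The helper name `save_outputs` and the sample payload are hypothetical, so inspect the actual response before relying on this.

```python
import base64
import os
import tempfile

def save_outputs(items, out_dir):
    """Decode each base64 audio blob and write it to out_dir; return the paths."""
    paths = []
    for i, item in enumerate(items):
        # Assumed fields: "label", "content-type" (e.g. "audio/flac"), "blob".
        extension = item["content-type"].split("/")[-1]
        path = os.path.join(out_dir, f"{item['label']}_{i}.{extension}")
        with open(path, "wb") as f:
            f.write(base64.b64decode(item["blob"]))
        paths.append(path)
    return paths

# Hypothetical payload standing in for the value returned by query().
fake_response = [
    {
        "label": "label_0",
        "content-type": "audio/flac",
        "blob": base64.b64encode(b"fake audio bytes").decode("ascii"),
    },
]
saved = save_outputs(fake_response, tempfile.mkdtemp())
print(saved)
```

With a real response, `fake_response` would be replaced by the value returned from `query`.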

You can use huggingface.js to run inference with audio-to-audio models on the Hugging Face Hub:

import { HfInference } from "@huggingface/inference";

const inference = new HfInference(HF_TOKEN);
await inference.audioToAudio({
    data: await (await fetch("sample.flac")).blob(),
    model: "speechbrain/sepformer-wham",
});

Audio Source Separation

Audio Source Separation allows you to isolate different sounds from individual sources. For example, if you have an audio file with multiple people speaking, you can get an audio file for each of them. You can then use an Automatic Speech Recognition system to extract the text from each of these sources as an initial step for your system!

Audio-to-Audio can also be used to remove noise from audio files: you get one audio file for the person speaking and another for the noise. This is also useful when you have multi-person audio with some noise: you can get one audio file for each person and one for the noise.
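
As a minimal sketch of the "one audio file per source" idea, the snippet below writes each separated waveform to its own WAV file using only the Python standard library. The two sine tones are stand-ins for real model output, and the helper names (`write_wav`, `save_sources`), the 8 kHz sample rate, and the file naming scheme are illustrative assumptions rather than part of any library.

```python
import math
import os
import struct
import tempfile
import wave

def write_wav(path, samples, sample_rate=8000):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wf.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)

def save_sources(sources, prefix, sample_rate=8000):
    """Save each separated source to '<prefix>_source<i>.wav'; return the paths."""
    paths = []
    for i, samples in enumerate(sources, start=1):
        path = f"{prefix}_source{i}.wav"
        write_wav(path, samples, sample_rate)
        paths.append(path)
    return paths

# Stand-in for model output: two hypothetical separated sources (pure tones).
sr = 8000
sources = [
    [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)],
    [0.5 * math.sin(2 * math.pi * 330 * n / sr) for n in range(sr)],
]
paths = save_sources(sources, os.path.join(tempfile.mkdtemp(), "mixture"), sr)
print(paths)
```

Each resulting file could then be passed to an Automatic Speech Recognition system, as described above.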

Training a model for your own data

If you want to learn how to train models for the Audio-to-Audio task, we recommend the following tutorials:

Compatible libraries

Models for Audio-to-Audio

Datasets for Audio-to-Audio

Note: 512-element X-vector embeddings of speakers from the CMU ARCTIC dataset.

Spaces using Audio-to-Audio

Note: An application for speech separation.

Note: An application for audio style transfer.

Metrics for Audio-to-Audio
The Signal-to-Noise Ratio (SNR) is the ratio of the target signal power to the background noise power. Expressed in decibels, it is 10 times the base-10 logarithm of that ratio, so higher values indicate a cleaner signal.
The Signal-to-Distortion Ratio (SDR) is the ratio of the target signal power to the combined power of the noise, interference, and artifact errors.
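
To make the two metrics above concrete, here is a small standard-library sketch. It assumes both ratios are expressed in decibels as 10·log10 of a power ratio, and it uses a simplified SDR that treats every difference between the estimate and the reference as distortion (it does not split the error into noise, interference, and artifact terms). The function names and the toy signals are illustrative.

```python
import math

def power(x):
    """Mean power (average squared amplitude) of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(power(signal) / power(noise))

def sdr_db(reference, estimate):
    """Simplified SDR in dB: all of (estimate - reference) counts as distortion."""
    distortion = [e - r for r, e in zip(reference, estimate)]
    return 10.0 * math.log10(power(reference) / power(distortion))

# Toy data: a 440 Hz tone as the target, a quieter 1 kHz tone standing in for noise.
sr = 16000
clean = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
noise = [0.1 * math.sin(2 * math.pi * 1000 * n / sr) for n in range(sr)]
noisy = [c + n for c, n in zip(clean, noise)]

print(round(snr_db(clean, noise), 2))  # about 20 dB: the power ratio is 0.5 / 0.005 = 100
print(round(sdr_db(clean, noisy), 2))  # same value here, since the only distortion is the added noise
```

In practice the estimate comes from a model and the reference from a clean recording, so the two values generally differ.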