---
license: apache-2.0
---

A [SoundStream](https://arxiv.org/abs/2107.03312) decoder to reconstruct audio from a mel-spectrogram.

## Overview

This model is a SoundStream decoder that inverts mel-spectrograms computed with the specific hyperparameters defined in the example below. It was trained on music data and used in [Multi-instrument Music Synthesis with Spectrogram Diffusion](https://arxiv.org/abs/2206.05408) (ISMIR 2022).

A typical use case is to simplify music generation by predicting mel-spectrograms instead of a raw waveform, and then using this model to reconstruct the audio.

If you use it, please consider citing:

```bibtex
@article{zeghidour2021soundstream,
  title={Soundstream: An end-to-end neural audio codec},
  author={Zeghidour, Neil and Luebs, Alejandro and Omran, Ahmed and Skoglund, Jan and Tagliasacchi, Marco},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={30},
  pages={495--507},
  year={2021},
  publisher={IEEE}
}
```

## Example Use

```python
import numpy as np
from diffusers import OnnxRuntimeModel

# Hyperparameters the input mel-spectrogram must be computed with.
SAMPLE_RATE = 16000
N_FFT = 1024
HOP_LENGTH = 320
WIN_LENGTH = 640
N_MEL_CHANNELS = 128
MEL_FMIN = 0.0
MEL_FMAX = SAMPLE_RATE // 2
CLIP_VALUE_MIN = 1e-5
CLIP_VALUE_MAX = 1e8

# Placeholder: a mel-spectrogram (NumPy array) computed with the settings above.
mel = ...

melgan = OnnxRuntimeModel.from_pretrained("kashif/soundstream_mel_decoder")

# Cast to float32 before passing the spectrogram to the decoder.
audio = melgan(input_features=mel.astype(np.float32))
```
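
The example above leaves `mel` unspecified. As a rough illustration of how the listed hyperparameters fit together, here is a minimal NumPy-only sketch of a log-mel spectrogram. It is not the feature extraction used to train this decoder: the HTK-style mel scale, the symmetric Hann window, the use of magnitude rather than power, the natural logarithm, the absence of edge padding, and the `(batch, frames, mel)` layout are all assumptions. The helpers `hz_to_mel`, `mel_to_hz`, `mel_filterbank`, and `log_mel_spectrogram` are hypothetical names introduced for this sketch.

```python
import numpy as np

SAMPLE_RATE = 16000
N_FFT = 1024
HOP_LENGTH = 320
WIN_LENGTH = 640
N_MEL_CHANNELS = 128
MEL_FMIN = 0.0
MEL_FMAX = SAMPLE_RATE // 2
CLIP_VALUE_MIN = 1e-5
CLIP_VALUE_MAX = 1e8


def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other conventions exist).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank():
    """Triangular mel filters, shape (N_FFT // 2 + 1, N_MEL_CHANNELS)."""
    fft_freqs = np.linspace(0.0, SAMPLE_RATE / 2, N_FFT // 2 + 1)
    edges = mel_to_hz(
        np.linspace(hz_to_mel(MEL_FMIN), hz_to_mel(MEL_FMAX), N_MEL_CHANNELS + 2)
    )
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]
    rising = (fft_freqs[:, None] - lower) / (center - lower)
    falling = (upper - fft_freqs[:, None]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def log_mel_spectrogram(samples):
    """Return a log-mel spectrogram of shape (frames, N_MEL_CHANNELS)."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.size < WIN_LENGTH:
        samples = np.pad(samples, (0, WIN_LENGTH - samples.size))
    # Slice the signal into overlapping frames without edge padding.
    n_frames = 1 + (samples.size - WIN_LENGTH) // HOP_LENGTH
    starts = HOP_LENGTH * np.arange(n_frames)
    frames = samples[starts[:, None] + np.arange(WIN_LENGTH)]
    window = np.hanning(WIN_LENGTH)  # symmetric Hann window (an assumption)
    # rfft zero-pads each windowed frame from WIN_LENGTH to N_FFT samples.
    magnitude = np.abs(np.fft.rfft(frames * window, n=N_FFT, axis=-1))
    mel = magnitude @ mel_filterbank()
    return np.log(np.clip(mel, CLIP_VALUE_MIN, CLIP_VALUE_MAX))


# One second of a 440 Hz tone as stand-in audio.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# The leading batch axis is an assumption about the decoder's input layout.
mel = log_mel_spectrogram(tone)[np.newaxis].astype(np.float32)
print(mel.shape)  # (1, 49, 128)
```

With `HOP_LENGTH = 320` at 16 kHz, each frame advances by 20 ms, so the spectrogram has roughly 50 frames per second of audio. If reconstructions sound wrong, the mel scale, log base, and padding are the first assumptions to check against the original training pipeline.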