RAVE — AEmotionStudio mirror
Curated mirror of public RAVE (Realtime Audio Variational autoEncoder) checkpoints, used by MAESTRO's RAVE Timbre Transfer panel (opt-in starter pack). Sources:
- The Intelligent-Instruments-Lab/rave-models curated set (birds, voices, organs, water, etc.).
- The official ACIDS-IRCAM public catalog, pulled from the canonical anonymous API at
https://play.forum.ircam.fr/rave-vst-api/get_available_models.
RAVE was developed by Antoine Caillon and the ACIDS team at IRCAM. Paper: arXiv:2111.05011. Upstream code: acids-ircam/RAVE.
License
CC-BY-NC-4.0 — non-commercial use only, inherited from the upstream distributions. Generated audio is fine for non-commercial use. Commercial use of the models themselves (e.g. shipping them inside a paid product) requires permission from the original authors / IRCAM.
Per MAESTRO's stance (see LICENSE_AUDIT.md and the feedback_download_on_demand_licensing
memory), these weights are fetched on demand by the end user — the user (not MAESTRO the
binary) is the licensee.
Models — IIL-curated set (b2048 streaming exports, 18 models)
Each .ts checkpoint has a <stem>.json sidecar with name, license, sample-rate, latent-dim,
source URL, and a one-line description.
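The checkpoint/sidecar pairing above can be checked with a short script. This is a minimal sketch: `load_sidecars` is a hypothetical helper name, and it deliberately assumes nothing about the JSON key names, since the exact schema is not spelled out here.

```python
import json
from pathlib import Path


def load_sidecars(model_dir):
    """Map each checkpoint stem to its parsed sidecar JSON, or None if missing."""
    result = {}
    for ts_path in sorted(Path(model_dir).glob("*.ts")):
        # <stem>.ts is expected to sit next to <stem>.json
        sidecar = ts_path.with_suffix(".json")
        if sidecar.is_file():
            with sidecar.open(encoding="utf-8") as fh:
                result[ts_path.stem] = json.load(fh)
        else:
            result[ts_path.stem] = None
    return result
```

Stems mapped to `None` point at checkpoints whose sidecar did not download or was renamed.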
Voice / speech
- voice_vocalset_b2048_r48000_z16.ts: Voice (VocalSet). Voice timbre trained on the VocalSet corpus, covering vocal techniques across multiple singers. Use for the canonical 'make this sound like a voice' transfer.
- voice-multi-b2048-r48000-z11.ts: Voice (Multi-speaker). Aggregated multi-speaker voice corpus. Wider speaker diversity than VocalSet; produces more 'average human' renders.
- voice_hifitts_b2048_r48000_z16.ts: Voice (HiFi-TTS). High-fidelity expressive English speech corpus. Cleaner, more articulate than the multi-speaker model.
- voice_jvs_b2048_r44100_z16.ts: Voice (JVS, Japanese). JVS Japanese multi-speaker corpus at 44.1 kHz. Use for Japanese-language sources or non-Latin phoneme structure.
- voice_vctk_b2048_r44100_z22.ts: Voice (VCTK, English). VCTK English multi-speaker corpus from CSTR Edinburgh, 44.1 kHz. High 22-dim latent that captures more speaker idiosyncrasies.
Bird / wildlife
- birds_motherbird_b2048_r48000_z16.ts: Birds (Motherbird). Bird-vocalization corpus: chirps + textural transients. The canonical 'weird' pick: produces wildly warped output for any arbitrary input.
- birds_dawnchorus_b2048_r48000_z8.ts: Birds (Dawn Chorus). Dense overlapping bird vocalizations recorded at dawn. Smaller 8-dim latent; outputs lean ensemble-textural over individual calls.
- birds_pluma_b2048_r48000_z12.ts: Birds (Pluma). Lighter, individual bird-call timbres. Mid-size 12-dim latent balances character + clarity.
- humpbacks_pondbrain_b2048_r48000_z20.ts: Humpback Whales. Humpback-whale song. Long, slow, hauntingly deep vocal contours; pairs well with sustained input.
- marinemammals_pondbrain_b2048_r48000_z20.ts: Marine Mammals. Mixed marine-mammal vocalizations: dolphins, orcas, sea-life clicks and cries.
Instruments
- guitar_iil_b2048_r48000_z16.ts: Guitar (IIL). Acoustic / electric guitar timbre. Good demo for transferring voice or synth input into a plucked-string voice.
- organ_bach_b2048_r48000_z16.ts: Organ (Bach). Pipe-organ timbre trained on Bach repertoire. Sustained harmonic textures; pairs well with melodic input.
- organ_archive_b2048_r48000_z16.ts: Organ (Archive). Historical pipe-organ recordings with broader, dustier textures than the Bach model. Good for film-score atmospheres.
- sax_soprano_franziskaschroeder_b2048_r48000_z20.ts: Soprano Sax (Schroeder). Soprano-saxophone extended techniques by Franziska Schroeder. Multiphonics, growls, key clicks. 20-dim latent that captures fine-grained articulation.
- mrp_strengjavera_b2048_r44100_z16.ts: Magnetic Resonator Piano (Strengjavera). Sustained metallic-string overtones produced by electromagnetically driving piano strings, 44.1 kHz.
- crozzoli_bigensemblesmusic_18d.ts: Big Ensemble Music (Crozzoli). Big-ensemble orchestral music (M. Crozzoli). Broad 18-dim latent for hugely textured renders. Sample rate not embedded in filename; defaults to 48 kHz.
Textures / environment
- water_pondbrain_b2048_r48000_z16.ts: Water (PondBrain). Water / aquatic textures. Treats any input as if it were running through liquid: bubbles, ripples, splashes.
- magnets_b2048_r48000_z8.ts: Magnets. Ferromagnetic / electromagnetic resonance textures: metallic hums, distant industrial buzz, magnetized-string ringing.
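The IIL filenames appear to encode block size (`b`), sample rate (`r`), and latent dimension (`z`), separated by underscores or hyphens. The sketch below parses that pattern. It is inferred from the names listed above, not from a documented spec; `parse_model_filename` is a hypothetical helper, and the trailing `18d` form is treated as a latent dimension only because the crozzoli entry describes an 18-dim latent.

```python
import re

# Fallback taken from the crozzoli entry ("defaults to 48 kHz").
DEFAULT_SAMPLE_RATE = 48000

# Matches tags such as _b2048, -r48000, _z16 bounded by separators or string ends.
_FIELD = re.compile(r"(?:^|[_-])([brz])(\d+)(?=[_-]|$)")
# Matches a trailing "<N>d" tag, e.g. crozzoli_bigensemblesmusic_18d.
_LEGACY_DIM = re.compile(r"(?:^|[_-])(\d+)d$")


def parse_model_filename(filename):
    """Return block size, sample rate, and latent dim guessed from a filename."""
    stem = filename[:-3] if filename.endswith(".ts") else filename
    fields = {key: int(value) for key, value in _FIELD.findall(stem)}
    latent = fields.get("z")
    if latent is None:
        legacy = _LEGACY_DIM.search(stem)
        latent = int(legacy.group(1)) if legacy else None
    return {
        "block_size": fields.get("b"),
        "sample_rate": fields.get("r", DEFAULT_SAMPLE_RATE),
        "latent_dim": latent,
    }
```

Where the sidecar JSON is available, its values should take precedence over anything guessed from the filename.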
Models — ACIDS public catalog (10 models, mirrored 2026-05-18)
Pulled from the canonical anonymous-download endpoint https://play.forum.ircam.fr/rave-vst-api/get_model/<slug>.
Each .ts has a matching <slug>.json sidecar in the same schema as the IIL set.
| Slug | Display name | Type | Author | Year | Size | Prior |
|---|---|---|---|---|---|---|
| VCTK | VCTK (English Speech) | RAVE v1 (default) | Jb Dupuy | 2022 | 177 MB | ✓ |
| darbouka_onnx | Darbouka (Percussion) | RAVE v2 (ONNX) | Antoine Caillon | 2022 | 26 MB | – |
| nasa | NASA Apollo 11 | RAVE v1 (default) | Antoine Caillon | 2022 | 159 MB | ✓ |
| percussion | Percussion (Mixed) | RAVE v1 (default) | Antoine Caillon | 2022 | 71 MB | ✓ |
| vintage | Vintage Music | RAVE v1 (large) | Antoine Caillon | 2022 | 482 MB | ✓ |
| isis | ISiS (IRCAM Vocal DB) | RAVE v2 | A. Chemla–Romeu-Santos | 2023 | 149 MB | – |
| musicnet | MusicNet (Classical) | RAVE v2 | A. Chemla–Romeu-Santos | 2023 | 237 MB | ✓ |
| sol_ordinario | Studio OnLine (Ordinario) | RAVE v2 | A. Chemla–Romeu-Santos | 2023 | 149 MB | – |
| sol_full | Studio OnLine (Full) | RAVE v2 | A. Chemla–Romeu-Santos | 2023 | 149 MB | – |
| sol_ordinario_fast | Studio OnLine (Ordinario, fast) | RAVE v2 (small) | A. Chemla–Romeu-Santos | 2023 | 43 MB | – |
ACIDS set total: ~1.6 GB across 10 models.
Note: VCTK.ts (ACIDS v1, 48 kHz, original 2022 release) and voice_vctk_b2048_r44100_z22.ts (IIL v2 retrain, 44.1 kHz) are different models trained on the same source corpus; keep both for comparison.
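For reference, fetching an ACIDS model directly from the endpoint named in this section might look like the sketch below. It assumes that a plain anonymous GET on `get_model/<slug>` returns the raw model bytes, which is not confirmed here. `model_url` and `download_model` are hypothetical helper names, and saving as `<slug>.ts` simply follows the naming used in this mirror.

```python
import shutil
import urllib.parse
import urllib.request
from pathlib import Path

API_ROOT = "https://play.forum.ircam.fr/rave-vst-api"


def model_url(slug):
    """Build the download URL for a catalog slug such as 'vintage'."""
    return f"{API_ROOT}/get_model/{urllib.parse.quote(slug, safe='')}"


def download_model(slug, dest_dir="."):
    """Stream the model to <dest_dir>/<slug>.ts and return the path.

    Assumption: the endpoint answers a plain GET with the file contents.
    """
    dest = Path(dest_dir) / f"{slug}.ts"
    with urllib.request.urlopen(model_url(slug)) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Streaming with `shutil.copyfileobj` avoids holding the larger checkpoints (up to 482 MB in the table above) in memory.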
File format
Each *.ts file is a TorchScript export of a RAVE model in streaming mode (causal convolutions, cached state), ready for realtime or offline inference.
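```python
import torch

model = torch.jit.load("vintage.ts").eval()
# Placeholder input of shape (B, 1, T); replace with real mono audio
# at the model's sample rate
audio = torch.zeros(1, 1, 2048 * 32)
with torch.no_grad():
    # Encode (B, 1, T) → latents
    z = model.encode(audio)
    # Decode latents → audio
    y = model.decode(z)
```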
Models marked ✓ in the Prior column of the ACIDS table additionally ship a learned prior that can generate latents autoregressively (see the RAVE repo for usage).
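When planning realtime use, it can help to estimate how much audio one processing block covers. The sketch below assumes the `b2048` tag in the IIL filenames denotes a 2048-sample block and that input is padded up to a whole number of blocks. Neither point is stated outright in this README, so treat the numbers as rough guidance; `block_stats` is a hypothetical helper.

```python
import math


def block_stats(num_samples, sample_rate, block_size=2048):
    """Estimate block count, padding, and per-block duration for a clip."""
    blocks = math.ceil(num_samples / block_size)
    padded = blocks * block_size
    return {
        "blocks": blocks,
        "padded_samples": padded,
        "padding": padded - num_samples,
        # Duration of one block in milliseconds at the given sample rate
        "block_ms": 1000.0 * block_size / sample_rate,
    }
```

Under these assumptions, one second of 48 kHz audio spans 24 blocks of roughly 42.7 ms each.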
Where to find more RAVE models
- Neutone FX models: community + curated .nm files (the Neutone wrapper format).
- IRCAM Forum projects: individual user-submitted models; many require a Forum account.
- acids-ircam GitHub releases: reference checkpoints from the maintainers.
- IRCAM RAVE Model Challenge 2025: 11 prize-winner / submission models gated behind a Forum account.
Citation
@article{caillon2021rave,
title={RAVE: A variational autoencoder for fast and high-quality neural audio synthesis},
author={Caillon, Antoine and Esling, Philippe},
journal={arXiv preprint arXiv:2111.05011},
year={2021}
}