Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation Paper • 2211.06687 • Published Nov 12, 2022 • 3
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning Paper • 2401.17690 • Published Jan 31, 2024 • 5
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit Paper • 2312.09911 • Published Dec 15, 2023 • 53
Audiobox: Unified Audio Generation with Natural Language Prompts Paper • 2312.15821 • Published Dec 25, 2023 • 13
Masked Audio Text Encoders are Effective Multi-Modal Rescorers Paper • 2305.07677 • Published May 11, 2023 • 2
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations Paper • 2401.01885 • Published Jan 3, 2024 • 27
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts Paper • 2105.03036 • Published May 7, 2021 • 2
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing Paper • 2310.12404 • Published Oct 19, 2023 • 15
MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models Paper • 2310.11954 • Published Oct 18, 2023 • 25
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models Paper • 2311.07919 • Published Nov 14, 2023 • 9
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models Paper • 2306.07691 • Published Jun 13, 2023 • 4
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data Paper • 2402.08093 • Published Feb 12, 2024 • 57
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models Paper • 2403.14438 • Published Mar 21, 2024 • 2
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations Paper • 2308.11466 • Published Aug 22, 2023 • 1
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training Paper • 2108.06209 • Published Aug 7, 2021 • 1