MiloMusic_YuEGP / README.md
futurespyhi
Complete MiloMusic implementation with voice-to-song generation
658e790

A newer version of the Gradio SDK is available: 5.49.1

Upgrade
metadata
title: MiloMusic - AI Music Generation
emoji: 🎡
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.25.0
app_file: app.py
pinned: false
python_version: '3.10'
license: mit
short_description: AI-powered voice-to-song generation using YuE model

MiloMusic 🎡 - Hugging Face Spaces

License: BSD

πŸ¦™ AI-Powered Music Creation for Everyone

MiloMusic is an innovative platform that leverages multiple AI models to democratize music creation. Whether you're a seasoned musician or have zero musical training, MiloMusic enables you to create high-quality, lyrics-focused music through natural language conversation.

A platform for everyone - regardless of musical training at the intersection of AI and creative expression.

πŸš€ Features

  • Natural Language Interface - Just start talking to generate song lyrics
  • Genre & Mood Selection - Customize your music with different genres and moods
  • Iterative Creation Process - Refine your lyrics through conversation
  • High-Quality Music Generation - Transform lyrics into professional-sounding music
  • User-Friendly Interface - Intuitive UI built with Gradio

πŸ”§ Architecture

MiloMusic employs a sophisticated multi-model pipeline to deliver a seamless music creation experience:

Phase 1: Lyrics Generation

  1. Speech-to-Text - User voice input is transcribed using whisper-large-v3-turbo (via Groq API)
  2. Conversation & Refinement - llama-4-scout-17b-16e-instruct handles the creative conversation, generates lyrics based on user requests, and allows for iterative refinement

Phase 2: Music Generation

  1. Lyrics Structuring - Gemini flash 2.0 processes the conversation history and structures the final lyrics for music generation
  2. Music Synthesis - YuE (乐) transforms the structured lyrics into complete songs with vocals and instrumentation

πŸ’» Technical Stack

  • LLM Models:
    • whisper-large-v3-turbo (via Groq) - For speech-to-text conversion
    • llama-4-scout-17b-16e-instruct - For creative conversation and lyrics generation
    • Gemini flash 2.0 - For lyrics structuring
    • YuE - For music generation
  • UI: Gradio 5.25.0
  • Backend: Python 3.10
  • Deployment: Hugging Face Spaces with GPU support

πŸ“‹ System Requirements

  • Python: 3.10 (strict requirement for YuE model compatibility)
  • CUDA: 12.4+ for GPU acceleration
  • Memory: 32GB+ RAM for model operations
  • GPU: A10G/T4 or better with 24GB+ VRAM

πŸ” Usage

Using the Interface:

  1. Select your genre, mood, and theme preferences
  2. Start talking about your song ideas
  3. The assistant will create lyrics based on your selections
  4. Give feedback to refine the lyrics
  5. When you're happy with the lyrics, click "Generate Music from Lyrics"
  6. Listen to your generated song!

πŸ”¬ Performance

Music generation typically takes:

  • GPU-accelerated: ~5-10 minutes per song
  • Quality: Professional-grade vocals and instrumentation
  • Format: High-quality audio output

πŸ› οΈ Development Notes

Spaces-Specific Configuration:

  • Custom PyTorch build with CUDA 12.4 support
  • Flash Attention compiled from source for optimal performance
  • Specialized audio processing pipeline for cloud deployment

Key Components:

  • requirements_space.txt - Dependencies with CUDA-specific PyTorch
  • packages.txt - System packages for audio and compilation
  • Pre-build flash-attn installation for compatibility

🚨 Important Notes

  • First run may take longer as models are downloaded and cached
  • Flash Attention compilation happens during startup (may take 10-15 minutes on first build)
  • Memory usage is high during music generation - please be patient

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request to the main repository.

πŸ‘₯ Team

  • Norton Gu
  • Anakin Huang
  • Erik Wasmosy

πŸ“ License

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.


Made with ❀️ and πŸ¦™ (LLaMA) | Deployed on πŸ€— Hugging Face Spaces