metadata
license: apache-2.0
language:
  - en
tags:
  - audio-text-to-text
  - chat
  - audio
  - GGUF

OmniAudio-2.6B

OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment: a 2.6B-parameter multimodal model that processes both text and audio inputs. It integrates three components (Gemma-2-2b, Whisper turbo, and a custom projector module), enabling secure, responsive audio-text processing directly on edge devices. Unlike traditional approaches that chain separate ASR and LLM models together, OmniAudio-2.6B unifies both capabilities in a single efficient architecture for minimal latency and resource overhead.
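To make the single-model idea concrete, here is a toy sketch of how a projector can bridge an audio encoder and a language model. Everything in it is an assumption for illustration: the tiny dimensions, the purely linear projector, and the function names `project` and `build_llm_inputs` are hypothetical. This README does not specify how OmniAudio's projector module is actually built.

```python
# Toy illustration only: dimensions, weights, and names are hypothetical
# and do not describe OmniAudio's real projector module.
from typing import List

Vector = List[float]

AUDIO_DIM, LLM_DIM = 3, 2  # stand-ins for the encoder and LLM embedding sizes


def project(frame: Vector, weights: List[Vector]) -> Vector:
    """Map one audio-encoder frame into the LLM embedding space (linear map)."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def build_llm_inputs(
    audio_frames: List[Vector],
    text_embeddings: List[Vector],
    weights: List[Vector],
) -> List[Vector]:
    """Project every audio frame, then place the result before the text embeddings."""
    projected = [project(frame, weights) for frame in audio_frames]
    return projected + text_embeddings


weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]          # shape (LLM_DIM, AUDIO_DIM)
audio_frames = [[0.5, 1.0, 2.0], [1.0, 0.0, -1.0]]    # two encoder frames
text_embeddings = [[0.1, 0.2]]                        # one text token embedding

inputs = build_llm_inputs(audio_frames, text_embeddings, weights)
print(inputs)
```

The point of the sketch is that the language model receives one sequence containing both projected audio frames and text embeddings, so no intermediate transcript has to be produced by a separate ASR step.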

On a 2024 Mac Mini M4 Pro running Q4_K_M quantized GGUF models, Qwen2-Audio-7B processes 1.69 tokens/second while OmniAudio-2.6B achieves 4.97 tokens/second, roughly 2.9x the throughput on consumer hardware.
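The ratio behind that comparison can be checked with a few lines of arithmetic. The two throughput figures come from the paragraph above; the 100-token response length is a hypothetical value chosen only to show what the difference means in wall-clock time.

```python
# Throughput figures quoted in this README (tokens/second).
QWEN2_AUDIO_7B_TPS = 1.69
OMNIAUDIO_2_6B_TPS = 4.97


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """Return how many times faster the candidate generates tokens."""
    return candidate_tps / baseline_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Return the time needed to generate `tokens` at a given throughput."""
    return tokens / tps


ratio = speedup(OMNIAUDIO_2_6B_TPS, QWEN2_AUDIO_7B_TPS)
print(f"speedup: {ratio:.2f}x")

# Hypothetical 100-token response, for illustration only.
RESPONSE_TOKENS = 100
print(f"OmniAudio-2.6B: {seconds_for(RESPONSE_TOKENS, OMNIAUDIO_2_6B_TPS):.1f} s")
print(f"Qwen2-Audio-7B: {seconds_for(RESPONSE_TOKENS, QWEN2_AUDIO_7B_TPS):.1f} s")
```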

Quick Links

  1. Interactive Demo in our HuggingFace Space.
  2. Quickstart for local setup
  3. Learn more in our Blogs

Use Cases

  • Voice QA without Internet: Process offline voice queries like "I am camping, how do I start a fire without a fire starter?" OmniAudio provides practical guidance even without network connectivity.
  • Voice-in Conversation: Have conversations about personal experiences. When you say "I am having a rough day at work," OmniAudio engages in supportive talk and active listening.
  • Creative Content Generation: Transform voice prompts into creative pieces. Ask "Write a haiku about autumn leaves" and receive poetic responses inspired by your voice input.
  • Recording Summary: Simply ask "Can you summarize this meeting note?" to convert lengthy recordings into concise, actionable summaries.
  • Voice Tone Modification: Transform casual voice memos into professional communications. When you request "Can you make this voice memo more professional?" OmniAudio adjusts the tone while preserving the core message.

Run OmniAudio-2.6B on Your Device

Step 1: Install Nexa-SDK (local on-device inference framework)

Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It is installable via Python package or executable installer.

Step 2: Run the following command in your terminal

nexa run omniaudio -st

💻 OmniAudio-2.6B q4_K_M version requires 1.30GB RAM and 1.60GB storage space.
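Before launching, it can help to confirm that the machine meets those requirements. The following is a minimal sketch, not part of Nexa-SDK: the helper names are hypothetical, "GB" is assumed to mean 10^9 bytes, and free RAM is passed in by the caller because the Python standard library has no portable way to query it. The only command it references is the one shown in Step 2.

```python
import shutil

# Requirements stated in this README for the q4_K_M version.
REQUIRED_RAM_GB = 1.30
REQUIRED_STORAGE_GB = 1.60


def free_disk_gb(path: str = ".") -> float:
    """Free disk space at `path`, assuming 1 GB = 10**9 bytes."""
    return shutil.disk_usage(path).free / 1e9


def can_run(free_ram_gb: float, free_storage_gb: float) -> bool:
    """Check both stated requirements; free RAM must be supplied by the caller."""
    return free_ram_gb >= REQUIRED_RAM_GB and free_storage_gb >= REQUIRED_STORAGE_GB


def launch_command() -> list:
    """The Step 2 command, split into arguments suitable for subprocess.run."""
    return ["nexa", "run", "omniaudio", "-st"]


print(f"free disk: {free_disk_gb():.1f} GB")
print("nexa found on PATH:", shutil.which("nexa") is not None)
print("command:", " ".join(launch_command()))
```

If both checks pass and `nexa` is on the PATH, the argument list can be handed to `subprocess.run` or simply typed into the terminal as in Step 2.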

Training

We developed OmniAudio through a three-stage training pipeline:

  1. Pretraining: The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduced a special <|transcribe|> token so the model can distinguish transcription tasks from completion tasks, ensuring consistent performance across use cases.
  2. Supervised Fine-tuning (SFT): We enhance the model's conversation capabilities using synthetic datasets derived from the MLS English 10k transcriptions. This stage uses a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.
  3. Direct Preference Optimization (DPO): The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally use Gemma2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.
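For readers unfamiliar with the DPO stage, the standard per-pair DPO objective from the literature can be sketched as below. This is a generic illustration, not OmniAudio's training code: the README does not state the exact loss variant, the value of beta, or how preference pairs were built, so the function name, the default `beta=0.1`, and the sample log-probabilities are all assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from this README
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or a frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy favors the chosen response more than the reference does: lower loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)

print(f"neutral:  {neutral:.4f}")
print(f"improved: {improved:.4f}")
```

The loss falls as the trained model raises the likelihood of the preferred response relative to the rejected one, measured against the reference model.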

What's Next for OmniAudio?

OmniAudio is in active development and we are working to advance its capabilities:

  • Building direct audio generation for two-way voice communication
  • Implementing function calling support via Octopus_v2 integration

In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.