# Qwen2-Audio
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/ThcKJj7LcWCZPwN1So05f.png" alt="Example" style="width:700px;"/>
Qwen2-Audio is a state-of-the-art (SOTA) small-scale multimodal model that handles audio and text inputs, allowing voice interactions without a separate ASR module. Qwen2-Audio supports English, Chinese, and major European languages, and also provides robust audio analysis for local use cases like:
- Speaker identification and response
- Speech translation and transcription
- Mixed audio and noise detection
- Music and sound analysis
## We're bringing Qwen2-Audio to edge devices with Nexa SDK, offering various quantization options.
- Voice Chat: Users can freely engage in voice interactions with Qwen2-Audio without text input.
- Audio Analysis: Users can provide both audio and text instructions for analysis during the interaction.
### Demo
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/02XDwJe3bhZHYptor-b2_.mp4"></video>
## How to Run Locally On-Device
In the following, we demonstrate how to run Qwen2-Audio locally on your device.
**Step 1: Install Nexa-SDK (local on-device inference framework)**
[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)
> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via the Python package or the executable installer.
**Step 2: Run the following command in your terminal to launch the local Streamlit UI**
```bash
nexa run qwen2audio -st
```
**Or, to use it directly in the terminal**:
```bash
nexa run qwen2audio
```
### Usage Instructions
For terminal:
1. Drag and drop your audio file into the terminal (or enter the file path on Linux)
2. Add a text prompt to guide the analysis, or leave it empty for direct voice input
### System Requirements
💻 **RAM Requirements**:
- The default q4_K_M version requires 4.2 GB of RAM
- Check the RAM requirements table for different quantization versions
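
As a rough pre-flight check against the RAM figure above, the following Python sketch compares the machine's installed memory with the 4.2 GB quoted for the default q4_K_M version. It is not part of Nexa SDK: the helper names `total_ram_bytes` and `has_enough_ram` are hypothetical, the "GB" is read as 10^9 bytes (an assumption), and `os.sysconf` is only available on POSIX systems such as Linux.

```python
import os

REQUIRED_GB = 4.2  # figure quoted above for the default q4_K_M version


def total_ram_bytes() -> int:
    """Return installed physical memory in bytes (POSIX only; not available on Windows)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def has_enough_ram(required_gb: float, total_bytes: int) -> bool:
    """Return True if total_bytes covers required_gb, reading 1 GB as 10**9 bytes (assumption)."""
    return total_bytes >= required_gb * 10**9
```

Calling `has_enough_ram(REQUIRED_GB, total_ram_bytes())` gives an optimistic answer, since it looks at installed memory rather than memory that is currently free.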
🎵 **Audio Format**:
- Optimal: 16kHz `.wav` format
- Other formats and sample rates are supported with automatic conversion
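
To see whether a file already matches the optimal format listed above, a minimal Python sketch using only the standard-library `wave` module can read the sample rate from the WAV header. The function names `wav_sample_rate` and `is_optimal_wav` are illustrative and not part of Nexa SDK, and the check only covers uncompressed PCM `.wav` files that `wave` can open.

```python
import wave

TARGET_RATE_HZ = 16_000  # the optimal sample rate named above


def wav_sample_rate(path: str) -> int:
    """Return the sample rate in Hz stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_optimal_wav(path: str) -> bool:
    """Return True if path is a .wav file sampled at 16 kHz."""
    return path.lower().endswith(".wav") and wav_sample_rate(path) == TARGET_RATE_HZ
```

Files that fail this check are not necessarily unusable, since other formats and sample rates are converted automatically; the check only indicates whether a conversion step will be needed.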
## Use Cases
### Voice Chat
- Answer daily questions
- Offer suggestions
- Speaker identification and response
- Speech translation
- Detecting background noise and responding accordingly
### Audio Analysis
- Information Extraction
- Audio summary
- Speech Transcription and Expansion
- Mixed audio and noise detection
- Music and sound analysis
## Performance Benchmark
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/lax8bLpR5uK2_Za0G6G3j.png" alt="Example" style="width:700px;"/>
Results demonstrate that Qwen2-Audio significantly outperforms both previous SOTA models and Qwen-Audio across all tasks.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/2vACK_gD_MAuZ7Hn4Yfiv.png" alt="Example" style="width:700px;"/>
To learn more about Qwen2-Audio's capabilities, please refer to the Qwen team's [Blog], [GitHub], and [Report].
## Follow Nexa AI to run more models on-device
[Website](https://nexa.ai/)