---
language:
  - en
tags:
  - audio-text-to-audio-text
  - speech-understanding
  - audio
  - chat
license: apache-2.0
datasets:
  - custom
metrics:
  - wer
  - bleu
  - AIR-Bench
---

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

πŸˆβ€β¬› Github ο½œ  πŸ“ƒ Paper ο½œ  πŸš€ Space (8B) ο½œ  πŸ“Š EchoX-Dialougues ο½œ  πŸ“Š EchoX-Dialogues-Plus

Model Description

EchoX is a speech-to-speech large language model designed to close the acoustic-semantic gap; this repository hosts the 3B version. Through Echo Training, EchoX couples semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. Trained on only about 10k hours of data, it delivers state-of-the-art results on knowledge-based question answering and speech-interaction tasks.

Key Features

  • Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs
  • Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)
  • Trained on Only 10k Hours of Curated Data, Ensuring Efficiency
  • Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks
  • Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks
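The three-stage pipeline named above (S2T, T2C, Echo) can be sketched as a data flow. Everything below is an illustrative reconstruction from the stage names alone; the function names, placeholder tokens, and codec vocabulary size are assumptions, not the released training code:

```python
# Illustrative sketch of the three-stage Echo Training pipeline.
# All names and values here are hypothetical; see the paper and the
# GitHub repository for the actual design.

def s2t(speech_features):
    """Stage 1 (S2T): map speech input to a text response (semantic learning)."""
    return ["hello", "world"]  # placeholder text tokens

def t2c(text_tokens):
    """Stage 2 (T2C): map text to acoustic codec tokens (acoustic learning)."""
    return [hash(t) % 1024 for t in text_tokens]  # placeholder codec ids

def echo(speech_features):
    """Stage 3 (Echo): 'echo' the S2T output through the T2C path,
    linking the semantic and acoustic objectives end to end."""
    text_tokens = s2t(speech_features)
    codec_tokens = t2c(text_tokens)
    return text_tokens, codec_tokens

text, codec = echo(speech_features=[0.1, 0.2, 0.3])
```

The point of the Echo stage is that the speech-token targets are generated from the model's own textual response, so the acoustic objective stays aligned with the semantic one.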

Usage

Follow the instructions in the GitHub repository to load the EchoX-3B checkpoint and run inference on your own audio files.
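A minimal sketch of the intended call pattern (speech in, text and speech out). The class name, method names, and checkpoint id below are placeholders, not the real API, which lives in the GitHub repository; a stand-in class is defined here so the snippet is self-contained:

```python
# Hypothetical call pattern for EchoX inference. "EchoXModel", its methods,
# and the checkpoint id are placeholders -- consult the GitHub repo for the
# actual loader and API.

class EchoXModel:
    """Stand-in for the model class shipped with the EchoX repository."""

    @classmethod
    def from_pretrained(cls, checkpoint: str):
        # the real loader would download/load the weights for `checkpoint`
        return cls()

    def chat(self, audio_path: str):
        # the real model maps speech input to a (text reply, speech reply) pair
        return "placeholder text reply", b"placeholder-audio-bytes"

model = EchoXModel.from_pretrained("EchoX-3B")  # placeholder checkpoint id
text, audio = model.chat("question.wav")        # path is illustrative
print(text)
```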

📖 Citation

@misc{zhang2025echoxmitigatingacousticsemanticgap,
      title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs}, 
      author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
      year={2025},
      eprint={2509.09174},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.09174}, 
}