English
French

NAST-S2X: A Fast and End-to-End Simultaneous Speech-to-Any Translation Model

Features

  • 🤖 An end-to-end model without intermediate text decoding
  • 💪 Supports offline and streaming decoding of all modalities
  • ⚡️ 28× faster inference compared to autoregressive models

Examples

We present an example of French-to-English translation with chunk sizes of 320 ms and 2560 ms, and in the offline condition.

  • With chunk sizes of 320 ms and 2560 ms, the model starts generating the English translation before the source speech is complete.
  • In the examples of simultaneous interpretation, the left audio channel is the input streaming speech, and the right audio channel is the simultaneous translation.

    For a better experience, please wear headphones.
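If headphones are not at hand, the two channels of a simultaneous-interpretation example can also be separated and played one at a time. The sketch below is a minimal illustration using only the Python standard library. It assumes the example has been saved as an uncompressed PCM WAV file; the function names and file paths are hypothetical and not part of the NAST-S2X code.

```python
import wave


def split_stereo(frames: bytes, sample_width: int) -> tuple[bytes, bytes]:
    """Split interleaved stereo PCM data into (left, right) byte strings."""
    # In a stereo WAV file each frame is one left sample followed by one right sample.
    frame_size = 2 * sample_width
    if len(frames) % frame_size != 0:
        raise ValueError("data is not a whole number of stereo frames")
    left = bytearray()
    right = bytearray()
    for start in range(0, len(frames), frame_size):
        left += frames[start:start + sample_width]
        right += frames[start + sample_width:start + frame_size]
    return bytes(left), bytes(right)


def split_stereo_file(path: str, left_path: str, right_path: str) -> None:
    """Write the left and right channels of a stereo WAV file to two mono files."""
    with wave.open(path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        width = src.getsampwidth()
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    left, right = split_stereo(frames, width)
    for out_path, data in ((left_path, left), (right_path, right)):
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(data)
```

Under these assumptions, `split_stereo_file("example.wav", "source.wav", "translation.wav")` would write the input speech (left channel) and the simultaneous translation (right channel) to separate files.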

[Audio samples: Chunk Size 320ms | Chunk Size 2560ms | Offline]

Source Speech Transcript: Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné.
Reference Text Translation: before the fusion of the towns rouge thier was a part of the town of louveigne

For more examples, please check https://nast-s2x.github.io/.

Performance

  • ⚡️ Lightning Fast: 28× faster inference and competitive quality in offline speech-to-speech translation
  • 👩‍💼 Simultaneous: Achieves high-quality simultaneous interpretation within a delay of less than 3 seconds
  • 🤖 Unified Framework: Supports end-to-end text and speech generation in one model

[Figures: Offline-S2S, Simul-S2S, and Simul-S2T results]

Architecture

  • Fully Non-autoregressive: Trained with CTC-based non-monotonic latent alignment loss (Shao and Feng, 2022) and glancing mechanism (Qian et al., 2021).
  • Minimum Human Design: Seamlessly switch between offline translation and simultaneous interpretation by adjusting the chunk size.
  • End-to-End: Generate target speech without target text decoding.
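To make the CTC-based output concrete, here is a minimal sketch of the standard CTC collapsing rule: merge consecutive repeated labels, then drop blanks. In a non-autoregressive model the per-frame labels are predicted in parallel and collapsed afterwards. This illustrates the general technique only and is not the NAST-S2X implementation; the blank index and the example token ids are assumptions.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_collapse(frame_labels: list[int], blank: int = BLANK) -> list[int]:
    """Apply the standard CTC collapsing rule to a sequence of frame labels."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blank symbols. A blank between two equal labels
    # keeps them as two separate output tokens.
    return [label for label in merged if label != blank]


# Hypothetical frame-level predictions, all produced in a single parallel pass.
print(ctc_collapse([0, 5, 5, 0, 5, 7, 7, 0]))  # prints [5, 5, 7]
```

Because the collapsed length is decided by the predictions themselves, the output sequence can be shorter than the number of frames without any left-to-right decoding loop.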

Sources and Usage

Model

We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce results in our paper. You can train models in your desired languages by following the instructions provided below.

🤗 Model card

Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging
320ms      | checkpoint | 19.67    | 24.90                      | -393ms
1280ms     | checkpoint | 20.20    | 25.71                      | 3330ms
2560ms     | checkpoint | 24.88    | 26.14                      | 4976ms
Offline    | checkpoint | 25.82    | -                          | -

Vocoder

checkpoint

Inference

Before running any of the provided shell scripts, replace the variables in each script with the paths specific to your machine.

Offline Inference

Simultaneous Inference

  • We use our customized fork of SimulEval: b43a7c to evaluate the model in simultaneous inference. This repository is built upon the official SimulEval: a1435b and includes additional latency scorers.
  • Data preprocessing: Follow the instructions in the document.
  • Streaming Generation and Evaluation: Execute streaming_infer.sh
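As a rough illustration of what the chunk size means for streaming input, the sketch below cuts a waveform into fixed-duration chunks and treats the offline case as a single chunk covering the whole utterance. It is not the logic used by SimulEval or streaming_infer.sh; the function name `iter_chunks` and the 16 kHz sample rate are assumptions made for the example.

```python
from typing import Iterator, Optional, Sequence


def iter_chunks(
    samples: Sequence[float],
    sample_rate: int,
    chunk_ms: Optional[int],
) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks of `chunk_ms` milliseconds from `samples`.

    Passing chunk_ms=None stands for the offline setting: the whole
    utterance is yielded as one chunk.
    """
    if chunk_ms is None:
        yield samples
        return
    chunk_len = sample_rate * chunk_ms // 1000  # samples per chunk
    if chunk_len <= 0:
        raise ValueError("chunk size is too small for this sample rate")
    for start in range(0, len(samples), chunk_len):
        # The last chunk may be shorter than chunk_len.
        yield samples[start:start + chunk_len]


# One second of silence at an assumed 16 kHz sample rate.
audio = [0.0] * 16000
for size in (320, 2560, None):
    lengths = [len(chunk) for chunk in iter_chunks(audio, 16000, size)]
    print(size, lengths)
```

With a 320 ms chunk size the one-second input arrives in four pieces, so a model can begin producing output after the first piece. With 2560 ms or the offline setting the same input is a single chunk.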

Train your own NAST-S2X

Citing

Please cite us if you find our papers or code useful.

@inproceedings{ma2024nonautoregressive,
title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
booktitle={Proceedings of ACL 2024},
year={2024},
}

@inproceedings{fang2024ctcs2ut,
title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
booktitle={Findings of ACL 2024},
year={2024},
}