arxiv:2407.03169

Investigating Decoder-only Large Language Models for Speech-to-text Translation

Published on Jul 3
· Submitted by chaoweihuang on Jul 4

Abstract

Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs into the task of speech-to-text translation (S2TT). We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation. Additionally, we investigate the effects of different parameter-efficient fine-tuning techniques and task formulations. Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data. We also conduct analyses to validate the design choices of our proposed model and provide insights into the integration of LLMs into S2TT.
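
As a rough illustration of the proposed setup, the sketch below projects speech-encoder features into the LLM's embedding space and prepends them to the text embeddings, so a decoder-only LLM can consume the speech directly through self-attention. The module names, dimensions, and Hugging Face-style interface (`get_input_embeddings`, `inputs_embeds`) are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch of the decoder-only setup described in the abstract, assuming
# a generic frame-level speech encoder and a Hugging Face-style decoder-only LLM
# (e.g. LLaMA-2). Names and dimensions are illustrative, not the paper's code.
import torch
import torch.nn as nn


class SpeechToTextLLM(nn.Module):
    def __init__(self, speech_encoder: nn.Module, llm: nn.Module,
                 speech_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.speech_encoder = speech_encoder             # produces frame-level speech features
        self.projector = nn.Linear(speech_dim, llm_dim)  # maps speech features into the LLM embedding space
        self.llm = llm                                   # decoder-only LLM (e.g. LLaMA-2)

    def forward(self, speech: torch.Tensor, text_ids: torch.Tensor):
        # Encode the speech and project it into the LLM's embedding space.
        speech_feats = self.speech_encoder(speech)        # (B, T_speech, speech_dim)
        speech_embeds = self.projector(speech_feats)      # (B, T_speech, llm_dim)
        # Embed the prompt/target tokens with the LLM's own embedding table.
        text_embeds = self.llm.get_input_embeddings()(text_ids)  # (B, T_text, llm_dim)
        # Prepend the speech embeddings so the decoder-only LLM attends to them
        # directly via self-attention; no cross-attention layers are added.
        inputs = torch.cat([speech_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)             # standard next-token prediction output
```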

Community

Paper author · Paper submitter

In this paper, we investigate design choices for LLM-based speech-to-text translation (S2TT). Our architecture achieves state-of-the-art performance on CoVoST 2 among models trained only on public S2TT datasets. The key findings are:

  • We showed that the decoder-only architecture outperforms the encoder-decoder architecture when using a decoder-only LLM (LLaMA-2). Our hypothesis is that the newly initialized cross-attention layers in the encoder-decoder architecture make training harder.
  • We demonstrated that LNA fine-tuning, where we fine-tune only the attention and LayerNorm layers, significantly outperforms LoRA (see the sketch after this list).
  • Fine-tuning the parameters of the speech encoder jointly with the text LLM is crucial for good performance. This suggests that discrete-token-based speech LLMs might be harder to train well, since the speech encoder also needs to be updated.
  • Incorporating different training formulations and instructions can further boost performance.
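
Below is a minimal sketch of what LNA fine-tuning combined with joint speech-encoder training could look like, assuming the hypothetical `SpeechToTextLLM` module from the earlier sketch and Hugging Face LLaMA parameter naming; the paper's exact parameter selection may differ.

```python
# A rough sketch of LNA fine-tuning under the same assumptions as the sketch
# above: freeze the LLM, then re-enable gradients only for its self-attention
# and LayerNorm parameters, while keeping the speech encoder and the projection
# trainable (the third finding). The "self_attn" / "norm" name patterns follow
# Hugging Face's LLaMA naming and are an assumption, not the paper's exact code.
def apply_lna_finetuning(model: SpeechToTextLLM) -> None:
    # Freeze all LLM parameters first.
    for p in model.llm.parameters():
        p.requires_grad = False
    # Unfreeze attention and LayerNorm (RMSNorm in LLaMA-2) weights.
    for name, p in model.llm.named_parameters():
        if "self_attn" in name or "norm" in name:
            p.requires_grad = True
    # Keep the speech encoder and projection trainable: fine-tuning the speech
    # encoder along with the text LLM was found to be crucial for performance.
    for p in model.speech_encoder.parameters():
        p.requires_grad = True
    for p in model.projector.parameters():
        p.requires_grad = True
```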

This work was my internship project at Meta AI (FAIR).
