Model information

SpeechLMM 1.0 is a collection of multimodal and multilingual instruction-tuned generative large language models in 4 different sizes: S (2B), M (4B), L (9B) and XL (71B), supporting text, audio and video as input and only text as output. The SpeechLMM 1.0 models are optimized for various X-to-text generation tasks, namely:

  • Machine Translation
  • Automatic Speech Recognition
  • Speech Translation
  • Speech Summarization
  • Spoken Question Answering
  • Spoken Language Understanding (beta)
  • Visual Speech Recognition (beta)

Model Developer: Meetween consortium

Supported Languages: English, French, Italian, German, and Spanish are officially supported (each for a subset of the tasks listed above). The Llama 3.X backbone and the SeamlessM4T v2 audio encoder were trained on a broader collection of languages than these 5, so the model might exhibit good performance on other languages too.

Model Release Date: Feb 28, 2025

License: see LICENSE

Model Architecture

SpeechLMM 1.0 is an auto-regressive multimodal language model based on a Llama 3.X backbone (X varies with the model size), a speech-specific stack consisting of a pre-trained audio encoder (SeamlessM4T v2) and an audio adapter, and a video-specific stack consisting of a pre-trained video encoder (Auto-AVSR) and a video adapter.
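The high-level composition can be sketched as follows. This is a minimal, illustrative PyTorch sketch, not the actual speechlmm implementation; all class and argument names are hypothetical:

```python
import torch
import torch.nn as nn

class SpeechLMMSketch(nn.Module):
    """Illustrative composition of SpeechLMM 1.0 (names are hypothetical,
    not the speechlmm API)."""

    def __init__(self, audio_encoder, audio_adapter,
                 video_encoder, video_adapter, llm):
        super().__init__()
        self.audio_encoder = audio_encoder  # pre-trained SeamlessM4T v2
        self.audio_adapter = audio_adapter  # maps audio features to LLM dim
        self.video_encoder = video_encoder  # pre-trained Auto-AVSR
        self.video_adapter = video_adapter  # maps video features to LLM dim
        self.llm = llm                      # Llama 3.X backbone

    def forward(self, text_embeds, audio=None, video=None):
        parts = []
        if audio is not None:
            parts.append(self.audio_adapter(self.audio_encoder(audio)))
        if video is not None:
            parts.append(self.video_adapter(self.video_encoder(video)))
        parts.append(text_embeds)
        # Adapted audio/video tokens are concatenated with the text
        # embeddings and decoded auto-regressively into text.
        inputs = torch.cat(parts, dim=1)
        return self.llm(inputs_embeds=inputs)
```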

| Model | Params | Input modalities | Output modalities | Context Length |
| --- | --- | --- | --- | --- |
| SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual text | 128k |
| SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual text | 128k |
| SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual text | 128k |
| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual text | 128k |

Audio and video encoders

For all 4 sizes of SpeechLMM 1.0, the audio encoder is SeamlessM4T v2 Large (facebook/seamless-m4t-v2-large) and the video encoder is Auto-AVSR (vsr_trlrs3vox2_base).
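For reference, the public SeamlessM4T v2 checkpoint can be inspected with the Hugging Face transformers library. The snippet below only illustrates where the speech encoder sub-module lives; the speechlmm codebase has its own download procedure, described in its README:

```python
from transformers import AutoProcessor, SeamlessM4Tv2Model

# Load the public SeamlessM4T v2 Large checkpoint from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# SpeechLMM 1.0 builds on the speech encoder sub-module of this model
speech_encoder = model.speech_encoder
print(sum(p.numel() for p in speech_encoder.parameters()))  # encoder size
```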

Audio and video adapters

For all 4 sizes of SpeechLMM 1.0, the audio and video adapters are:

| Modality | Architecture | Number of layers | Compression factor |
| --- | --- | --- | --- |
| Audio | MLP | 4 | 1 |
| Video | Window-level Q-Former (4 queries) | 4 | 4 |
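To make the table concrete, below is an illustrative sketch of a window-level Q-Former adapter. It is not the speechlmm implementation: the window size of 16 is an assumption chosen so that 4 queries per window yield the stated compression factor of 4, and a generic transformer decoder stands in for the actual Q-Former block:

```python
import torch
import torch.nn as nn

class WindowQFormerAdapter(nn.Module):
    """Illustrative window-level Q-Former adapter (hypothetical code).

    The encoder output is split into fixed windows and a small set of
    learned queries cross-attends to each window. With 4 queries per
    window of 16 frames, the sequence length shrinks by a factor of 4.
    """

    def __init__(self, feat_dim, llm_dim, num_queries=4, window=16,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        layer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats):                       # (B, T, feat_dim)
        B, T, D = feats.shape
        pad = (-T) % self.window                    # pad T to a window multiple
        feats = nn.functional.pad(feats, (0, 0, 0, pad))
        win = feats.view(B * (T + pad) // self.window, self.window, D)
        q = self.queries.unsqueeze(0).expand(win.size(0), -1, -1)
        out = self.qformer(q, win)                  # queries attend per window
        return self.proj(out.reshape(B, -1, D))    # (B, T/4 tokens, llm_dim)
```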

LLM backbone

| Model | Backbone |
| --- | --- |
| SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
| SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
| SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |

How to use

Currently, this model can only be used via our speechlmm codebase. Refer to the instructions there for more details.

Important: before you can use this model, you must download the SeamlessM4T v2 speech encoder and the Auto-AVSR video encoder by following the instructions provided in the README of the above repo. Please note that by doing so, you agree to their respective license terms.

Training Data

Monolingual

| Task | Task name | Dataset | Language | License |
| --- | --- | --- | --- | --- |
| ASR | Automatic Speech Recognition | LibriHeavy | en | CC-BY-4.0 |
| | | LibriTTS | en | CC-BY-4.0 |
| | | AMI | en | CC-BY-4.0 |
| | | ICSI | en | CC-BY-4.0 |
| VSR | Visual Speech Recognition | LRS2-BBC | en | Custom |
| SSUM | Speech Summarization | AMI | en | CC-BY-4.0 |
| | | ICSI | en | CC-BY-4.0 |
| SQA | Spoken Question Answering | Spoken SQuAD | en | CC-BY-SA-4.0 |
| SLU | Spoken Language Understanding | SLURP | en | CC-BY-4.0 (text), CC-BY-NC-4.0 (audio) |

Multilingual

| Task | Task name | Dataset | Language | License |
| --- | --- | --- | --- | --- |
| ASR | Automatic Speech Recognition | CoVoST2 | en, fr, it, de, es | CC0 |
| | | CommonVoice | en, fr, it, de, es | Apache-2.0 |
| ST | Speech-to-text Translation | CoVoST2 | en → de, {fr, it, de, es} → en | CC0 |
| | | EuroParl-ST | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
| MT | Machine Translation | EuroParl-ST | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
| TextInstruct | Text Instruction Following | Everything_Instruct_Multilingual | en, fr, it, de, es, ru, zh, ko, ur, la, ar, hi, ja, nl, pt | Apache-2.0 |
| SLU | Spoken Language Understanding | Speech-Massive | fr, de | CC-BY-NC-SA-4.0 |

Evaluation Results

The following results specifically refer to the L model.

ASR Metrics

| Dataset | Language | WER ⬇ |
| --- | --- | --- |
| MTEDX | es | 28 |
| MTEDX | it | 32.36 |
| MUSTC | en | 16.51 |
| ACL6060 | en | 17.79 |
| MTEDX | fr | 37.94 |
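WER (word error rate) scores like those above can be reproduced with any standard implementation; below is a minimal sketch using the jiwer package (an assumption for illustration, not necessarily the tool used to produce these numbers):

```python
# pip install jiwer
from jiwer import wer

# Placeholder transcripts; lower WER is better
references = ["hello world", "the cat sat on the mat"]
hypotheses = ["hello word", "the cat sat on a mat"]

print(f"WER: {100 * wer(references, hypotheses):.2f}")  # as a percentage
```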

SQA Metrics

| Dataset | Language | Accuracy ⬆ |
| --- | --- | --- |
| Spoken SQuAD | en | 73.59 |

NOTE: Accuracy is measured with an LLM as a judge (Llama3-70b-8192, via the Groq API) using the following prompts:

  • System prompt

    You are a helpful assistant that evaluates answers to questions given a certain context. You will be given inputs of the form:
    Context: <CONTEXT>
    Question: <QUESTION>
    Answer: <ANSWER>
    Your task is to determine if the given answer is correct or not, assuming the correct answer is contained in the context. Your response should be formatted as a JSON string having the following structure: {"correct_answer": <true/false>, "rationale": <RATIONALE>} where 'rationale' must be a string explaining why the answer is correct or incorrect. If you need to include double quote characters (") in the 'rationale' string, you must escape them with a backslash (\). For example, if you want to include the string "Hello, World!", you should write it as \"Hello, World!\".

  • User prompt

    Context: <CONTEXT>
    Question: <QUESTION>
    Answer: <ANSWER>
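A minimal sketch of such a judging call with the Groq Python client is shown below; the helper name and JSON handling are illustrative, while the model name and the prompt structure come from this card:

```python
# pip install groq
import json
from groq import Groq

SYSTEM_PROMPT = "..."  # the system prompt shown above, verbatim

client = Groq()  # reads GROQ_API_KEY from the environment

def judge(context: str, question: str, answer: str) -> bool:
    """Ask the judge model whether `answer` is correct given `context`."""
    user_prompt = f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"
    resp = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    # The judge is instructed to reply with a JSON string
    verdict = json.loads(resp.choices[0].message.content)
    return verdict["correct_answer"]
```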

MT Metrics

| Dataset | Source Language | Target Language | BLEU ⬆ | ChrF ⬆ |
| --- | --- | --- | --- | --- |
| FLORES | en | es | 21.25 | 50.39 |
| FLORES | en | it | 18.86 | 49.8 |
| FLORES | en | fr | 30.18 | 60.14 |
| ACL6060 | en | fr | 32.45 | 62.68 |
| FLORES | en | de | 24.93 | 55.07 |
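BLEU and ChrF scores such as those in this table (and in the ST table below) can be computed with sacrebleu; a minimal sketch with placeholder sentences, assuming sacrebleu was the scoring tool:

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["Le chat est assis sur le tapis ."]
references = [["Le chat est assis sur le tapis ."]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")  # higher is better
```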

SSUM Metrics

| Dataset | Language | ROUGE-1 F1 ⬆ | ROUGE-2 F1 ⬆ | ROUGE-L F1 ⬆ |
| --- | --- | --- | --- | --- |
| ICSI | en | 22.4 | 2.6 | 19.6 |
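The columns above are ROUGE F-measures; a minimal sketch with the rouge-score package (assumed tooling, with placeholder summaries):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(
    target="the committee agreed on the final design",    # reference summary
    prediction="the committee approved the final design"  # model summary
)
for name, s in scores.items():
    print(f"{name}: F1 = {100 * s.fmeasure:.1f}")  # higher is better
```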

ST Metrics

| Dataset | Source Language | Target Language | BLEU ⬆ | ChrF ⬆ |
| --- | --- | --- | --- | --- |
| MUSTC | en | de | 17.87 | 46.51 |
| MUSTC | en | it | 15.33 | 43.81 |
| MUSTC | en | fr | 21.88 | 49.51 |
| ACL6060 | en | fr | 27.12 | 55.88 |
| MUSTC | en | es | 22.05 | 49.8 |
| ACL6060 | en | de | 21.63 | 51.46 |

Framework versions

  • Transformers 4.45.0
  • PyTorch 2.3.1+cu124.post2
  • Datasets 3.2.0
  • Tokenizers 0.20.0