Model information
SpeechLMM 1.0 is a collection of multimodal and multilingual instruction-tuned generative models in 4 different sizes: S (2B), M (4B), L (9B) and XL (71B). The models support text, audio and video as input and only text as output, and are optimized for various X-to-text generation tasks, namely:
- Machine Translation
- Automatic Speech Recognition
- Speech Translation
- Speech Summarization
- Spoken Question Answering
- Spoken Language Understanding (beta)
- Visual Speech Recognition (beta)
Model Developer: Meetween consortium
Supported Languages: English, French, Italian, German, and Spanish are officially supported (for a subset of the supported tasks). The Llama 3.X backbone and the SeamlessM4T v2 audio encoder were trained on a broader collection of languages, so the models may exhibit good performance on other languages too.
Model Release Date: Feb 28, 2025
License: see LICENSE
Model Architecture
SpeechLMM 1.0 is an auto-regressive multimodal language model based on a Llama 3.X backbone (X varies with the model size), a speech-specific stack consisting of a pre-trained audio encoder (SeamlessM4T v2) followed by an audio adapter, and a video-specific stack consisting of a pre-trained video encoder (Auto-AVSR) followed by a video adapter.
Model | Params | Input modalities | Output modalities | Context Length |
---|---|---|---|---|
SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual Text | 128k |
SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual Text | 128k |
SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual Text | 128k |
SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual Text | 128k |
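The dataflow described above can be sketched as follows. This is a minimal, hypothetical PyTorch sketch of how the encoders, adapters and LLM backbone compose; all module and argument names are illustrative assumptions, and the actual implementation lives in the `speechlmm` codebase.

```python
# Hypothetical sketch of the SpeechLMM 1.0 dataflow (not the real code).
import torch
import torch.nn as nn

class SpeechLMMSketch(nn.Module):
    def __init__(self, audio_encoder, audio_adapter, video_encoder, video_adapter, llm):
        super().__init__()
        self.audio_encoder = audio_encoder  # pre-trained SeamlessM4T v2 speech encoder
        self.audio_adapter = audio_adapter  # MLP projecting audio features to the LLM hidden size
        self.video_encoder = video_encoder  # pre-trained Auto-AVSR encoder
        self.video_adapter = video_adapter  # window-level Q-Former compressing video features
        self.llm = llm                      # Llama 3.X backbone

    def forward(self, text_embeds, audio=None, video=None):
        # Encode each non-text modality, map it into the LLM embedding
        # space, and prepend the resulting embeddings to the text ones.
        parts = []
        if audio is not None:
            parts.append(self.audio_adapter(self.audio_encoder(audio)))
        if video is not None:
            parts.append(self.video_adapter(self.video_encoder(video)))
        parts.append(text_embeds)
        inputs_embeds = torch.cat(parts, dim=1)  # (batch, seq, hidden)
        return self.llm(inputs_embeds=inputs_embeds)  # text-only output
```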
Audio and video encoders
For all 4 sizes of SpeechLMM 1.0, the audio encoder is SeamlessM4T v2 Large (`facebook/seamless-m4t-v2-large`) and the video encoder is Auto-AVSR (`vsr_trlrs3vox2_base`).
Audio and video adapters
For all 4 sizes of SpeechLMM 1.0, the audio and video adapters are:
Modality | Architecture | Number of layers | Compression factor |
---|---|---|---|
Audio | MLP | 4 | 1 |
Video | Window-level Q-Former (4 queries) | 4 | 4 |
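The compression factor is the ratio between the number of encoder output frames and the number of embeddings handed to the LLM: the audio MLP preserves sequence length (factor 1), while the video Q-Former emits a fixed number of queries per window of frames (factor 4). A minimal illustrative sketch follows; the hidden sizes and window size below are assumptions, not values taken from the speechlmm codebase.

```python
# Illustrative only: how compression factors 1 and 4 affect sequence length.
import torch
import torch.nn as nn

enc_dim, llm_dim = 1024, 4096  # assumed encoder / LLM hidden sizes

# Audio adapter: per-frame MLP, compression factor 1 (length preserved).
audio_adapter = nn.Sequential(
    nn.Linear(enc_dim, llm_dim), nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
frames = torch.randn(1, 100, enc_dim)  # 100 encoder frames
print(audio_adapter(frames).shape)     # torch.Size([1, 100, 4096])

# Video adapter: a window-level Q-Former with 4 queries per window of,
# say, 16 frames maps every 16 frames to 4 embeddings -> factor 4.
window_size, num_queries = 16, 4
out_len = frames.shape[1] // window_size * num_queries
print(out_len)  # 100 // 16 * 4 = 24 embeddings instead of 100
```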
LLM backbone
Model | Backbone |
---|---|
SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |
How to use
Currently, this model can only be used via our `speechlmm` codebase. Refer to the instructions there for more details.
Important: before you can use this model, you must download the SeamlessM4T v2 speech encoder and the Auto-AVSR video encoder by following the instructions in the README of the above repo. Please note that by doing so, you agree to their respective license terms.
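For reference, the SeamlessM4T v2 checkpoint is hosted on the Hugging Face Hub, so it can be fetched with the standard `huggingface_hub` API as sketched below; where the downloaded files must be placed, and how to obtain the Auto-AVSR checkpoint (`vsr_trlrs3vox2_base`), is specified in the speechlmm README.

```python
# Hedged example: downloading the SeamlessM4T v2 checkpoint from the Hub.
# Consult the speechlmm README for the expected local layout; this only
# fetches the files into the local Hugging Face cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="facebook/seamless-m4t-v2-large")
print(local_dir)  # cached location of the downloaded snapshot
```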
Training Data
Monolingual
Task | Task name | Dataset | Language | License |
---|---|---|---|---|
ASR | Automatic Speech Recognition | LibriHeavy | en | CC-BY-4.0 |
ASR | Automatic Speech Recognition | LibriTTS | en | CC-BY-4.0 |
ASR | Automatic Speech Recognition | AMI | en | CC-BY-4.0 |
ASR | Automatic Speech Recognition | ICSI | en | CC-BY-4.0 |
VSR | Visual Speech Recognition | LRS2-BBC | en | Custom |
SSUM | Speech Summarization | AMI | en | CC-BY-4.0 |
SSUM | Speech Summarization | ICSI | en | CC-BY-4.0 |
SQA | Spoken Question Answering | Spoken SQuAD | en | CC-BY-SA-4.0 |
SLU | Spoken Language Understanding | SLURP | en | CC-BY-4.0 (text), CC-BY-NC-4.0 (audio) |
Multilingual
Task | Task name | Dataset | Language | License |
---|---|---|---|---|
ASR | Automatic Speech Recognition | CoVoST2 | en, fr, it, de, es | CC0 |
ASR | Automatic Speech Recognition | CommonVoice | en, fr, it, de, es | Apache-2.0 |
ST | Speech-to-text Translation | CoVoST2 | en → de, {fr, it, de, es} → en | CC0 |
ST | Speech-to-text Translation | EuroParl-ST | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
MT | Machine Translation | EuroParl-ST | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
TextInstruct | Text Instruction Following | Everything_Instruct_Multilingual | en, fr, it, de, es, ru, zh, ko, ur, la, ar, hi, ja, nl, pt | Apache-2.0 |
SLU | Spoken Language Understanding | Speech-MASSIVE | fr, de | CC-BY-NC-SA-4.0 |
Evaluation Results
The following results refer specifically to the L (9B) model.
ASR Metrics
Dataset | Language | WER ⬇ |
---|---|---|
MTEDX | es | 28 |
MTEDX | it | 32.36 |
MUSTC | en | 16.51 |
ACL6060 | en | 17.79 |
MTEDX | fr | 37.94 |
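WER (word error rate) is the word-level edit distance between hypothesis and reference transcripts, normalized by reference length; lower is better. The card does not state which implementation produced the numbers above; a common choice is the `jiwer` package:

```python
# Hedged example: computing WER with jiwer (one common implementation;
# not necessarily the one used for the scores above).
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"
print(jiwer.wer(reference, hypothesis))  # 1 substitution / 6 words ≈ 0.167
```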
SQA Metrics
Dataset | Language | Accuracy ⬆ |
---|---|---|
Spoken SQuAD | en | 73.59 |
NOTE: Accuracy is measured with an LLM as a judge (Llama3-70b-8192, via the Groq API) using the following prompts:
System prompt
You are a helpful assistant that evaluates answers to questions given a certain context. You will be given inputs of the form:
Context: <CONTEXT>
Question: <QUESTION>
Answer: <ANSWER>
Your task is to determine if the given answer is correct or not, assuming the correct answer is contained in the context. Your response should be formatted as a JSON string having the following structure: {"correct_answer": <true/false>, "rationale": <RATIONALE>} where 'rationale' must be a string explaining why the answer is correct or incorrect. If you need to include double quote characters (") in the 'rationale' string, you must escape them with a backslash (\). For example, if you want to include the string "Hello, World!", you should write it as \"Hello, World!\".

User prompt
Context: <CONTEXT>
Question: <QUESTION>
Answer: <ANSWER>
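For reference, a judging call along these lines would match the setup described above. This is a sketch only: the evaluation scripts are not published in this card, and the use of the Groq Python SDK is an assumption; the model name and prompt structure are taken from the text above.

```python
# Hypothetical sketch of the LLM-as-a-judge call (SDK usage is assumed;
# the model name and prompts come from this model card).
import json
from groq import Groq

SYSTEM_PROMPT = "You are a helpful assistant that evaluates answers ..."  # full system prompt from above

client = Groq()  # reads GROQ_API_KEY from the environment
response = client.chat.completions.create(
    model="llama3-70b-8192",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Context: <CONTEXT>\nQuestion: <QUESTION>\nAnswer: <ANSWER>"},
    ],
)
verdict = json.loads(response.choices[0].message.content)
print(verdict["correct_answer"], verdict["rationale"])
```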
MT Metrics
Dataset | Source Language | Target Language | BLEU ⬆ | chrF ⬆ |
---|---|---|---|---|
FLORES | en | es | 21.25 | 50.39 |
FLORES | en | it | 18.86 | 49.8 |
FLORES | en | fr | 30.18 | 60.14 |
ACL6060 | en | fr | 32.45 | 62.68 |
FLORES | en | de | 24.93 | 55.07 |
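BLEU and chrF scores like those above (and in the ST table below) are conventionally computed with sacreBLEU; the card does not state the exact scorer or configuration used, so treat this as a reproduction sketch:

```python
# Hedged example: corpus-level BLEU and chrF with sacrebleu (the exact
# scorer/configuration behind the numbers above is not stated).
import sacrebleu

hypotheses = ["Le chat est assis sur le tapis."]
references = [["Le chat est assis sur le tapis."]]  # one list per reference set

print(sacrebleu.corpus_bleu(hypotheses, references).score)  # 100.0 for an exact match
print(sacrebleu.corpus_chrf(hypotheses, references).score)
```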
SSUM Metrics
Dataset | Language | ROUGE-1 F1 ⬆ | ROUGE-2 F1 ⬆ | ROUGE-L F1 ⬆ |
---|---|---|---|---|
ICSI | en | 22.4 | 2.6 | 19.6 |
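The three columns are ROUGE F1 scores between generated and reference summaries. A standard way to compute them is the `rouge-score` package (an assumption; the implementation actually used is not stated):

```python
# Hedged example: ROUGE-1/2/L F1 with rouge-score (one common choice).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the committee agreed on the final remote control design",
    prediction="the committee agreed on a final design",
)
for name, score in scores.items():
    print(name, round(score.fmeasure, 3))  # .fmeasure is the F1 component
```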
ST Metrics
Dataset | Source Language | Target Language | BLEU ⬆ | chrF ⬆ |
---|---|---|---|---|
MUSTC | en | de | 17.87 | 46.51 |
MUSTC | en | it | 15.33 | 43.81 |
MUSTC | en | fr | 21.88 | 49.51 |
ACL6060 | en | fr | 27.12 | 55.88 |
MUSTC | en | es | 22.05 | 49.8 |
ACL6060 | en | de | 21.63 | 51.46 |
Framework versions
- Transformers 4.45.0
- PyTorch 2.3.1+cu124.post2
- Datasets 3.2.0
- Tokenizers 0.20.0