Model information

SpeechLMM 1.0 is a collection of multimodal and multilingual instruction-tuned generative large language models in 4 different sizes: S (2B), M (4B), L (9B) and XL (71B), supporting text, audio and video as input and only text as output. The SpeechLMM 1.0 models are optimized for various X-to-text generation tasks, namely:

  • Machine Translation
  • Automatic Speech Recognition
  • Speech Translation
  • Speech Summarization
  • Spoken Question Answering
  • Spoken Language Understanding (beta)
  • Visual Speech Recognition (beta)

Model Developer: Meetween consortium

Supported Languages: English, French, Italian, German, and Spanish are officially supported (each for a subset of the tasks listed above). The Llama 3.X backbone and the SeamlessM4T v2 audio encoder were trained on a broader collection of languages than these 5, so the model might exhibit good performance on other languages too.

Model Release Date: Feb 28, 2025

License: see LICENSE

Model Architecture

SpeechLMM 1.0 is an auto-regressive multimodal language model based on a Llama 3.X backbone (X varies with the model size), a speech-specific stack consisting of a pre-trained audio encoder (SeamlessM4T v2) and an audio adapter, and a video-specific stack consisting of a pre-trained video encoder (Auto-AVSR) and a video adapter.
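The high-level composition can be sketched as follows. This is a minimal, illustrative PyTorch sketch, not the actual speechlmm implementation; all class and argument names are hypothetical:

```python
import torch
import torch.nn as nn

class SpeechLMMSketch(nn.Module):
    """Illustrative composition of SpeechLMM 1.0 (names are hypothetical,
    not the speechlmm API)."""

    def __init__(self, audio_encoder, audio_adapter,
                 video_encoder, video_adapter, llm):
        super().__init__()
        self.audio_encoder = audio_encoder  # pre-trained SeamlessM4T v2
        self.audio_adapter = audio_adapter  # maps audio features to LLM dim
        self.video_encoder = video_encoder  # pre-trained Auto-AVSR
        self.video_adapter = video_adapter  # maps video features to LLM dim
        self.llm = llm                      # Llama 3.X backbone

    def forward(self, text_embeds, audio=None, video=None):
        parts = []
        if audio is not None:
            parts.append(self.audio_adapter(self.audio_encoder(audio)))
        if video is not None:
            parts.append(self.video_adapter(self.video_encoder(video)))
        parts.append(text_embeds)
        # Adapted audio/video tokens are concatenated with the text
        # embeddings and decoded auto-regressively into text.
        inputs = torch.cat(parts, dim=1)
        return self.llm(inputs_embeds=inputs)
```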

| Model | Params | Input modalities | Output modalities | Context Length |
| --- | --- | --- | --- | --- |
| SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual text | 128k |
| SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual text | 128k |
| SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual text | 128k |
| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual text | 128k |

Audio and video encoders

For all 4 sizes of SpeechLMM 1.0, the audio encoder is SeamlessM4T v2 Large (facebook/seamless-m4t-v2-large) and the video encoder is Auto-AVSR (vsr_trlrs3vox2_base).
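For reference, the public SeamlessM4T v2 checkpoint can be inspected with the Hugging Face transformers library. The snippet below only illustrates where the speech encoder sub-module lives; the speechlmm codebase has its own download procedure, described in its README:

```python
from transformers import AutoProcessor, SeamlessM4Tv2Model

# Load the public SeamlessM4T v2 Large checkpoint from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# SpeechLMM 1.0 builds on the speech encoder sub-module of this model
speech_encoder = model.speech_encoder
print(sum(p.numel() for p in speech_encoder.parameters()))  # encoder size
```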

Audio and video adapters

For all 4 sizes of SpeechLMM 1.0, the audio and video adapters are:

| Modality | Architecture | Number of layers | Compression factor |
| --- | --- | --- | --- |
| Audio | MLP | 4 | 1 |
| Video | Window-level Q-Former (4 queries) | 4 | 4 |
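To make the table concrete, below is an illustrative sketch of a window-level Q-Former adapter. It is not the speechlmm implementation: the window size of 16 is an assumption chosen so that 4 queries per window yield the stated compression factor of 4, and a generic transformer decoder stands in for the actual Q-Former block:

```python
import torch
import torch.nn as nn

class WindowQFormerAdapter(nn.Module):
    """Illustrative window-level Q-Former adapter (hypothetical code).

    The encoder output is split into fixed windows and a small set of
    learned queries cross-attends to each window. With 4 queries per
    window of 16 frames, the sequence length shrinks by a factor of 4.
    """

    def __init__(self, feat_dim, llm_dim, num_queries=4, window=16,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        layer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats):                       # (B, T, feat_dim)
        B, T, D = feats.shape
        pad = (-T) % self.window                    # pad T to a window multiple
        feats = nn.functional.pad(feats, (0, 0, 0, pad))
        win = feats.view(B * (T + pad) // self.window, self.window, D)
        q = self.queries.unsqueeze(0).expand(win.size(0), -1, -1)
        out = self.qformer(q, win)                  # queries attend per window
        return self.proj(out.reshape(B, -1, D))    # (B, T/4 tokens, llm_dim)
```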

LLM backbone

| Model | Backbone |
| --- | --- |
| SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
| SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
| SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |

How to use

Currently, this model can only be used via our speechlmm codebase. Refer to the instructions there for more details.

Important: before you can use this model, you must download the SeamlessM4T v2 speech encoder and the Auto-AVSR video encoder by following the instructions provided in the README of the above repo. Please note that by doing so, you agree to their respective license terms.

Training Data

Monolingual

| Task | Task name | Dataset | Language | License |
| --- | --- | --- | --- | --- |
| ASR | Automatic Speech Recognition | LibriHeavy | en | CC-BY-4.0 |
| | | LibriTTS | en | CC-BY-4.0 |
| | | AMI | en | CC-BY-4.0 |
| | | ICSI | en | CC-BY-4.0 |
| VSR | Visual Speech Recognition | LRS2-BBC | en | Custom |
| SSUM | Speech Summarization | AMI | en | CC-BY-4.0 |
| | | ICSI | en | CC-BY-4.0 |
| SQA | Spoken Question Answering | Spoken SQuAD | en | CC-BY-SA-4.0 |
| SLU | Spoken Language Understanding | SLURP | en | CC-BY-4.0 (text), CC-BY-NC-4.0 (audio) |

Multilingual

| Task | Task name | Dataset | Language | License |
| --- | --- | --- | --- | --- |
| ASR | Automatic Speech Recognition | CoVoST2 | en, fr, it, de, es | CC0 |
| | | CommonVoice | en, fr, it, de, es | Apache-2.0 |
| ST | Speech-to-text Translation | CoVoST2 | en → de, {fr, it, de, es} → en | CC0 |
| | | EuroParl-ST | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
| MT | Machine Translation | EuroParl-ST | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
| TextInstruct | Text Instruction Following | Everything_Instruct_Multilingual | en, fr, it, de, es, ru, zh, ko, ur, la, ar, hi, ja, nl, pt | Apache-2.0 |
| SLU | Spoken Language Understanding | Speech-Massive | fr, de | CC-BY-NC-SA-4.0 |

Evaluation Results

The following results specifically refer to the L model.

ASR Metrics

| Dataset | Language | WER ⬇ |
| --- | --- | --- |
| MTEDX | es | 28 |
| MTEDX | it | 32.36 |
| MUSTC | en | 16.51 |
| ACL6060 | en | 17.79 |
| MTEDX | fr | 37.94 |
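WER (word error rate) scores like those above can be reproduced with any standard implementation; below is a minimal sketch using the jiwer package (an assumption for illustration, not necessarily the tool used to produce these numbers):

```python
# pip install jiwer
from jiwer import wer

# Placeholder transcripts; lower WER is better
references = ["hello world", "the cat sat on the mat"]
hypotheses = ["hello word", "the cat sat on a mat"]

print(f"WER: {100 * wer(references, hypotheses):.2f}")  # as a percentage
```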

SQA Metrics

| Dataset | Language | Accuracy ⬆ |
| --- | --- | --- |
| Spoken SQuAD | en | 73.59 |

NOTE: Accuracy is measured with an LLM as a judge (Llama3-70b-8192, via the Groq API) using the following prompts:

  • System prompt

    You are a helpful assistant that evaluates answers to questions given a certain context. You will be given inputs of the form:
    Context: <CONTEXT>
    Question: <QUESTION>
    Answer: <ANSWER>
    Your task is to determine if the given answer is correct or not, assuming the correct answer is contained in the context. Your response should be formatted as a JSON string having the following structure: {"correct_answer": <true/false>, "rationale": <RATIONALE>} where 'rationale' must be a string explaining why the answer is correct or incorrect. If you need to include double quote characters (") in the 'rationale' string, you must escape them with a backslash (\). For example, if you want to include the string "Hello, World!", you should write it as \"Hello, World!\".

  • User prompt

    Context: <CONTEXT>
    Question: <QUESTION>
    Answer: <ANSWER>
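A minimal sketch of such a judging call with the Groq Python client is shown below; the helper name and JSON handling are illustrative, while the model name and the prompt structure come from this card:

```python
# pip install groq
import json
from groq import Groq

SYSTEM_PROMPT = "..."  # the system prompt shown above, verbatim

client = Groq()  # reads GROQ_API_KEY from the environment

def judge(context: str, question: str, answer: str) -> bool:
    """Ask the judge model whether `answer` is correct given `context`."""
    user_prompt = f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"
    resp = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    # The judge is instructed to reply with a JSON string
    verdict = json.loads(resp.choices[0].message.content)
    return verdict["correct_answer"]
```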

MT Metrics

| Dataset | Source Language | Target Language | BLEU ⬆ | ChrF ⬆ |
| --- | --- | --- | --- | --- |
| FLORES | en | es | 21.25 | 50.39 |
| FLORES | en | it | 18.86 | 49.8 |
| FLORES | en | fr | 30.18 | 60.14 |
| ACL6060 | en | fr | 32.45 | 62.68 |
| FLORES | en | de | 24.93 | 55.07 |
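BLEU and ChrF scores such as those in this table (and in the ST table below) can be computed with sacrebleu; a minimal sketch with placeholder sentences, assuming sacrebleu was the scoring tool:

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["Le chat est assis sur le tapis ."]
references = [["Le chat est assis sur le tapis ."]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")  # higher is better
```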

SSUM Metrics

| Dataset | Language | ROUGE-1 F1 ⬆ | ROUGE-2 F1 ⬆ | ROUGE-L F1 ⬆ |
| --- | --- | --- | --- | --- |
| ICSI | en | 22.4 | 2.6 | 19.6 |
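The columns above are ROUGE F-measures; a minimal sketch with the rouge-score package (assumed tooling, with placeholder summaries):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(
    target="the committee agreed on the final design",    # reference summary
    prediction="the committee approved the final design"  # model summary
)
for name, s in scores.items():
    print(f"{name}: F1 = {100 * s.fmeasure:.1f}")  # higher is better
```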

ST Metrics

| Dataset | Source Language | Target Language | BLEU ⬆ | ChrF ⬆ |
| --- | --- | --- | --- | --- |
| MUSTC | en | de | 17.87 | 46.51 |
| MUSTC | en | it | 15.33 | 43.81 |
| MUSTC | en | fr | 21.88 | 49.51 |
| ACL6060 | en | fr | 27.12 | 55.88 |
| MUSTC | en | es | 22.05 | 49.8 |
| ACL6060 | en | de | 21.63 | 51.46 |

Framework versions

  • Transformers 4.45.0
  • PyTorch 2.3.1+cu124.post2
  • Datasets 3.2.0
  • Tokenizers 0.20.0