Whisper-Large-V3-Distil-Multi7-v0.2
A multilingual distilled Whisper model with 2 decoder layers, supporting 7 European languages: English, French, Spanish, German, Italian, Portuguese, and Dutch.
The model was trained during my work on Distil-Large-v3.5.
A notable feature is its native support for code-switching. The model can switch languages within a single segment by automatically emitting a new language token when it detects a language change (as demonstrated in the following example).
The <|yue|> language token was repurposed during training to act as an automatic language detection token that enables code-switching during inference. To use this feature, set the language parameter to cantonese (used by default).
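As a minimal sketch of the setting described above, the helper below builds keyword arguments that could be passed to `model.generate`. The helper name `build_generate_kwargs` and the constant `AUTO_DETECT_LANGUAGE` are hypothetical and introduced only for illustration. The sketch assumes a transformers version whose Whisper tokenizer recognizes the Cantonese language name, and it assumes that passing another language name forces that language as in the original Whisper, which this card does not confirm.

```python
from typing import Optional

# Language name that this model repurposes for automatic detection (per the text above).
AUTO_DETECT_LANGUAGE = "cantonese"


def build_generate_kwargs(language: Optional[str] = None, max_new_tokens: int = 128) -> dict:
    """Return keyword arguments for `model.generate` (hypothetical helper)."""
    return {
        # Fall back to the repurposed token when no language is requested.
        "language": language or AUTO_DETECT_LANGUAGE,
        "task": "transcribe",
        "max_new_tokens": max_new_tokens,
    }


# Automatic language detection with code-switching.
auto_kwargs = build_generate_kwargs()
# Assumption: forcing a single language, here French.
french_kwargs = build_generate_kwargs("french")
```

The resulting dictionary would be unpacked into the call, for example `model.generate(input_features, **auto_kwargs)`.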
The model's performance is below both the monolingual distilled version and Whisper-Large-v3-Turbo. Future work should investigate better training procedures and possibly incorporate more data to effectively compress multilingual capabilities into a single model.
Usage
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Use the GPU with half precision when available, otherwise the CPU with full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the processor and the model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi7-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Load one sample from the "cs" configuration of the dummy dataset
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]
# Print the reference text
print(text)

# Extract input features from the raw audio
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate token IDs
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)

# Decode without special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

# Decode again, keeping special tokens such as the language tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
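When special tokens are kept in the decoded string, the language tokens can be used to recover which language each part of the transcription belongs to. The following is a rough sketch under stated assumptions, not part of the model's API: the function `split_by_language` is hypothetical, the example string is hand-written rather than real model output, and the exact placement of language tokens in the output is assumed to follow the usual Whisper `<|xx|>` convention.

```python
import re
from typing import List, Optional, Tuple

# Matches Whisper-style special tokens such as <|en|> or <|startoftranscript|>.
SPECIAL_TOKEN = re.compile(r"<\|([^|]*)\|>")

# Codes of the seven languages listed in this model card.
LANGUAGE_CODES = {"en", "fr", "es", "de", "it", "pt", "nl"}


def split_by_language(decoded: str) -> List[Tuple[str, str]]:
    """Split a decoded string into (language_code, text) segments."""
    segments: List[Tuple[str, str]] = []
    current_lang: Optional[str] = None
    # With one capture group, re.split alternates: text, token name, text, ...
    parts = SPECIAL_TOKEN.split(decoded)
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd positions hold token names; only language codes change the segment language.
            if part in LANGUAGE_CODES:
                current_lang = part
            continue
        text = part.strip()
        # Text seen before any language token is ignored.
        if text and current_lang is not None:
            segments.append((current_lang, text))
    return segments


# Hand-written example string, not real model output.
example = (
    "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> I said "
    "<|fr|> bonjour tout le monde<|endoftext|>"
)
print(split_by_language(example))
```

Other special tokens, such as the task token, are skipped and do not change the current language.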
Evaluation
English
| Model | LIUM_tedlium | mcv17 | voxpopuli | fleurs | kensho_spgispeech | librispeech-test_clean | librispeech-test_other | speechcolab_gigaspeech |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 10.58 | 10.13 | 8.93 | 5.72 | 2.95 | 1.87 | 3.58 | 10.07 |
| openai/whisper-large-v3-turbo | 10.20 | 11.74 | 11.78 | 6.13 | 2.95 | 1.98 | 3.94 | 10.11 |
| distil-whisper/distil-large-v3 | 8.93 | 12.41 | 7.72 | 7.59 | 3.25 | 2.42 | 5.11 | 10.08 |
| distil-whisper/distil-large-v3.5 | 8.65 | 11.07 | 7.54 | 6.74 | 2.86 | 2.28 | 4.94 | 9.84 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 8.88 | 11.33 | 7.60 | 6.97 | 3.03 | 2.51 | 5.24 | 10.12 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 9.36 | 11.32 | 7.65 | 7.02 | 2.99 | 2.46 | 5.24 | 10.06 |
French
| Model | mcv17 | mls | voxpopuli | mtedx | af_accented | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 10.98 | 4.69 | 11.15 | 8.67 | 7.51 | 5.4 | 9.87 | 8.97 | 9 | 8.01 |
| openai/whisper-large-v3-turbo | 12.41 | 5.1 | 12.21 | 9.87 | 8.37 | 5.48 | 10.12 | 9 | 8.49 | 8.39 |
| bofenghuang/whisper_large_v3_distil_fr_v0.2 | 11.1 | 5 | 10.68 | 8.75 | 7.09 | 6.35 | 9.44 | 9.84 | 8.94 | 8.93 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 11.96 | 6.04 | 11.07 | 9.16 | 7.99 | 7.10 | 10.42 | 12.61 | 9.06 | 11.75 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 12.19 | 6.2 | 11.29 | 9.13 | 8.26 | 7.17 | 10.04 | 12.26 | 8.93 | 11.56 |
Spanish
| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 4.91 | 3.97 | 11.06 | 6.52 | 4.22 | 10.85 | 10.36 | 5.90 | 5.22 |
| openai/whisper-large-v3-turbo | 5.74 | 4.41 | 16.02 | 6.66 | 4.59 | 11.55 | 10.68 | 6.46 | 5.41 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 5.58 | 4.34 | 8.52 | 7.43 | 5.20 | 11.26 | 13.43 | 5.69 | 8.95 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 5.70 | 4.35 | 8.55 | 7.56 | 5.15 | 11.45 | 13.54 | 5.84 | 8.27 |
German
| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 6.11 | 5.60 | 17.75 | 19.63 | 5.92 | 11.21 | 10.35 | 17.64 | 17.76 |
| openai/whisper-large-v3-turbo | 7.45 | 6.43 | 20.48 | 20.00 | 6.45 | 10.57 | 9.70 | 18.04 | 18.37 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 7.31 | 6.45 | 12.41 | 21.48 | 8.20 | 11.04 | 13.55 | 19.54 | 21.76 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 7.57 | 6.67 | 12.42 | 21.95 | 8.28 | 11.21 | 13.84 | 19.90 | 21.67 |
Italian
| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 5.71 | 9.58 | 28.45 | 7.21 | 4.28 | 6.95 | 6.37 | 6.83 | 7.28 |
| openai/whisper-large-v3-turbo | 6.77 | 10.64 | 30.69 | 7.41 | 4.69 | 6.88 | 6.52 | 7.98 | 7.37 |
| bofenghuang/whisper_large_v3_distil_it_v0.2 | 6.15 | 9.22 | 17.27 | 7.52 | 5.26 | 6.06 | 6.99 | 7.84 | 8.42 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.78 | 11.42 | 17.53 | 8.07 | 5.68 | 7.04 | 9.51 | 7.51 | 10.47 |
Portuguese
| Model | mcv17 | mls | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 6.76 | 7.04 | 8.91 | 5.86 | 12.11 | 12.39 | 8.70 | 8.98 |
| openai/whisper-large-v3-turbo | 7.66 | 6.64 | 8.84 | 6.11 | 12.42 | 11.62 | 10.97 | 9.04 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 8.31 | 6.75 | 10.11 | 7.10 | 12.74 | 14.97 | 9.64 | 11.78 |
Dutch
| Model | mcv17 | mls | voxpopuli | fleurs |
| --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 4.51 | 66.95 | 23.35 | 6.99 |
| openai/whisper-large-v3-turbo | 6.16 | 52.37 | 27.42 | 7.59 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.76 | 14.82 | 14.92 | 10.86 |