Whisper-Large-V3-Distil-Multi7-v0.2

A multilingual distilled Whisper model with 2 decoder layers, supporting 7 European languages: English, French, Spanish, German, Italian, Portuguese, and Dutch.

The model was trained during my work on Distil-Large-v3.5.

A notable feature is its native support for code-switching: the model can switch languages within a single transcribed segment, emitting a new language token whenever it detects a language change (as demonstrated in the example under Usage).

During training, the <|yue|> language token was repurposed as an automatic language detection token, which enables code-switching at inference time. To use this feature, set the language parameter to cantonese (this is the default).
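
As a minimal sketch of setting this explicitly, the helper below passes `language="cantonese"` to `generate`. It assumes the language parameter mentioned above is the `language` keyword of the Transformers `generate` method, and that `model`, `processor`, and `input_features` are prepared as in the Usage section. The function name `transcribe_code_switched` is hypothetical and only used for illustration.

```python
def transcribe_code_switched(model, processor, input_features, max_new_tokens=128):
    """Transcribe with the repurposed Cantonese token so the model picks languages itself."""
    predicted_ids = model.generate(
        input_features,
        language="cantonese",  # assumed to map to <|yue|>, the auto-detection token
        max_new_tokens=max_new_tokens,
    )
    # Keep special tokens so the emitted language tokens stay visible in the text.
    return processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
```

Since cantonese is already the default, this should behave like a plain `generate` call; spelling it out mainly makes the intent visible in the code.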

The model's performance is below both the monolingual distilled version and Whisper-Large-v3-Turbo. Future work should investigate better training procedures and possibly incorporate more data to effectively compress multilingual capabilities into a single model.

Usage

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi7-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]

# Ground truth text
print(text)
# Aber sei ihnen nicht böse, Habibi, vergib ihnen, sie vergaßen die Liebe, sie vergaßen die Bibel, 
# wünsch ihnen den Frieden. Nous allons construire des radiotélescopes géants comme celui-ci, 
# qui est mon préféré. Questa è un'immagine di Cairo Open City, una mostra che il museo Folkwang di 
# Essen ha dedicato al ruolo della mobile photography nella primavera Araba.

# Extract features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features


# Generate tokens
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
#  Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe. 
# Wünsche ihnen dem Frieden. Nous allons construire des radiotelescopes géants, comme celui-ci qui 
# est mon préféré. Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen 
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.

# Inspect the generated tokens, including the language tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
# <|de|> Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe. 
# Wünsche ihnen dem Frieden.<|fr|> Nous allons construire des radiotelescopes géants, comme celui-ci qui 
# est mon préféré.<|es|> Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen 
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.
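
The language tokens in the decoded output above can be used to split a code-switched transcription into per-language segments. The following is a minimal sketch and not part of the model's API: `split_by_language` is a hypothetical helper, and it assumes that language tokens are two- or three-letter codes wrapped in `<|...|>`, as in the example output. Any other special tokens are simply dropped.

```python
import re

# Language tokens such as <|de|>, <|fr|>, <|yue|> (assumed to be 2-3 lowercase letters).
LANG_TOKEN = re.compile(r"<\|([a-z]{2,3})\|>")
# Any remaining special token, e.g. <|startoftranscript|> or <|endoftext|>.
OTHER_SPECIAL = re.compile(r"<\|[^|]*\|>")


def split_by_language(decoded):
    """Return a list of (language_code, text) pairs from a decoded transcription."""
    # With one capture group, re.split yields [prefix, lang1, text1, lang2, text2, ...].
    parts = LANG_TOKEN.split(decoded)
    segments = []
    for lang, text in zip(parts[1::2], parts[2::2]):
        text = OTHER_SPECIAL.sub("", text).strip()
        if text:
            segments.append((lang, text))
    return segments


example = "<|de|> Wünsche ihnen dem Frieden.<|fr|> Nous allons construire.<|es|> Esta es una imagen."
for lang, text in split_by_language(example):
    print(lang, "->", text)
```

Applied to the full output printed above, this should yield one German, one French, and one Spanish segment.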

Evaluation

English

| Model | LIUM_tedlium | mcv17 | voxpopuli | fleurs | kensho_spgispeech | librispeech-test_clean | librispeech-test_other | speechcolab_gigaspeech |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 10.58 | 10.13 | 8.93 | 5.72 | 2.95 | 1.87 | 3.58 | 10.07 |
| openai/whisper-large-v3-turbo | 10.20 | 11.74 | 11.78 | 6.13 | 2.95 | 1.98 | 3.94 | 10.11 |
| distil-whisper/distil-large-v3 | 8.93 | 12.41 | 7.72 | 7.59 | 3.25 | 2.42 | 5.11 | 10.08 |
| distil-whisper/distil-large-v3.5 | 8.65 | 11.07 | 7.54 | 6.74 | 2.86 | 2.28 | 4.94 | 9.84 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 8.88 | 11.33 | 7.60 | 6.97 | 3.03 | 2.51 | 5.24 | 10.12 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 9.36 | 11.32 | 7.65 | 7.02 | 2.99 | 2.46 | 5.24 | 10.06 |
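
To compare models at a glance, the English scores above can be collapsed into one unweighted mean per model. This is only a rough sketch: the macro-average is not a figure reported by the author, it weights every test set equally regardless of size, and it assumes that lower scores are better. The names `english_scores` and `macro_average` are made up for this example.

```python
from statistics import mean

# Scores copied from the English table, in column order
# (LIUM_tedlium ... speechcolab_gigaspeech).
english_scores = {
    "openai/whisper-large-v3": [10.58, 10.13, 8.93, 5.72, 2.95, 1.87, 3.58, 10.07],
    "openai/whisper-large-v3-turbo": [10.20, 11.74, 11.78, 6.13, 2.95, 1.98, 3.94, 10.11],
    "distil-whisper/distil-large-v3": [8.93, 12.41, 7.72, 7.59, 3.25, 2.42, 5.11, 10.08],
    "distil-whisper/distil-large-v3.5": [8.65, 11.07, 7.54, 6.74, 2.86, 2.28, 4.94, 9.84],
    "bofenghuang/whisper-large-v3-distil-multi4-v0.2": [8.88, 11.33, 7.60, 6.97, 3.03, 2.51, 5.24, 10.12],
    "bofenghuang/whisper-large-v3-distil-multi7-v0.2": [9.36, 11.32, 7.65, 7.02, 2.99, 2.46, 5.24, 10.06],
}


def macro_average(scores):
    """Return the unweighted mean score per model."""
    return {model: mean(values) for model, values in scores.items()}


averages = macro_average(english_scores)
# Print models from lowest to highest average score.
for model, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{model}: {avg:.2f}")
```

The same helper can be reused for the other languages by swapping in the rows of the corresponding table.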

French

| Model | mcv17 | mls | voxpopuli | mtedx | af_accented | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 10.98 | 4.69 | 11.15 | 8.67 | 7.51 | 5.4 | 9.87 | 8.97 | 9 | 8.01 |
| openai/whisper-large-v3-turbo | 12.41 | 5.1 | 12.21 | 9.87 | 8.37 | 5.48 | 10.12 | 9 | 8.49 | 8.39 |
| bofenghuang/whisper_large_v3_distil_fr_v0.2 | 11.1 | 5 | 10.68 | 8.75 | 7.09 | 6.35 | 9.44 | 9.84 | 8.94 | 8.93 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 11.96 | 6.04 | 11.07 | 9.16 | 7.99 | 7.10 | 10.42 | 12.61 | 9.06 | 11.75 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 12.19 | 6.2 | 11.29 | 9.13 | 8.26 | 7.17 | 10.04 | 12.26 | 8.93 | 11.56 |

Spanish

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 4.91 | 3.97 | 11.06 | 6.52 | 4.22 | 10.85 | 10.36 | 5.90 | 5.22 |
| openai/whisper-large-v3-turbo | 5.74 | 4.41 | 16.02 | 6.66 | 4.59 | 11.55 | 10.68 | 6.46 | 5.41 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 5.58 | 4.34 | 8.52 | 7.43 | 5.20 | 11.26 | 13.43 | 5.69 | 8.95 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 5.70 | 4.35 | 8.55 | 7.56 | 5.15 | 11.45 | 13.54 | 5.84 | 8.27 |

German

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 6.11 | 5.60 | 17.75 | 19.63 | 5.92 | 11.21 | 10.35 | 17.64 | 17.76 |
| openai/whisper-large-v3-turbo | 7.45 | 6.43 | 20.48 | 20.00 | 6.45 | 10.57 | 9.70 | 18.04 | 18.37 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 7.31 | 6.45 | 12.41 | 21.48 | 8.20 | 11.04 | 13.55 | 19.54 | 21.76 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 7.57 | 6.67 | 12.42 | 21.95 | 8.28 | 11.21 | 13.84 | 19.90 | 21.67 |

Italian

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 5.71 | 9.58 | 28.45 | 7.21 | 4.28 | 6.95 | 6.37 | 6.83 | 7.28 |
| openai/whisper-large-v3-turbo | 6.77 | 10.64 | 30.69 | 7.41 | 4.69 | 6.88 | 6.52 | 7.98 | 7.37 |
| bofenghuang/whisper_large_v3_distil_it_v0.2 | 6.15 | 9.22 | 17.27 | 7.52 | 5.26 | 6.06 | 6.99 | 7.84 | 8.42 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.78 | 11.42 | 17.53 | 8.07 | 5.68 | 7.04 | 9.51 | 7.51 | 10.47 |

Portuguese

| Model | mcv17 | mls | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 6.76 | 7.04 | 8.91 | 5.86 | 12.11 | 12.39 | 8.70 | 8.98 |
| openai/whisper-large-v3-turbo | 7.66 | 6.64 | 8.84 | 6.11 | 12.42 | 11.62 | 10.97 | 9.04 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 8.31 | 6.75 | 10.11 | 7.10 | 12.74 | 14.97 | 9.64 | 11.78 |

Dutch

| Model | mcv17 | mls | voxpopuli | fleurs |
| --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 4.51 | 66.95 | 23.35 | 6.99 |
| openai/whisper-large-v3-turbo | 6.16 | 52.37 | 27.42 | 7.59 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.76 | 14.82 | 14.92 | 10.86 |

Model size: 756M params (Safetensors, tensor type F32)