---
language:
- en
datasets:
- mozilla-foundation/common_voice_13_0
- facebook/voxpopuli
- LIUM/tedlium
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: tbd
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean/other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 2.5
      name: Test WER (clean)
    - type: wer
      value: 5.6
      name: Test WER (other)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 12.1
      name: Test WER
---
# EBranchRegulaFormer
This is a 174M-parameter encoder-decoder E-Branchformer model trained with an intermediate regularization technique on 6,000 hours of open-source English data.
It achieves Word Error Rates (WERs) comparable to `openai/whisper-medium.en` across multiple datasets with just 1/4 of the parameters.
Architecture details, training hyperparameters, and a description of the proposed technique will be added soon.

**Disclaimer:** The model currently hallucinates on segments containing only silence, as it was not trained on such data. A fix will be added soon.
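Until that fix lands, one workaround is to filter out silent segments before they reach the model. A minimal energy-based sketch (the `0.01` RMS threshold and the chunking scheme are illustrative assumptions, not part of this model):

```python
import math

def is_silent(samples, threshold=0.01):
    """Return True if the RMS energy of `samples` falls below `threshold`.

    `samples` is a sequence of floats in [-1.0, 1.0] (one chunk of a decoded
    waveform); `threshold` is an arbitrary RMS cutoff, not a tuned value.
    """
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < threshold

# Transcribe only the chunks that carry signal, e.g.:
# texts = [pipe({"raw": chunk, "sampling_rate": 16000})["text"]
#          for chunk in chunks if not is_silent(chunk)]
```

A dedicated VAD model will segment speech more reliably than a fixed RMS threshold; this is only a stopgap.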
The model can be used with the `pipeline` class to transcribe audio files of arbitrary length.
```python
from transformers import pipeline

model_id = "BUT-FIT/EBranchRegulaFormer-medium"
pipe = pipeline("automatic-speech-recognition", model=model_id, feature_extractor=model_id, trust_remote_code=True)
# In newer versions of transformers (>4.31.0), there is a bug in the pipeline inference type.
# The warning can be ignored.
pipe.type = "seq2seq"

# Standard greedy decoding
result = pipe("audio.wav")

# Beam search decoding with joint CTC-autoregressive scorer
generation_config = pipe.model.generation_config
generation_config.ctc_weight = 0.3
generation_config.num_beams = 5
generation_config.ctc_margin = 0
result = pipe("audio.wav")
```
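The pipeline chunks long inputs internally, but you can also pre-segment the waveform yourself (for example, to drop the silent regions mentioned in the disclaimer above). A sketch that computes overlapping window boundaries; the 30 s window and 25 s hop are arbitrary assumptions, not recommended settings:

```python
def window_offsets(n_samples, sr=16000, window_s=30.0, hop_s=25.0):
    """Return (start, end) sample indices covering `n_samples` with
    overlapping windows: `window_s` seconds long, starting every `hop_s`
    seconds, so consecutive windows overlap by window_s - hop_s seconds.
    """
    win = int(window_s * sr)
    hop = int(hop_s * sr)
    offsets = []
    start = 0
    while start < n_samples:
        offsets.append((start, min(start + win, n_samples)))
        if start + win >= n_samples:
            break
        start += hop
    return offsets

# e.g. transcribe each window of a decoded waveform and join the texts:
# texts = [pipe({"raw": audio[s:e], "sampling_rate": 16000})["text"]
#          for s, e in window_offsets(len(audio))]
```

Feeding raw arrays as `{"raw": ..., "sampling_rate": ...}` is the standard ASR-pipeline input format; whether manual windowing outperforms the pipeline's internal chunking for this model has not been verified here.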