|
---
language:
- en
datasets:
- mozilla-foundation/common_voice_13_0
- facebook/voxpopuli
- LIUM/tedlium
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: tbd
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 2.5
      name: Test WER (clean)
    - type: wer
      value: 5.6
      name: Test WER (other)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 12.1
      name: Test WER
---
|
# EBranchRegulaFormer |
|
This is a **174M-parameter encoder-decoder E-Branchformer model** trained with an intermediate regularization technique on 6,000 hours of open-source English data.

It achieves Word Error Rates (WERs) comparable to `openai/whisper-medium.en` across multiple datasets with only a quarter of the parameters.
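WER here follows the standard definition: word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. A minimal, self-contained sketch (the `wer` function below is illustrative, not the scoring script used for this card):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[j] holds the edit distance from ref[:i] to hyp[:j], updated row by row
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev = d
        d = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            d[j] = min(substitution, prev[j] + 1, d[j - 1] + 1)
    return d[len(hyp)] / len(ref)

# One deleted word out of six reference words → WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```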
|
|
|
Architecture details, training hyperparameters, and a description of the proposed technique will be added soon. |
|
|
|
*Disclaimer: The model currently hallucinates on segments containing only silence, as it was not trained on such data. A fix will be added soon.*
|
|
|
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) |
|
class to transcribe audio files of arbitrary length. |
|
|
|
```python
from transformers import pipeline

model_id = "BUT-FIT/EBranchRegulaFormer-medium"
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    feature_extractor=model_id,
    trust_remote_code=True,
)
# In newer versions of transformers (>4.31.0), the pipeline mis-detects the
# inference type for this model; the resulting warning can be ignored.
pipe.type = "seq2seq"

# Standard greedy decoding
result = pipe("audio.wav")

# Beam search decoding with the joint CTC-autoregressive scorer
generation_config = pipe.model.generation_config
generation_config.ctc_weight = 0.3  # weight of the CTC score in the joint score
generation_config.num_beams = 5
generation_config.ctc_margin = 0
result = pipe("audio.wav")
```
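Conceptually, transcribing audio of arbitrary length works by slicing the waveform into fixed-size, overlapping chunks, transcribing each one, and stitching the results back together. The helper below is a simplified, hypothetical sketch of just the slicing step (`chunk_with_overlap` is not part of the transformers API):

```python
def chunk_with_overlap(samples, chunk_len, overlap):
    """Split a 1-D list of samples into chunks of `chunk_len` samples that
    overlap by `overlap` samples, so no audio is lost at chunk boundaries."""
    assert 0 <= overlap < chunk_len
    step = chunk_len - overlap
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break  # the last chunk already covers the end of the signal
    return chunks

# 10 samples, chunks of 4 with an overlap of 2 → 4 overlapping chunks
print(chunk_with_overlap(list(range(10)), chunk_len=4, overlap=2))
```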