|
---
language:
- en
datasets:
- mozilla-foundation/common_voice_13_0
- facebook/voxpopuli
- LIUM/tedlium
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: tbd
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 2.5
      name: Test WER (clean)
    - type: wer
      value: 5.6
      name: Test WER (other)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 12.1
      name: Test WER
---
|
# EBranchRegulaFormer |
|
This is a **174M-parameter encoder-decoder E-Branchformer model** trained with an intermediate regularization technique on 6,000 hours of open-source English data.

It achieves Word Error Rates (WERs) comparable to `openai/whisper-medium.en` across multiple datasets with only a quarter of the parameters.
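WER here follows the standard definition: word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. A minimal, self-contained sketch (the `wer` function below is illustrative, not the scoring script used for this card):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[j] holds the edit distance from ref[:i] to hyp[:j], updated row by row
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev = d
        d = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            d[j] = min(substitution, prev[j] + 1, d[j - 1] + 1)
    return d[len(hyp)] / len(ref)

# One deleted word out of six reference words → WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```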
|
|
|
Architecture details, training hyperparameters, and a description of the proposed technique will be added soon. |
|
|
|
*Disclaimer: The model currently hallucinates on segments containing only silence, as it was not trained on such data. A fix will be added soon.*
|
|
|
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) |
|
class to transcribe audio files of arbitrary length. |
|
|
|
```python
from transformers import pipeline

model_id = "BUT-FIT/EBranchRegulaFormer-medium"
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    feature_extractor=model_id,
    trust_remote_code=True,
)
# In newer versions of transformers (>4.31.0), the pipeline mis-detects the
# inference type for this model; the resulting warning can be ignored.
pipe.type = "seq2seq"

# Standard greedy decoding
result = pipe("audio.wav")

# Beam search decoding with the joint CTC-autoregressive scorer
generation_config = pipe.model.generation_config
generation_config.ctc_weight = 0.3  # weight of the CTC score in the joint score
generation_config.num_beams = 5
generation_config.ctc_margin = 0
result = pipe("audio.wav")
```
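Conceptually, transcribing audio of arbitrary length works by slicing the waveform into fixed-size, overlapping chunks, transcribing each one, and stitching the results back together. The helper below is a simplified, hypothetical sketch of just the slicing step (`chunk_with_overlap` is not part of the transformers API):

```python
def chunk_with_overlap(samples, chunk_len, overlap):
    """Split a 1-D list of samples into chunks of `chunk_len` samples that
    overlap by `overlap` samples, so no audio is lost at chunk boundaries."""
    assert 0 <= overlap < chunk_len
    step = chunk_len - overlap
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break  # the last chunk already covers the end of the signal
    return chunks

# 10 samples, chunks of 4 with an overlap of 2 → 4 overlapping chunks
print(chunk_with_overlap(list(range(10)), chunk_len=4, overlap=2))
```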