Yehor / w2v-bert-uk

	---
	base_model: facebook/w2v-bert-2.0
	datasets:
	- common_voice_10_0
	metrics:
	- wer
	model-index:
	- name: w2v-bert-2.0-uk
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: common_voice_10_0
	      type: common_voice_10_0
	      config: uk
	      split: test
	      args: uk
	    metrics:
	    - name: Wer
	      type: wer
	      value: 0.0655
	license: apache-2.0
	---

	🚨🚨🚨 ATTENTION! 🚨🚨🚨

	This model has been superseded. Use the updated model: https://huggingface.co/Yehor/w2v-bert-uk-v2.1

	---

	# w2v-bert-uk `v1`

	## Community

	- Discord: https://discord.gg/yVAjkBgmt4
	- Speech Recognition: https://t.me/speech_recognition_uk
	- Speech Synthesis: https://t.me/speech_synthesis_uk

	## Google Colab

	You can run this model using a Google Colab notebook: https://colab.research.google.com/drive/1QoKw2DWo5a5XYw870cfGE3dJf1WjZgrj?usp=sharing

	## Metrics

	- AM:
	  - WER: 0.0727
	  - CER: 0.0151
	  - Accuracy: 92.73%
	- AM + LM:
	  - WER: 0.0655
	  - CER: 0.0139
	  - Accuracy: 93.45%
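
The accuracy figures above are consistent with `accuracy = 1 - WER` (1 - 0.0727 = 92.73%, 1 - 0.0655 = 93.45%). As a rough illustration of how such numbers are computed, here is a minimal sketch of word and character error rates based on Levenshtein distance. The helper names (`edit_distance`, `wer`, `cer`) and the sample sentences are hypothetical, and this is not the evaluation script used for this model.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical reference/prediction pair (one wrong word out of five)
reference = 'привіт як у тебе справи'
hypothesis = 'привіт як у тебе справа'

print(f'WER: {wer(reference, hypothesis):.4f}')
print(f'CER: {cer(reference, hypothesis):.4f}')
print(f'Accuracy: {(1 - wer(reference, hypothesis)) * 100:.2f}%')
```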

	## Hyperparameters

	This model was trained on 2 RTX A4000 GPUs with the following hyperparameters:

	```
	torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
	--custom_set ~/cv10/train.csv \
	--custom_set_eval ~/cv10/test.csv \
	--num_train_epochs 15 \
	--tokenize_config . \
	--w2v2_bert_model facebook/w2v-bert-2.0 \
	--batch 4 \
	--num_proc 5 \
	--grad_accum 1 \
	--learning_rate 3e-5 \
	--logging_steps 20 \
	--eval_step 500 \
	--group_by_length \
	--attention_dropout 0.0 \
	--activation_dropout 0.05 \
	--feat_proj_dropout 0.05 \
	--feat_quantizer_dropout 0.0 \
	--hidden_dropout 0.05 \
	--layerdrop 0.0 \
	--final_dropout 0.0 \
	--mask_time_prob 0.0 \
	--mask_time_length 10 \
	--mask_feature_prob 0.0 \
	--mask_feature_length 10
	```
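
For reference, the effective batch size implied by this command can be worked out as below. This assumes `--batch` is the per-GPU batch size, which the source does not confirm; if it is a global batch size, the result is simply 4.

```python
# Values taken from the torchrun command above
nproc_per_node = 2    # --nproc-per-node (one process per GPU)
per_device_batch = 4  # --batch (assumed to be per GPU)
grad_accum = 1        # --grad_accum

# Samples contributing to each optimizer step under that assumption
effective_batch = nproc_per_node * per_device_batch * grad_accum
print(f'Effective batch size: {effective_batch}')
```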

	## Usage

	```python
	# pip install -U torch soundfile transformers

	import torch
	import soundfile as sf
	from transformers import AutoModelForCTC, Wav2Vec2BertProcessor

	# Config
	model_name = 'Yehor/w2v-bert-2.0-uk'
	device = 'cuda:1'  # second GPU; use 'cuda:0' or 'cpu' if it is not available
	sampling_rate = 16_000

	# Load the model
	asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
	processor = Wav2Vec2BertProcessor.from_pretrained(model_name)

	paths = [
	    'sample1.wav',
	]

	# Extract audio (files are assumed to be 16 kHz mono; sf.read does not resample)
	audio_inputs = []
	for path in paths:
	    audio_input, _ = sf.read(path)
	    audio_inputs.append(audio_input)

	# Transcribe the audio
	inputs = processor(audio_inputs, sampling_rate=sampling_rate).input_features
	features = torch.tensor(inputs).to(device)

	with torch.no_grad():
	    logits = asr_model(features).logits

	predicted_ids = torch.argmax(logits, dim=-1)
	predictions = processor.batch_decode(predicted_ids)

	# Log results
	print('Predictions:')
	print(predictions)
	```
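
The `argmax` + `batch_decode` step above is greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, then drop the blank token. Here is a minimal sketch of that idea in plain Python. The function name `ctc_greedy_decode`, the tiny vocabulary, and the blank id are hypothetical, and the real tokenizer also handles details such as word delimiters and special tokens.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame-level ids, then remove blanks."""
    # Merge runs of identical ids: [1, 1, 0, 2, 2] -> [1, 0, 2]
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Drop the blank token and map the remaining ids to characters
    return ''.join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical vocabulary; id 0 plays the role of the CTC blank
id_to_token = {0: '<pad>', 1: 'т', 2: 'а', 3: 'к'}

# Hypothetical per-frame argmax output
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

print(ctc_greedy_decode(frame_ids, id_to_token))
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters.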