---
license: mit
language:
- kbd
datasets:
- anzorq/kbd_speech
- anzorq/sixuxar_yijiri_mak7
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# Circassian (Kabardian) ASR Model

This is a fine-tuned model for Automatic Speech Recognition (ASR) in Kabardian (`kbd`), based on the `facebook/w2v-bert-2.0` model.

The model was trained on a combination of the `anzorq/kbd_speech` (filtered on `country=russia`) and `anzorq/sixuxar_yijiri_mak7` datasets.
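
A minimal sketch of how this training mix could be assembled with the `datasets` library (the split and column names here are assumptions, not taken from the actual training script):

```python
from datasets import load_dataset, concatenate_datasets

# Load both corpora from the Hugging Face Hub (split name is an assumption).
kbd_speech = load_dataset("anzorq/kbd_speech", split="train")
sixuxar = load_dataset("anzorq/sixuxar_yijiri_mak7", split="train")

# Keep only the kbd_speech recordings made in Russia, as described above
# (assumes a "country" metadata column).
kbd_speech = kbd_speech.filter(lambda ex: ex["country"] == "russia")

# Concatenation requires matching feature columns; in practice any extra
# columns would be dropped or renamed first.
train_data = concatenate_datasets([kbd_speech, sixuxar])
```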

## Model Details

- **Base Model**: facebook/w2v-bert-2.0
- **Language**: Kabardian
- **Task**: Automatic Speech Recognition (ASR)
- **Datasets**: anzorq/kbd_speech, anzorq/sixuxar_yijiri_mak7
- **Training Steps**: 4000

## Training

The model was fine-tuned using the following training arguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    group_by_length=True,            # batch samples of similar length to reduce padding
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size of 16
    evaluation_strategy="steps",
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=1000,
    eval_steps=500,
    logging_steps=300,
    learning_rate=5e-5,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=True,
    report_to="wandb",
)
```
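
For context, here is a sketch of how these arguments could plug into the standard `Trainer`-based CTC fine-tuning recipe for w2v-bert-2.0. The dataset variables, collator, and metric function below are placeholders, not the actual training code; the processor is assumed to be available from the published checkpoint:

```python
from transformers import Trainer, Wav2Vec2BertForCTC, Wav2Vec2BertProcessor

# Feature extractor + CTC tokenizer over the reduced (digraph-collapsed) vocabulary.
processor = Wav2Vec2BertProcessor.from_pretrained("anzorq/w2v-bert-2.0-kbd-v2")

model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

trainer = Trainer(
    model=model,
    args=training_args,               # the TrainingArguments shown above
    train_dataset=train_data,         # preprocessed audio + collapsed-text labels
    eval_dataset=eval_data,           # placeholder: a held-out split
    data_collator=data_collator,      # placeholder: a CTC padding collator
    tokenizer=processor.feature_extractor,
    compute_metrics=compute_metrics,  # placeholder: e.g. evaluate.load("wer")
)
trainer.train()
```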

## Performance

The model's performance during training:

| Step | Training Loss | Validation Loss | WER      |
|------|---------------|-----------------|----------|
| 500  | 2.761100      | 0.572304        | 0.830552 |
| 1000 | 0.325700      | 0.352516        | 0.678261 |
| 1500 | 0.247000      | 0.271146        | 0.377438 |
| 2000 | 0.179300      | 0.235156        | 0.319859 |
| 2500 | 0.176100      | 0.229383        | 0.293537 |
| 3000 | 0.171600      | 0.208033        | 0.310458 |
| 3500 | 0.133200      | 0.199517        | 0.289542 |
| 4000 | 0.117900      | 0.208304        | 0.258989 | <-- this model
| 4500 | 0.145400      | 0.184942        | 0.285311 |
| 5000 | 0.129600      | 0.195167        | 0.372033 |
| 5500 | 0.122600      | 0.203584        | 0.386369 |
| 6000 | 0.196800      | 0.270521        | 0.687662 |
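
WER here is the standard word error rate; a generic illustration with the `evaluate` library (an assumption about tooling, not the exact evaluation script):

```python
import evaluate

wer_metric = evaluate.load("wer")

# Toy example: one substituted word out of four reference words -> WER = 0.25.
score = wer_metric.compute(
    predictions=["one two tree four"],
    references=["one two three four"],
)
print(score)  # 0.25
```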

## Note

To optimize training and reduce the tokenizer's vocabulary size, the following digraphs and trigraphs in the training data were replaced with single characters prior to training (and `я` was expanded to `йа`):
```
гъ -> ɣ
дж -> j
дз -> ӡ
жь -> ʐ
кӏ -> қ
къ -> q
кхъ -> qҳ
лъ -> ɬ
лӏ -> ԯ
пӏ -> ԥ
тӏ -> ҭ
фӏ -> ჶ
хь -> h
хъ -> ҳ
цӏ -> ҵ
щӏ -> ɕ
я -> йа
```

After obtaining a transcription, the reverse replacements can be applied to restore the original orthography.
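
A minimal sketch of the forward normalization (the ordering below is an inference from the mapping itself: the trigraph `кхъ` must be replaced before its substrings `къ` and `хъ`):

```python
# Forward replacements applied to transcripts before training.
# кхъ is listed first so that the къ and хъ rules do not consume it.
replacements = [
    ("кхъ", "qҳ"),
    ("гъ", "ɣ"), ("дж", "j"), ("дз", "ӡ"), ("жь", "ʐ"),
    ("кӏ", "қ"), ("къ", "q"), ("лъ", "ɬ"), ("лӏ", "ԯ"),
    ("пӏ", "ԥ"), ("тӏ", "ҭ"), ("фӏ", "ჶ"), ("хь", "h"),
    ("хъ", "ҳ"), ("цӏ", "ҵ"), ("щӏ", "ɕ"), ("я", "йа"),
]

def replace_symbols(text: str) -> str:
    for orig, repl in replacements:
        text = text.replace(orig, repl)
    return text
```

Applying these replacements and then the reverse mapping from the Inference section below should round-trip any text, provided the multi-character sequences are restored first in both directions.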

## Inference

```python
import torchaudio
from transformers import pipeline

pipe = pipeline(model="anzorq/w2v-bert-2.0-kbd-v2", device=0)

# Reverse replacements, restoring the original orthography.
# Order matters: multi-character sequences (qҳ, йа) must be restored
# before their substrings (q, ҳ), so an ordered list is used.
reversed_replacements = [
    ('qҳ', 'кхъ'), ('йа', 'я'),
    ('ɣ', 'гъ'), ('j', 'дж'), ('ӡ', 'дз'), ('ʐ', 'жь'),
    ('қ', 'кӏ'), ('q', 'къ'), ('ɬ', 'лъ'), ('ԯ', 'лӏ'),
    ('ԥ', 'пӏ'), ('ҭ', 'тӏ'), ('ჶ', 'фӏ'), ('h', 'хь'),
    ('ҳ', 'хъ'), ('ҵ', 'цӏ'), ('ɕ', 'щӏ'),
]

def reverse_replace_symbols(text):
    for orig, replacement in reversed_replacements:
        text = text.replace(orig, replacement)
    return text

def transcribe_speech(audio_path):
    # Load the audio and resample it to the model's 16 kHz input rate.
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
    torchaudio.save("temp.wav", waveform, 16000)

    # Transcribe in 10-second chunks, then restore the original characters.
    transcription = pipe("temp.wav", chunk_length_s=10)['text']
    return reverse_replace_symbols(transcription)

audio_path = "audio.wav"
transcription = transcribe_speech(audio_path)
print(f"Transcription: {transcription}")
```
|