Sharif-wav2vec2 / README.md

Update README.md

876dff9 almost 2 years ago

4.38 kB

	---
	language: fa
	datasets:
	- common_voice_6_1
	tags:
	- audio
	- automatic-speech-recognition
	license: apache-2.0
	#widget:
	#- example_title: Librispeech sample 1
	# src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
	#- example_title: Librispeech sample 2
	# src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
	model-index:
	- name: Sharif-wav2vec2
	results:
	- task:
	name: Automatic Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: Common Voice Corpus 6.1 (clean)
	type: common_voice_6_1
	config: clean
	split: test
	args:
	language: fa
	metrics:
	- name: Test WER
	type: wer
	value: 6.0
	#- task:
	# name: Automatic Speech Recognition
	# type: automatic-speech-recognition
	# dataset:
	# name: LibriSpeech (other)
	# type: librispeech_asr
	# config: other
	# split: test
	# args:
	# language: en
	# metrics:
	# - name: Test WER
	# type: wer
	# value: 8.6
	---

	# Sharif-wav2vec2

	[Sharif-wav2vec2](https://huggingface.co/SLPL/Sharif-wav2vec2/)

	The base model fine-tuned on 108 hours of Commonvoice on 16kHz sampled speech audio. When using the model
	make sure that your speech input is also sampled at 16Khz.

	#[Paper](https://arxiv.org/abs/2006.11477)

	#Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

	#Abstract

	#We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can #outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and #solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all #labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec #2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of #labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech #recognition with limited amounts of labeled data.

	The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.


	# Usage

	To transcribe Persian audio files the model can be used as a standalone acoustic model as follows:

	```python
	from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
	from datasets import load_dataset
	import torch

	# load model and tokenizer
	processor = Wav2Vec2Processor.from_pretrained("SLPL/Sharif-wav2vec2")
	model = Wav2Vec2ForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

	# load dummy dataset and read soundfiles
	# ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

	# tokenize
	input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values # Batch size 1

	# retrieve logits
	logits = model(input_values).logits

	# take argmax and decode
	predicted_ids = torch.argmax(logits, dim=-1)
	transcription = processor.batch_decode(predicted_ids)
	```

	## Evaluation

	This code snippet shows how to evaluate facebook/wav2vec2-base-960h on LibriSpeech's "clean" and "other" test data.

	```python
	from datasets import load_dataset
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
	import torch
	from jiwer import wer


	librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

	model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
	processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

	def map_to_pred(batch):
	input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
	with torch.no_grad():
	logits = model(input_values.to("cuda")).logits

	predicted_ids = torch.argmax(logits, dim=-1)
	transcription = processor.batch_decode(predicted_ids)
	batch["transcription"] = transcription
	return batch

	result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

	print("WER:", wer(result["text"], result["transcription"]))
	```

	Result (WER):

	\| "clean" \| "other" \|
	\|---\|---\|
	\| 3.4 \| 8.6 \|