---
language: en
datasets:
- librispeech_asr
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: ccc-wav2vec2-360h-base-100h
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 10.8
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 27.7
---
# ccc-Wav2Vec2-360h-Base-100h

This is the base model, pretrained on 360 hours of LibriSpeech and fine-tuned on 100 hours of LibriSpeech, using 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz.
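If your recordings are at a different sampling rate, resample them before passing them to the processor. A minimal sketch, assuming `torchaudio` is installed and a hypothetical local file `speech.wav`:

```python
import torchaudio

# load a (hypothetical) local recording and resample it to 16 kHz if needed
waveform, sample_rate = torchaudio.load("speech.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16_000)
```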
[Paper](https://arxiv.org/abs/2210.02592)

Authors: Vasista Sai Lodagala, Sreyan Ghosh, S. Umesh

**Abstract**

While Self-Supervised Learning has helped reap the benefit of the scale from the available unlabeled data, the learning paradigms are continuously being bettered. We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of those negative examples that are highly similar to the positive. The Cross-Contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation and vice-versa, bringing robustness to the pre-training strategy. ccc-wav2vec 2.0 achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the test-clean and test-other sets, respectively, of LibriSpeech, without the use of any language model. The proposed method also achieves up to 14.9% relative WER improvement over the baseline wav2vec 2.0 when fine-tuned on Switchboard data.

GitHub Page: https://github.com/speech-lab-iitm/ccc-wav2vec-2.0
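For intuition only, the sketch below shows how a cross-contrastive term of this kind can be written in PyTorch. It is a simplified illustration, not the authors' implementation: `encoder_out_orig`, `quantized_orig`, `encoder_out_aug`, and `quantized_aug` are assumed, pre-computed encoder and quantizer outputs for the original sample and its augmentation, negatives are drawn uniformly from other time steps, and the clustering-based down-weighting of near-positive negatives is omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_term(context, targets, num_negatives=10, temperature=0.1):
    """InfoNCE-style loss: each context vector should be closer to the target at the
    same time step than to targets sampled from other time steps (collisions with
    the positive are ignored for simplicity)."""
    T, _ = context.shape
    neg_idx = torch.randint(0, T, (T, num_negatives))                         # negative time steps per position
    candidates = torch.cat([targets.unsqueeze(1), targets[neg_idx]], dim=1)   # (T, 1 + K, D), positive at index 0
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    return F.cross_entropy(sims, torch.zeros(T, dtype=torch.long))

# hypothetical pre-computed outputs: (time steps, feature dim)
T, D = 50, 256
encoder_out_orig, quantized_orig = torch.randn(T, D), torch.randn(T, D)
encoder_out_aug, quantized_aug = torch.randn(T, D), torch.randn(T, D)

# cross-contrastive loss: original encoder vs. augmented quantizer, and vice-versa
loss = contrastive_term(encoder_out_orig, quantized_aug) + contrastive_term(encoder_out_aug, quantized_orig)
```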
# Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")
model = Wav2Vec2ForCTC.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")

# load dummy dataset and read sound files
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# preprocess: turn the raw 16 kHz waveform into model inputs
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradient tracking needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
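For a single local recording, an equivalent sketch (assuming the `soundfile` package and a hypothetical 16 kHz mono file `speech.wav`):

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")
model = Wav2Vec2ForCTC.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")

# read a (hypothetical) 16 kHz mono recording
speech, sampling_rate = sf.read("speech.wav")

input_values = processor(speech, sampling_rate=sampling_rate, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits

transcription = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcription)
```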
## Evaluation

This code snippet shows how to evaluate **vasista22/ccc-wav2vec2-360h-base-100h** on LibriSpeech's "clean" and "other" test data.
```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer


librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")

def map_to_pred(batch):
    # with batched=True, batch["audio"] is a list of decoded audio dicts
    input_values = processor([audio["array"] for audio in batch["audio"]], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    batch["transcription"] = processor.batch_decode(predicted_ids)
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```
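The snippet above evaluates the "clean" test set; to reproduce the test-other number, load the "other" config instead and keep the rest of the script unchanged:

```python
from datasets import load_dataset

librispeech_eval = load_dataset("librispeech_asr", "other", split="test")
```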
*Result (WER)*:

| "clean" | "other" |
|---|---|
| 10.8 | 27.7 |