---
language: en
datasets:
- librispeech_asr
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
- example_title: Librispeech sample 2
model-index:
- name: ccc-wav2vec2-360h-base-100h
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 10.8
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 27.7
---
# ccc-Wav2Vec2-360h-Base-100h
This is the base model, pretrained on 360 hours of LibriSpeech and fine-tuned on 100 hours of LibriSpeech, using speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
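
A quick way to guard against a sampling-rate mismatch is to read the rate from the file header before running inference. The sketch below uses Python's standard `wave` module, so it only covers uncompressed WAV files. The helper names `wav_sample_rate` and `is_16khz` and the constant `EXPECTED_RATE` are illustrative and not part of the model's API. Audio stored at another rate would need to be resampled first, for example with librosa or torchaudio.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path: str) -> bool:
    """Return True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE
```

Calling `is_16khz("my_recording.wav")` (a hypothetical file name) before transcription makes a mismatch visible early instead of silently degrading the output.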
Authors: Vasista Sai Lodagala, Sreyan Ghosh, S. Umesh
While Self-Supervised Learning has helped reap the benefit of the scale from the available unlabeled data, the learning paradigms are continuously being bettered. We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of those negative examples that are highly similar to the positive. The Cross-Contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation and vice-versa, bringing robustness to the pre-training strategy. ccc-wav2vec 2.0 achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the test-clean and test-other sets, respectively, of LibriSpeech, without the use of any language model. The proposed method also achieves up to 14.9% relative WER improvement over the baseline wav2vec 2.0 when fine-tuned on Switchboard data.
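
The "relative WER improvement" figures quoted above express the reduction in error as a share of the baseline error. A minimal sketch of that arithmetic follows. The function name and the input numbers are hypothetical and chosen only to illustrate the formula; they are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the share of the baseline WER that was removed, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, for illustration only: a WER drop from 20.0 to 17.0.
improvement = relative_wer_improvement(20.0, 17.0)
print(f"{improvement:.1f}% relative improvement")
```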
GitHub Page:
# Usage
To transcribe audio files, the model can be used as a standalone acoustic model as follows:

    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
    from datasets import load_dataset
    import torch

    # load model and processor
    processor = Wav2Vec2Processor.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")
    model = Wav2Vec2ForCTC.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")

    # load dummy dataset and read soundfiles
    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

    # extract input features (the model expects 16 kHz audio)
    input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

    # retrieve logits
    logits = model(input_values).logits

    # take argmax and decode
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)

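
The argmax-and-decode step at the end is greedy CTC decoding: pick the most likely token for each frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates this on plain Python lists. The toy vocabulary, the `<pad>` blank symbol, and the `|` word delimiter follow common wav2vec 2.0 conventions but are assumptions here, not values read from this model's tokenizer.

```python
from itertools import groupby

# Toy vocabulary (an assumption for illustration, not this model's actual vocab).
VOCAB = ["<pad>", "|", "A", "C", "T"]
BLANK_ID = 0          # CTC blank, assumed to be the pad token
WORD_DELIMITER = "|"  # assumed word separator


def ctc_greedy_decode(frame_ids: list[int]) -> str:
    """Collapse repeated frame predictions, remove blanks, and join tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids for a short toy utterance.
frames = [3, 3, 0, 2, 2, 2, 4, 0, 1, 2, 0, 4, 4]
print(ctc_greedy_decode(frames))  # prints "CAT AT"
```

Because repeats are collapsed before blanks are removed, a blank between two identical tokens is what lets a doubled letter survive decoding.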
## Evaluation
This code snippet shows how to evaluate **vasista22/ccc-wav2vec2-360h-base-100h** on LibriSpeech's "clean" and "other" test data. It loads the "clean" configuration; pass "other" as the configuration name to evaluate on the "other" test set.

    from datasets import load_dataset
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    import torch
    from jiwer import wer

    # use "other" instead of "clean" to evaluate on the "other" test set
    librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

    model = Wav2Vec2ForCTC.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h").to("cuda")
    processor = Wav2Vec2Processor.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")

    def map_to_pred(batch):
        # with batched=True, batch["audio"] is a list of decoded audio dicts
        arrays = [audio["array"] for audio in batch["audio"]]
        input_values = processor(arrays, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
        with torch.no_grad():
            logits = model(input_values.to("cuda")).logits

        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids)
        batch["transcription"] = transcription
        return batch

    result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

    # jiwer returns WER as a fraction; multiply by 100 to compare with the table below
    print("WER:", wer(result["text"], result["transcription"]))

*Result (WER, %)*:

| "clean" | "other" |
|---|---|
| 10.8 | 27.7 |
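
For reference, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The pure-Python sketch below follows that definition for a list of sentence pairs. It is an illustration, not the `jiwer` implementation, and the function names are hypothetical. It returns a fraction, so multiply by 100 to compare with the percentages reported above.

```python
def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Minimum number of word substitutions, deletions, and insertions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Total word errors divided by total reference words, as a fraction."""
    errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return errors / total_words


# Toy sentences, for illustration only.
print(corpus_wer(["THE CAT SAT"], ["THE CAT SIT"]))
```

Summing errors and reference words over the whole test set before dividing weights long utterances more than short ones, which is the usual convention for corpus-level WER.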