---
language: en
datasets:
- librispeech_asr
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: ccc-wav2vec2-360h-base-100h
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 10.8
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 27.7
---
# ccc-Wav2Vec2-360h-Base-100h

This is the base model, pretrained on 360 hours of LibriSpeech and fine-tuned on 100 hours of LibriSpeech, using 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz.
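If your recordings are at a different sampling rate, resample them before passing them to the processor. A minimal sketch, assuming `torchaudio` is installed and a hypothetical local file `speech.wav`:

```python
import torchaudio

# load a (hypothetical) local recording and resample it to 16 kHz if needed
waveform, sample_rate = torchaudio.load("speech.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16_000)
```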
[Paper](https://arxiv.org/abs/2210.02592)

Authors: Vasista Sai Lodagala, Sreyan Ghosh, S. Umesh

**Abstract**

While Self-Supervised Learning has helped reap the benefit of the scale from the available unlabeled data, the learning paradigms are continuously being bettered. We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of those negative examples that are highly similar to the positive. The Cross-Contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation and vice-versa, bringing robustness to the pre-training strategy. ccc-wav2vec 2.0 achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the test-clean and test-other sets, respectively, of LibriSpeech, without the use of any language model. The proposed method also achieves up to 14.9% relative WER improvement over the baseline wav2vec 2.0 when fine-tuned on Switchboard data.

GitHub Page: https://github.com/speech-lab-iitm/ccc-wav2vec-2.0
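For intuition only, the sketch below shows how a cross-contrastive term of this kind can be written in PyTorch. It is a simplified illustration, not the authors' implementation: `encoder_out_orig`, `quantized_orig`, `encoder_out_aug`, and `quantized_aug` are assumed, pre-computed encoder and quantizer outputs for the original sample and its augmentation, negatives are drawn uniformly from other time steps, and the clustering-based down-weighting of near-positive negatives is omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_term(context, targets, num_negatives=10, temperature=0.1):
    """InfoNCE-style loss: each context vector should be closer to the target at the
    same time step than to targets sampled from other time steps (collisions with
    the positive are ignored for simplicity)."""
    T, _ = context.shape
    neg_idx = torch.randint(0, T, (T, num_negatives))                         # negative time steps per position
    candidates = torch.cat([targets.unsqueeze(1), targets[neg_idx]], dim=1)   # (T, 1 + K, D), positive at index 0
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    return F.cross_entropy(sims, torch.zeros(T, dtype=torch.long))

# hypothetical pre-computed outputs: (time steps, feature dim)
T, D = 50, 256
encoder_out_orig, quantized_orig = torch.randn(T, D), torch.randn(T, D)
encoder_out_aug, quantized_aug = torch.randn(T, D), torch.randn(T, D)

# cross-contrastive loss: original encoder vs. augmented quantizer, and vice-versa
loss = contrastive_term(encoder_out_orig, quantized_aug) + contrastive_term(encoder_out_aug, quantized_orig)
```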
# Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")
model = Wav2Vec2ForCTC.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")

# load dummy dataset and read sound files
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# preprocess: turn the raw 16 kHz waveform into model inputs
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradient tracking needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
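For a single local recording, an equivalent sketch (assuming the `soundfile` package and a hypothetical 16 kHz mono file `speech.wav`):

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")
model = Wav2Vec2ForCTC.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")

# read a (hypothetical) 16 kHz mono recording
speech, sampling_rate = sf.read("speech.wav")

input_values = processor(speech, sampling_rate=sampling_rate, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits

transcription = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcription)
```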
## Evaluation

This code snippet shows how to evaluate **vasista22/ccc-wav2vec2-360h-base-100h** on LibriSpeech's "clean" and "other" test data.
```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer


librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("vasista22/ccc-wav2vec2-360h-base-100h")

def map_to_pred(batch):
    # with batched=True, batch["audio"] is a list of decoded audio dicts
    input_values = processor([audio["array"] for audio in batch["audio"]], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    batch["transcription"] = processor.batch_decode(predicted_ids)
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```
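The snippet above evaluates the "clean" test set; to reproduce the test-other number, load the "other" config instead and keep the rest of the script unchanged:

```python
from datasets import load_dataset

librispeech_eval = load_dataset("librispeech_asr", "other", split="test")
```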
*Result (WER)*:

| "clean" | "other" |
|---|---|
| 10.8 | 27.7 |