---
language: en
datasets:
- LIUM/tedlium
tags:
- speech
license: apache-2.0
---

# Wav2Vec2-Large-Tedlium

The Wav2Vec2 large model fine-tuned on the TEDLIUM corpus. The model is initialised with Facebook's [Wav2Vec2 large LV-60k](https://huggingface.co/facebook/wav2vec2-large-lv60) checkpoint, pre-trained on 60,000 hours of audiobooks from the LibriVox project. It is fine-tuned on 452 hours of TED talks from the [TEDLIUM](https://huggingface.co/datasets/LIUM/tedlium) corpus (Release 3). When using the model, make sure that your speech input is sampled at 16kHz.

The model achieves a word error rate (WER) of 8.4% on the dev set and 8.2% on the test set. [Training logs](https://wandb.ai/sanchit-gandhi/tedlium/runs/10c85yc4?workspace=user-sanchit-gandhi) document the training and evaluation progress over 50k steps of fine-tuning. See [this notebook](https://colab.research.google.com/drive/1FjTsqbYKphl9kL-eILgUc-bl4zVThL8F?usp=sharing) for more information on how this model was fine-tuned.

## Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")
model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

# load dummy dataset
ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")

# process audio inputs: the processor expects the raw waveform and its sampling rate
input_values = processor(
    ds[0]["audio"]["array"],
    sampling_rate=ds[0]["audio"]["sampling_rate"],
    return_tensors="pt",
    padding="longest",
).input_values  # Batch size 1

# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# take argmax over the vocabulary and decode the CTC output
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Target: ", ds[0]["text"])
print("Transcription: ", transcription[0])
```

## Evaluation

This code snippet shows how to evaluate **Wav2Vec2-Large-Tedlium** on the TEDLIUM test data.

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")

model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

def map_to_pred(batch):
    audio = batch["audio"]
    input_values = processor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_tensors="pt",
        padding="longest",
    ).input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list; the map here processes one example at a time
    batch["transcription"] = processor.batch_decode(predicted_ids)[0]
    return batch

result = tedlium_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```
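
Both snippets pass the raw waveform straight to the processor and assume the audio is already at the 16kHz rate the model was trained on (which holds for the TEDLIUM corpus). For audio stored at a different rate, the `datasets` library can resample on the fly; a minimal sketch using its standard `Audio` feature, with the same TEDLIUM split as above:

```python
from datasets import load_dataset, Audio

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")

# re-cast the audio column so that examples are decoded and resampled
# to 16kHz on the fly whenever they are accessed
tedlium_eval = tedlium_eval.cast_column("audio", Audio(sampling_rate=16_000))
```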
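
To transcribe a local recording rather than a dataset example, the waveform can be loaded and resampled with `torchaudio` before being passed to the processor. A sketch under the same checkpoint; the file path `talk.wav` is a placeholder and the recording is assumed to be mono:

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")
model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

# load the recording and resample to the 16kHz rate the model expects
waveform, sample_rate = torchaudio.load("talk.wav")  # placeholder path
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

input_values = processor(
    waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt"
).input_values

with torch.no_grad():
    logits = model(input_values).logits

transcription = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcription)
```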