---
language: ka
datasets:
  - common_voice
tags:
  - audio
  - automatic-speech-recognition
  - speech
  - xlsr-fine-tuning-week
license: apache-2.0
widget:
  - example_title: Common Voice sample 566
    src: https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-georgian/resolve/main/sample566.flac
  - example_title: Common Voice sample 95
    src: https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-georgian/resolve/main/sample95.flac
model-index:
  - name: XLSR Wav2Vec2 Georgian by Mehrdad Farahani
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice ka
          type: common_voice
          args: ka
        metrics:
          - name: Test WER
            type: wer
            value: 43.86
---

# Wav2Vec2-Large-XLSR-53-Georgian

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Georgian using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset. When using this model, make sure that your speech input is sampled at 16 kHz.
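
If your audio is stored at a different sampling rate, resample it to 16 kHz before inference. A minimal sketch using `librosa` (the file path below is a placeholder):

```python
import librosa

# librosa.load resamples on the fly; "my_audio.wav" is a placeholder path.
speech, sampling_rate = librosa.load("my_audio.wav", sr=16_000)
```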

## Usage

The model can be used directly (without a language model) as follows:

### Requirements

```bash
# requirement packages
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
!pip install jiwer
```

### Normalizer

```bash
!wget -O normalizer.py https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-lithuanian/raw/main/normalizer.py
```
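
The normalizer is applied per example via `datasets.map` below, so it takes a batch dict with a `sentence` field plus keyword options. A hedged standalone sketch (the signature is inferred from the `map` call below, so treat it as an assumption):

```python
from normalizer import normalizer

# Assumed usage: the function reads and rewrites batch["sentence"];
# remove_extra_space collapses repeated whitespace.
batch = {"sentence": "แƒžแƒ แƒ”แƒ–แƒ˜แƒ“แƒ”แƒœแƒขแƒแƒ‘แƒ˜แƒกแƒแƒก   แƒ‘แƒฃแƒจแƒ˜"}
print(normalizer(batch, remove_extra_space=True)["sentence"])
```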

### Prediction

```python
import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset

import numpy as np

from normalizer import normalizer


def speech_file_to_array_fn(batch):
    # Load the clip and resample it to the 16 kHz rate the model expects.
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16_000)

    batch["speech"] = speech_array
    return batch


def predict(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    # Greedy CTC decoding: take the most likely token at each frame.
    pred_ids = torch.argmax(logits, dim=-1)

    batch["predicted"] = processor.batch_decode(pred_ids)[0]
    return batch


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-georgian")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-georgian").to(device)

dataset = load_dataset("common_voice", "ka", split="test[:1%]")
dataset = dataset.map(
    normalizer,
    fn_kwargs={"remove_extra_space": True},
    remove_columns=list(set(dataset.column_names) - set(["sentence", "path"]))
)

dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict)

# Print a random sample of reference/prediction pairs.
sample_indices = np.random.randint(0, len(result), 20).tolist()
for i in sample_indices:
    reference, predicted = result["sentence"][i], result["predicted"][i]
    print("reference:", reference)
    print("predicted:", predicted)
    print("---")
```

Output:

reference: แƒžแƒ แƒ”แƒ–แƒ˜แƒ“แƒ”แƒœแƒขแƒแƒ‘แƒ˜แƒกแƒแƒก แƒ‘แƒฃแƒจแƒ˜ แƒกแƒแƒฅแƒแƒ แƒ—แƒ•แƒ”แƒšแƒแƒก แƒ“แƒ แƒฃแƒ™แƒ แƒแƒ˜แƒœแƒ˜แƒก แƒ“แƒ”แƒ›แƒแƒ™แƒ แƒแƒขแƒ˜แƒฃแƒš แƒ›แƒแƒซแƒ แƒแƒแƒ‘แƒ”แƒ‘แƒ˜แƒก แƒ“แƒ แƒœแƒแƒขแƒแƒจแƒ˜ แƒ’แƒแƒฌแƒ”แƒ•แƒ แƒ˜แƒแƒœแƒ”แƒ‘แƒ˜แƒก แƒแƒฅแƒขแƒ˜แƒฃแƒ แƒ˜ แƒ›แƒฎแƒแƒ แƒ“แƒแƒ›แƒญแƒ”แƒ แƒ˜ แƒ˜แƒงแƒ
predicted: แƒžแƒ แƒ”แƒ–แƒ˜แƒ“แƒ”แƒœแƒขแƒ แƒ•แƒ˜แƒกแƒแƒก แƒ‘แƒฃแƒจแƒ˜ แƒกแƒแƒฅแƒแƒ แƒ—แƒ•แƒ”แƒšแƒแƒก แƒ“แƒ แƒฃแƒ™แƒ แƒแƒ˜แƒœแƒ˜แƒก แƒ“แƒ”แƒ›แƒแƒ™แƒ แƒแƒขแƒ˜แƒฃแƒš แƒ›แƒแƒซแƒ แƒแƒแƒ‘แƒ”แƒ‘แƒ˜แƒก แƒ“แƒ แƒœแƒแƒขแƒ˜แƒจแƒ˜ แƒ“แƒแƒฌแƒ”แƒ•แƒ แƒ˜แƒแƒœแƒ”แƒ‘แƒ˜แƒก แƒแƒฅแƒขแƒ˜แƒฃแƒ แƒ˜ แƒ›แƒฎแƒแƒ แƒ“แƒแƒ›แƒญแƒ”แƒ แƒ˜ แƒ˜แƒงแƒ
---
reference: แƒจแƒ”แƒกแƒแƒซแƒšแƒ”แƒ‘แƒ”แƒšแƒ˜แƒ แƒ›แƒ˜แƒกแƒ˜ แƒ“แƒแƒ›แƒแƒœแƒ”แƒ‘แƒ แƒ“แƒ แƒ›แƒกแƒแƒฎแƒฃแƒ  แƒ“แƒ”แƒ›แƒแƒœแƒแƒ“ แƒ’แƒแƒ“แƒแƒฅแƒชแƒ”แƒ•แƒ
predicted: แƒจแƒ”แƒกแƒแƒซแƒšแƒ”แƒ‘แƒ”แƒšแƒ˜แƒ แƒ›แƒ˜แƒกแƒ˜ แƒ“แƒแƒ›แƒแƒœแƒ”แƒ‘แƒแƒ— แƒ“แƒ แƒ›แƒกแƒแƒฎแƒฃแƒ แƒ“แƒ”แƒ›แƒแƒœแƒแƒ“ แƒ’แƒแƒ“แƒแƒฅแƒชแƒ”แƒ•แƒ
---
reference: แƒ”แƒก แƒ’แƒแƒ›แƒแƒกแƒแƒฎแƒฃแƒšแƒ”แƒ‘แƒ”แƒ‘แƒ˜ แƒแƒฆแƒ‘แƒ”แƒญแƒ“แƒ˜แƒšแƒ˜ แƒ˜แƒงแƒ แƒ›แƒแƒกแƒ™แƒแƒ•แƒ˜แƒก แƒ“แƒ˜แƒ“แƒ˜ แƒ›แƒ—แƒแƒ•แƒ แƒ”แƒ‘แƒ˜แƒกแƒ แƒ“แƒ แƒ›แƒ”แƒคแƒ”แƒ”แƒ‘แƒ˜แƒก แƒ‘แƒ”แƒญแƒ“แƒ”แƒ‘แƒ–แƒ”
predicted: แƒ”แƒก แƒ’แƒแƒ›แƒแƒกแƒแƒฎแƒฃแƒšแƒ”แƒ‘แƒ”แƒ‘แƒ˜ แƒแƒฆแƒ‘แƒ”แƒญแƒ“แƒ˜แƒšแƒ˜ แƒ˜แƒงแƒ แƒ›แƒแƒกแƒ™แƒแƒ•แƒ˜แƒก แƒ“แƒ˜แƒ“แƒ˜ แƒ›แƒ—แƒแƒ•แƒ แƒ”แƒ‘แƒ˜แƒกแƒ แƒ“แƒ แƒ›แƒ”แƒคแƒ”แƒ”แƒ‘แƒ˜แƒก แƒ‘แƒ”แƒญแƒ“แƒ”แƒ‘แƒ–แƒ”
---
reference: แƒฏแƒแƒšแƒ˜แƒ› แƒแƒฅแƒ แƒแƒก แƒ’แƒšแƒแƒ‘แƒฃแƒกแƒ˜แƒกแƒ แƒ“แƒ แƒ™แƒ˜แƒœแƒแƒ›แƒกแƒแƒฎแƒ˜แƒแƒ‘แƒ—แƒ แƒ’แƒ˜แƒšแƒ“แƒ˜แƒ˜แƒก แƒœแƒแƒ›แƒ˜แƒœแƒแƒชแƒ˜แƒ”แƒ‘แƒ˜ แƒ›แƒ˜แƒ˜แƒฆแƒ
predicted: แƒฏแƒแƒšแƒ˜ แƒ›แƒแƒฅแƒ แƒแƒก แƒ’แƒšแƒแƒ‘แƒฃแƒกแƒ˜แƒกแƒ แƒ“แƒ แƒ™แƒ˜แƒœแƒแƒ›แƒกแƒแƒฎแƒ˜แƒแƒ‘แƒ—แƒ แƒ’แƒ˜แƒšแƒ“แƒ˜แƒ˜แƒก แƒœแƒแƒ›แƒ˜แƒœแƒแƒชแƒ˜แƒ”แƒ‘แƒ˜ แƒ›แƒ˜แƒ˜แƒฆแƒ
---
reference: แƒจแƒ”แƒ›แƒ“แƒ’แƒแƒ›แƒจแƒ˜ แƒกแƒแƒฅแƒแƒšแƒแƒฅแƒ แƒ‘แƒ˜แƒ‘แƒšแƒ˜แƒแƒ—แƒ”แƒ™แƒ แƒกแƒแƒ แƒแƒ˜แƒแƒœแƒ แƒ‘แƒ˜แƒ‘แƒšแƒ˜แƒแƒ—แƒ”แƒ™แƒแƒ“ แƒ’แƒแƒ“แƒแƒ™แƒ”แƒ—แƒ“แƒ แƒ’แƒแƒ˜แƒ–แƒแƒ แƒ“แƒ แƒฌแƒ˜แƒ’แƒœแƒแƒ“แƒ˜ แƒคแƒแƒœแƒ“แƒ˜
predicted: แƒจแƒ”แƒ›แƒ“แƒฆแƒแƒ›แƒจแƒ˜ แƒกแƒแƒฅแƒแƒšแƒแƒฅแƒ แƒ‘แƒ˜แƒ‘แƒšแƒ˜แƒแƒ—แƒ”แƒ™แƒ แƒกแƒแƒ แƒแƒ˜แƒแƒœแƒ แƒ‘แƒ˜แƒ‘แƒšแƒ˜แƒแƒ—แƒ”แƒ™แƒแƒ“ แƒ’แƒแƒ“แƒแƒ™แƒ”แƒ—แƒ แƒ’แƒแƒ˜แƒ–แƒแƒ แƒ“แƒ แƒฌแƒ˜แƒ’แƒœแƒแƒ“แƒ˜ แƒคแƒแƒ•แƒ“แƒ˜
---
reference: แƒแƒ‘แƒ แƒแƒ›แƒกแƒ˜ แƒ“แƒแƒฃแƒ™แƒแƒ•แƒจแƒ˜แƒ แƒ“แƒ แƒ›แƒ˜แƒ แƒแƒœแƒ“แƒแƒก แƒ“แƒ แƒแƒ แƒ˜ แƒ—แƒ•แƒ˜แƒก แƒ’แƒแƒœแƒ›แƒแƒ•แƒšแƒแƒ‘แƒแƒจแƒ˜ แƒ˜แƒกแƒ˜แƒœแƒ˜ แƒ›แƒฃแƒจแƒแƒแƒ‘แƒ“แƒœแƒ”แƒœ แƒแƒฆแƒœแƒ˜แƒจแƒœแƒฃแƒšแƒ˜ แƒกแƒชแƒ”แƒœแƒ˜แƒก แƒ—แƒแƒœแƒ›แƒฎแƒšแƒ”แƒ‘ แƒ›แƒ”แƒšแƒแƒ“แƒ˜แƒแƒ–แƒ”
predicted: แƒแƒ‘แƒ แƒแƒ›แƒจแƒ˜ แƒ“แƒ แƒฃแƒ™แƒแƒ•แƒจแƒ˜แƒ แƒ“แƒ แƒ›แƒ˜แƒ แƒแƒœแƒ“แƒ”แƒก แƒ“แƒ แƒแƒ แƒ˜แƒ—แƒ•แƒ˜แƒก แƒ’แƒแƒœแƒ›แƒแƒ•แƒšแƒแƒ‘แƒแƒจแƒ˜ แƒ˜แƒกแƒ˜แƒœแƒ˜ แƒ›แƒฃแƒจแƒแƒแƒ‘แƒ“แƒœแƒ”แƒœแƒ แƒแƒฆแƒœแƒ˜แƒจแƒœแƒฃแƒšแƒ˜แƒก แƒฉแƒ”แƒœแƒ˜แƒก แƒ›แƒ—แƒแƒ›แƒฎแƒšแƒ”แƒ•แƒ˜แƒ— แƒ›แƒ”แƒšแƒแƒ“แƒ˜แƒแƒจแƒ˜
---
reference: แƒแƒ›แƒŸแƒแƒ›แƒแƒ“ แƒ—แƒ”แƒ›แƒ—แƒ แƒžแƒแƒšแƒแƒขแƒ˜แƒก แƒแƒžแƒแƒ–แƒ˜แƒชแƒ˜แƒ˜แƒก แƒšแƒ˜แƒ“แƒ”แƒ แƒ˜แƒ แƒšแƒ”แƒ˜แƒ‘แƒแƒ แƒ˜แƒกแƒขแƒฃแƒšแƒ˜ แƒžแƒแƒ แƒขแƒ˜แƒ˜แƒก แƒšแƒ˜แƒ“แƒ”แƒ แƒ˜ แƒฏแƒ”แƒ แƒ”แƒ›แƒ˜ แƒ™แƒแƒ แƒ‘แƒ˜แƒœแƒ˜
predicted: แƒแƒ›แƒŸแƒแƒ›แƒแƒ“ แƒ—แƒ”แƒ›แƒ—แƒ แƒžแƒแƒšแƒแƒขแƒ˜แƒก แƒแƒžแƒแƒ–แƒ˜แƒชแƒ˜แƒ˜แƒก แƒšแƒ˜แƒ“แƒ”แƒ แƒ˜แƒ แƒšแƒ”แƒ˜แƒ‘แƒฃแƒ แƒ˜แƒกแƒขแƒฃแƒšแƒ˜ แƒžแƒแƒ แƒขแƒ˜แƒ˜แƒก แƒšแƒ˜แƒ“แƒ”แƒ แƒ˜ แƒฏแƒ”แƒ แƒ”แƒ›แƒ˜ แƒ™แƒแƒ แƒ•แƒ˜แƒœแƒ˜
---
reference: แƒแƒ แƒ˜
predicted: แƒแƒ แƒ˜
---
reference: แƒ›แƒแƒก แƒจแƒ”แƒ›แƒ“แƒ”แƒ’ แƒ˜แƒ’แƒ˜ แƒ™แƒแƒšแƒ”แƒฅแƒขแƒ˜แƒ•แƒ˜แƒก แƒ›แƒฃแƒ“แƒ›แƒ˜แƒ•แƒ˜ แƒฌแƒ”แƒ•แƒ แƒ˜แƒ
predicted: แƒ›แƒแƒก แƒจแƒ”แƒ›แƒ“แƒ”แƒ’ แƒ˜แƒ’แƒ˜ แƒ™แƒแƒšแƒ”แƒฅแƒขแƒ˜แƒ•แƒ˜แƒก แƒคแƒฃแƒ“ แƒ›แƒ˜แƒ•แƒ˜ แƒฌแƒ”แƒ•แƒ แƒ˜แƒ
---
reference: แƒแƒ–แƒ”แƒ แƒ‘แƒแƒ˜แƒฏแƒแƒœแƒฃแƒš แƒคแƒ˜แƒšแƒแƒกแƒแƒคแƒ˜แƒแƒก แƒจแƒ”แƒ˜แƒซแƒšแƒ”แƒ‘แƒ แƒ›แƒ˜แƒ•แƒแƒ™แƒฃแƒ—แƒ•แƒœแƒแƒ— แƒ แƒฃแƒกแƒ”แƒ—แƒ˜แƒก แƒกแƒแƒ–แƒแƒ’แƒแƒ“แƒ แƒ›แƒแƒฆแƒ•แƒแƒฌแƒ” แƒฐแƒ”แƒ˜แƒ“แƒแƒ  แƒฏแƒ”แƒ›แƒแƒšแƒ˜
predicted: แƒแƒ–แƒ”แƒ แƒ’แƒ•แƒแƒ˜แƒฏแƒแƒœแƒแƒš แƒคแƒ˜แƒšแƒแƒกแƒแƒคแƒ˜แƒแƒก แƒจแƒ”แƒ˜แƒซแƒšแƒ”แƒ‘แƒ แƒ›แƒ˜แƒ•แƒแƒ™แƒฃแƒ—แƒ•แƒœแƒแƒ— แƒ แƒฃแƒกแƒ”แƒ—แƒ˜แƒก แƒกแƒแƒ–แƒแƒ’แƒแƒ“แƒ แƒ›แƒแƒฆแƒ•แƒแƒฌแƒ” แƒฐแƒ”แƒ˜แƒ“แƒแƒ  แƒฏแƒ”แƒ›แƒแƒšแƒ˜
---
reference: แƒ‘แƒ แƒแƒœแƒฅแƒกแƒจแƒ˜ แƒฏแƒ”แƒ แƒแƒ›แƒ˜แƒก แƒแƒ•แƒ”แƒœแƒ˜แƒฃ แƒฐแƒงแƒแƒคแƒก แƒ’แƒแƒ›แƒญแƒแƒš แƒฅแƒฃแƒฉแƒ”แƒ‘แƒก แƒแƒฆแƒ›แƒแƒกแƒแƒ•แƒšแƒ”แƒ— แƒ“แƒ แƒ“แƒแƒกแƒแƒ•แƒšแƒ”แƒ— แƒœแƒแƒฌแƒ˜แƒšแƒ”แƒ‘แƒแƒ“
predicted: แƒ แƒแƒœแƒ’แƒจแƒ˜ แƒ“แƒ”แƒ แƒแƒ›แƒ˜แƒฌ แƒแƒ•แƒ”แƒœแƒ˜แƒš แƒžแƒแƒคแƒก แƒ’แƒแƒ› แƒ“แƒแƒšแƒคแƒฃแƒ แƒฅแƒ”แƒ‘แƒก แƒแƒฆแƒ›แƒแƒกแƒแƒ•แƒšแƒ”แƒ— แƒ“แƒ แƒ“แƒแƒกแƒแƒ•แƒšแƒ”แƒ— แƒœแƒแƒฌแƒ˜แƒšแƒ”แƒ‘แƒแƒ“
---
reference: แƒฐแƒแƒ”แƒ แƒ˜ แƒแƒ แƒ˜แƒก แƒŸแƒแƒœแƒ’แƒ‘แƒแƒ“แƒ˜แƒก แƒ˜แƒก แƒซแƒ˜แƒ แƒ˜แƒ—แƒแƒ“แƒ˜ แƒฌแƒงแƒแƒ แƒ แƒ แƒแƒ›แƒ”แƒšแƒกแƒแƒช แƒกแƒแƒญแƒ˜แƒ แƒแƒ”แƒ‘แƒก แƒงแƒ•แƒ”แƒšแƒ แƒชแƒแƒชแƒฎแƒแƒšแƒ˜ แƒแƒ แƒ’แƒแƒœแƒ˜แƒ–แƒ›แƒ˜
predicted: แƒแƒ แƒ˜ แƒแƒ แƒ˜แƒก แƒฏแƒแƒ›แƒฃแƒ‘แƒแƒ“แƒ”แƒกแƒ˜แƒก แƒซแƒ˜แƒ แƒ˜แƒ—แƒแƒ“แƒ˜ แƒฌแƒงแƒแƒ แƒ แƒ แƒแƒ›แƒ”แƒšแƒกแƒแƒช แƒกแƒแƒญแƒ˜แƒ แƒแƒแƒ”แƒ‘แƒก แƒงแƒ•แƒ”แƒšแƒ แƒชแƒแƒชแƒฎแƒแƒšแƒ˜ แƒแƒ แƒ’แƒแƒœแƒ˜แƒ–แƒ›แƒ˜
---
reference: แƒฏแƒ’แƒฃแƒคแƒ˜ แƒฃแƒ›แƒ”แƒขแƒ”แƒกแƒฌแƒ˜แƒšแƒแƒ“ แƒแƒกแƒ แƒฃแƒšแƒ”แƒ‘แƒก แƒžแƒแƒžแƒ›แƒฃแƒกแƒ˜แƒ™แƒ˜แƒก แƒŸแƒแƒœแƒ แƒ˜แƒก แƒกแƒ˜แƒ›แƒฆแƒ”แƒ แƒ”แƒ‘แƒก
predicted: แƒฏแƒ’แƒฃแƒคแƒ˜แƒฃแƒ›แƒ”แƒขแƒ”แƒกแƒฌแƒ”แƒ•แƒแƒ“ แƒแƒกแƒ แƒฃแƒšแƒ”แƒ‘แƒก แƒžแƒแƒžแƒœแƒฃแƒกแƒ˜แƒ™แƒ˜แƒก แƒŸแƒแƒœแƒ แƒ˜แƒก แƒกแƒ˜แƒ›แƒ แƒ”แƒ แƒ”แƒ‘แƒก
---
reference: แƒ‘แƒแƒ‘แƒ˜แƒšแƒ˜แƒœแƒ แƒ›แƒฃแƒ“แƒ›แƒ˜แƒ•แƒแƒ“ แƒชแƒ“แƒ˜แƒšแƒแƒ‘แƒ“แƒ แƒจแƒ”แƒกแƒแƒซแƒšแƒ”แƒ‘แƒšแƒแƒ‘แƒ”แƒ‘แƒ˜แƒก แƒคแƒแƒ แƒ’แƒšแƒ”แƒ‘แƒจแƒ˜ แƒ›แƒ˜แƒ”แƒฆแƒ แƒชแƒแƒ“แƒœแƒ แƒ“แƒ แƒแƒฎแƒแƒšแƒ˜ แƒ˜แƒœแƒคแƒแƒ แƒ›แƒแƒชแƒ˜แƒ
predicted: แƒ‘แƒแƒ‘แƒ˜แƒšแƒ˜แƒœแƒ แƒ›แƒฃแƒ“แƒ›แƒ˜แƒ•แƒ แƒชแƒ“แƒ˜แƒšแƒแƒ‘แƒ“แƒ แƒจแƒ”แƒกแƒแƒซแƒšแƒ”แƒ‘แƒšแƒแƒ‘แƒ”แƒ‘แƒ˜แƒก แƒคแƒแƒ แƒ’แƒšแƒ”แƒ‘แƒจแƒ˜ แƒ›แƒ˜แƒ˜แƒฆแƒ แƒชแƒแƒขแƒœแƒ แƒ“แƒ แƒแƒฎแƒแƒšแƒ˜ แƒ˜แƒœแƒคแƒแƒ แƒ›แƒแƒชแƒ˜แƒ
---
reference: แƒ›แƒ แƒ”แƒ•แƒšแƒ˜แƒก แƒ แƒฌแƒ›แƒ”แƒœแƒ˜แƒ— แƒ แƒแƒ›แƒ”แƒšแƒ˜ แƒฏแƒ’แƒฃแƒคแƒ˜แƒช แƒ’แƒแƒ˜แƒ›แƒแƒ แƒฏแƒ•แƒ”แƒ‘แƒ“แƒ แƒ›แƒ—แƒ”แƒšแƒ˜ แƒฌแƒšแƒ˜แƒก แƒ›แƒแƒœแƒซแƒ˜แƒšแƒ–แƒ” แƒกแƒ˜แƒฃแƒฎแƒ•แƒ” แƒ“แƒ แƒ‘แƒแƒ แƒแƒฅแƒ แƒแƒ  แƒ›แƒแƒแƒ™แƒšแƒ“แƒ”แƒ‘แƒแƒ“แƒ
predicted: แƒ›แƒ แƒ”แƒ•แƒ แƒ˜แƒก แƒ แƒฌแƒ›แƒ”แƒœแƒ˜แƒ— แƒ แƒแƒ›แƒ”แƒšแƒ˜แƒฏแƒ’แƒฃแƒคแƒ˜แƒก แƒ’แƒแƒ˜แƒ›แƒแƒ แƒฏแƒ•แƒ”แƒ‘แƒ“แƒ แƒ›แƒ—แƒ”แƒšแƒ˜แƒญแƒšแƒ˜แƒก แƒ›แƒแƒœแƒซแƒ˜แƒšแƒ–แƒ แƒกแƒ˜แƒฃแƒงแƒ•แƒ”แƒขแƒแƒ‘แƒแƒ แƒแƒฅแƒ แƒแƒ  แƒ›แƒแƒแƒ™แƒšแƒ“แƒ”แƒ‘แƒแƒ“แƒ
---
reference: แƒœแƒ˜แƒœแƒ แƒฉแƒฎแƒ”แƒ˜แƒซแƒ”แƒก แƒ’แƒแƒœแƒกแƒแƒ™แƒฃแƒ—แƒ แƒ”แƒ‘แƒฃแƒšแƒ˜ แƒฆแƒ•แƒแƒฌแƒšแƒ˜ แƒ›แƒ˜แƒฃแƒซแƒฆแƒ•แƒ˜แƒก แƒฅแƒฃแƒ—แƒแƒ˜แƒกแƒ˜แƒกแƒ แƒ“แƒ แƒ แƒฃแƒกแƒ—แƒแƒ•แƒ”แƒšแƒ˜แƒก แƒ—แƒ”แƒแƒขแƒ แƒ”แƒ‘แƒ˜แƒก แƒจแƒ”แƒ›แƒแƒฅแƒ›แƒ”แƒ“แƒ”แƒ‘แƒ˜แƒ— แƒชแƒฎแƒแƒ•แƒ แƒ”แƒ‘แƒแƒจแƒ˜
predicted: แƒ›แƒ˜แƒœแƒ แƒฉแƒฎแƒ”แƒ˜แƒซแƒ”แƒก แƒ’แƒแƒœแƒกแƒแƒ™แƒฃแƒ—แƒ แƒ”แƒ‘แƒฃแƒšแƒ˜ แƒฆแƒแƒ•แƒแƒฌแƒšแƒ˜ แƒ›แƒ˜แƒแƒชแƒฎแƒ•แƒ˜แƒก แƒฅแƒฃแƒ—แƒแƒ˜แƒกแƒ˜แƒกแƒ แƒ“แƒ แƒ แƒฃแƒกแƒ—แƒแƒ•แƒ”แƒšแƒ˜แƒก แƒ—แƒ”แƒแƒขแƒ แƒ”แƒ‘แƒ˜แƒก แƒจแƒ”แƒ›แƒแƒฅแƒ›แƒ”แƒ“แƒ”แƒ‘แƒ˜แƒ— แƒชแƒฎแƒแƒ•แƒ แƒ”แƒ‘แƒแƒจแƒ˜
---
reference: แƒ˜แƒ’แƒ˜ แƒกแƒแƒ›แƒ˜ แƒ“แƒ˜แƒแƒšแƒ”แƒฅแƒขแƒ˜แƒกแƒ’แƒแƒœ แƒจแƒ”แƒ“แƒ’แƒ”แƒ‘แƒ
predicted: แƒ˜แƒ’แƒ˜ แƒกแƒแƒ›แƒ˜ แƒ“แƒ˜แƒแƒšแƒ”แƒ—แƒ˜แƒก แƒ’แƒแƒœ แƒจแƒ”แƒ“แƒ’แƒ”แƒ‘แƒ
---
reference: แƒคแƒแƒ แƒ›แƒ˜แƒ— แƒกแƒ˜แƒ แƒแƒฅแƒšแƒ”แƒ›แƒ”แƒ‘แƒก แƒฌแƒแƒแƒ’แƒ•แƒแƒœแƒแƒœ
predicted: แƒแƒ›แƒ˜แƒชแƒ˜ แƒ แƒแƒฅแƒšแƒ”แƒ›แƒ”แƒ‘แƒก แƒแƒแƒ’แƒ•แƒแƒœแƒแƒ›
---
reference: แƒ“แƒแƒœแƒ˜ แƒ“แƒแƒ˜แƒ‘แƒแƒ“แƒ แƒ™แƒแƒšแƒฃแƒ›แƒ‘แƒฃแƒกแƒจแƒ˜ แƒแƒฐแƒแƒ˜แƒแƒจแƒ˜
predicted: แƒ“แƒแƒœแƒ˜ แƒ“แƒแƒ˜แƒ‘แƒแƒแƒ“แƒ แƒ™แƒแƒšแƒฃแƒ›แƒ‘แƒฃแƒกแƒจแƒ˜ แƒแƒฎแƒ•แƒแƒ˜แƒแƒจแƒ˜
---
reference: แƒ›แƒจแƒ”แƒœแƒ”แƒ‘แƒšแƒแƒ‘แƒ˜แƒกแƒแƒ—แƒ•แƒ˜แƒก แƒ’แƒแƒ›แƒแƒ˜แƒงแƒ แƒแƒ“แƒ’แƒ˜แƒšแƒ˜ แƒงแƒแƒคแƒ˜แƒšแƒ˜ แƒแƒ”แƒ แƒแƒžแƒแƒ แƒขแƒ˜แƒก แƒ แƒแƒ˜แƒแƒœแƒจแƒ˜
predicted: แƒจแƒ”แƒœแƒ”แƒ‘แƒšแƒแƒ‘แƒ˜แƒกแƒแƒ—แƒ•แƒ˜แƒก แƒ’แƒแƒ›แƒแƒ˜แƒงแƒ แƒแƒ“แƒ’แƒ˜แƒšแƒ˜ แƒงแƒแƒคแƒ˜แƒšแƒ˜ แƒแƒ”แƒ แƒแƒžแƒแƒ แƒขแƒ˜แƒก แƒ แƒแƒ˜แƒแƒœแƒจแƒ˜
---
```

## Evaluation

The model can be evaluated as follows on the Georgian test data of Common Voice.

```python
import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset, load_metric

import numpy as np

from normalizer import normalizer


def speech_file_to_array_fn(batch):
    # Load the clip and resample it to the 16 kHz rate the model expects.
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16_000)

    batch["speech"] = speech_array
    return batch


def predict(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    pred_ids = torch.argmax(logits, dim=-1)

    batch["predicted"] = processor.batch_decode(pred_ids)[0]
    return batch


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-georgian")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-georgian").to(device)

dataset = load_dataset("common_voice", "ka", split="test")
dataset = dataset.map(
    normalizer,
    fn_kwargs={"remove_extra_space": True},
    remove_columns=list(set(dataset.column_names) - set(["sentence", "path"]))
)

dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict)

wer = load_metric("wer")

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["predicted"], references=result["sentence"])))
```

**Test Result:**

- WER: 43.86%
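
WER counts word-level substitutions, deletions, and insertions divided by the number of reference words, so 43.86% means roughly four in ten words differ from the normalized reference. A quick sanity check of the metric with `jiwer` (already installed in the requirements above):

```python
import jiwer

# One substituted word out of six reference words -> WER = 1/6 ~ 0.167.
print(jiwer.wer("the cat sat on the mat", "the cat sat on a mat"))
```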

## Training & Report

The Common Voice `train` and `validation` splits were used for training.
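
With `datasets`, the two splits can be concatenated directly in the split expression; a minimal sketch of how such a combined training set could be loaded:

```python
from datasets import load_dataset

# "train+validation" concatenates the two splits into one training set.
train_dataset = load_dataset("common_voice", "ka", split="train+validation")
```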

You can see the training states here.

The script used for training can be found here.

## Questions?

Post a GitHub issue on the Wav2Vec repo.