cahya committed
Commit 772a66e
1 Parent(s): d88653c

enabled lm

Files changed (2):
1. README.md +150 -98
2. preprocessor_config.json +1 -0
README.md CHANGED
@@ -1,105 +1,157 @@
  ---
- language:
- - lg
- license: apache-2.0
+ language: lg
+ datasets:
+ - mozilla-foundation/common_voice_7_0
+ metrics:
+ - wer
  tags:
+ - audio
  - automatic-speech-recognition
- - mozilla-foundation/common_voice_7_0
- - generated_from_trainer
- datasets:
+ - speech
  - common_voice
+ - lg
+ - robust-speech-event
+ license: apache-2.0
  model-index:
- - name: ''
-   results: []
+ - name: Wav2Vec2 Luganda by Indonesian-NLP
+   results:
+   - task:
+       name: Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice lg
+       type: common_voice
+       args: lg
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 7.53
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice 7
+       type: mozilla-foundation/common_voice_7_0
+       args: lg
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 8.147
+     - name: Test CER
+       type: cer
+       value: 2.802
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- #
-
- This model is a fine-tuned version of [indonesian-nlp/wav2vec2-luganda](https://huggingface.co/indonesian-nlp/wav2vec2-luganda) on the MOZILLA-FOUNDATION/COMMON_VOICE_7_0 - LG dataset.
- It achieves the following results on the evaluation set:
- - Loss: 8.8279
- - Wer: 1.0123
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1e-08
- - train_batch_size: 64
- - eval_batch_size: 2
- - seed: 42
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 256
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 10
- - num_epochs: 10.0
- - mixed_precision_training: Native AMP
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Wer |
- |:-------------:|:-----:|:----:|:---------------:|:------:|
- | 8.1684 | 0.25 | 10 | 8.8588 | 1.0125 |
- | 8.1428 | 0.5 | 20 | 8.8569 | 1.0125 |
- | 8.1333 | 0.75 | 30 | 8.8552 | 1.0124 |
- | 8.7873 | 1.03 | 40 | 8.8532 | 1.0124 |
- | 8.1298 | 1.28 | 50 | 8.8516 | 1.0124 |
- | 8.1445 | 1.53 | 60 | 8.8499 | 1.0123 |
- | 8.1635 | 1.78 | 70 | 8.8483 | 1.0124 |
- | 8.7587 | 2.05 | 80 | 8.8468 | 1.0125 |
- | 8.1424 | 2.3 | 90 | 8.8454 | 1.0124 |
- | 8.1318 | 2.55 | 100 | 8.8440 | 1.0124 |
- | 8.1469 | 2.81 | 110 | 8.8428 | 1.0125 |
- | 8.7602 | 3.08 | 120 | 8.8416 | 1.0125 |
- | 8.1584 | 3.33 | 130 | 8.8405 | 1.0126 |
- | 8.142 | 3.58 | 140 | 8.8394 | 1.0126 |
- | 8.1285 | 3.83 | 150 | 8.8384 | 1.0124 |
- | 8.7756 | 4.1 | 160 | 8.8371 | 1.0124 |
- | 8.0991 | 4.35 | 170 | 8.8363 | 1.0125 |
- | 8.1442 | 4.6 | 180 | 8.8354 | 1.0124 |
- | 8.1294 | 4.86 | 190 | 8.8346 | 1.0124 |
- | 8.7276 | 5.13 | 200 | 8.8338 | 1.0125 |
- | 8.1439 | 5.38 | 210 | 8.8329 | 1.0124 |
- | 8.1115 | 5.63 | 220 | 8.8322 | 1.0124 |
- | 8.1501 | 5.88 | 230 | 8.8316 | 1.0125 |
- | 8.7143 | 6.15 | 240 | 8.8308 | 1.0124 |
- | 8.143 | 6.4 | 250 | 8.8302 | 1.0124 |
- | 8.1528 | 6.65 | 260 | 8.8300 | 1.0125 |
- | 8.1293 | 6.91 | 270 | 8.8297 | 1.0124 |
- | 8.7519 | 7.18 | 280 | 8.8293 | 1.0125 |
- | 8.1153 | 7.43 | 290 | 8.8289 | 1.0124 |
- | 8.1292 | 7.68 | 300 | 8.8288 | 1.0124 |
- | 8.0904 | 7.93 | 310 | 8.8284 | 1.0124 |
- | 8.7425 | 8.2 | 320 | 8.8283 | 1.0125 |
- | 8.0963 | 8.45 | 330 | 8.8281 | 1.0124 |
- | 8.1112 | 8.7 | 340 | 8.8281 | 1.0124 |
- | 8.124 | 8.96 | 350 | 8.8281 | 1.0125 |
- | 8.7327 | 9.23 | 360 | 8.8279 | 1.0123 |
- | 8.1261 | 9.48 | 370 | 8.8279 | 1.0126 |
- | 8.1259 | 9.73 | 380 | 8.8279 | 1.0124 |
- | 8.1116 | 9.98 | 390 | 8.8279 | 1.0123 |
-
-
- ### Framework versions
-
- - Transformers 4.17.0.dev0
- - Pytorch 1.10.2+cu102
- - Datasets 1.18.3
- - Tokenizers 0.11.0
+ # Automatic Speech Recognition for Luganda
+
+ This is the model built for the
+ [Mozilla Luganda Automatic Speech Recognition competition](https://zindi.africa/competitions/mozilla-luganda-automatic-speech-recognition).
+ It is a [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
+ model fine-tuned on the [Luganda Common Voice dataset](https://huggingface.co/datasets/common_voice) version 7.0.
+
+ We also provide a [live demo](https://huggingface.co/spaces/indonesian-nlp/luganda-asr) to test the model.
+
+ When using this model, make sure that your speech input is sampled at 16kHz.
+
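+ If your data is at a different rate, the `datasets` library can also resample on the fly; a minimal sketch of that alternative (the manual `torchaudio` approach is shown in the snippets below):
+
+ ```python
+ from datasets import load_dataset, Audio
+
+ ds = load_dataset("common_voice", "lg", split="test[:2%]")
+ # Decode the audio column at 16 kHz instead of the native 48 kHz
+ ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
+ ```
+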
56
+ ## Usage
57
+ The model can be used directly (without a language model) as follows:
58
+ ```python
59
+ import torch
60
+ import torchaudio
61
+ from datasets import load_dataset
62
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
63
+
64
+ test_dataset = load_dataset("common_voice", "lg", split="test[:2%]")
65
+
66
+ processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-luganda")
67
+ model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-luganda")
68
+
69
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
70
+
71
+ # Preprocessing the datasets.
72
+ # We need to read the aduio files as arrays
73
+ def speech_file_to_array_fn(batch):
74
+ if "audio" in batch:
75
+ speech_array = torch.tensor(batch["audio"]["array"])
76
+ else:
77
+ speech_array, sampling_rate = torchaudio.load(batch["path"])
78
+ batch["speech"] = resampler(speech_array).squeeze().numpy()
79
+ return batch
80
+
81
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
82
+ inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
83
+
84
+ with torch.no_grad():
85
+ logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
86
+
87
+ predicted_ids = torch.argmax(logits, dim=-1)
88
+
89
+ print("Prediction:", processor.batch_decode(predicted_ids))
90
+ print("Reference:", test_dataset[:2]["sentence"])
91
+ ```
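+
+ Since this repository also ships a KenLM language model (see `processor_class` in `preprocessor_config.json`), decoding can be LM-boosted with `Wav2Vec2ProcessorWithLM`. A minimal sketch, reusing `test_dataset` from the snippet above and assuming `pyctcdecode` and `kenlm` are installed:
+
+ ```python
+ import torch
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
+
+ processor = Wav2Vec2ProcessorWithLM.from_pretrained("indonesian-nlp/wav2vec2-luganda")
+ model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-luganda")
+
+ inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ # The LM processor decodes raw logits (beam search + KenLM), not argmax ids
+ print("Prediction:", processor.batch_decode(logits.numpy()).text)
+ ```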
+
+ ## Evaluation
+
+ The model can be evaluated as follows on the Luganda test data of Common Voice.
+
+ ```python
+ import torch
+ import torchaudio
+ from datasets import load_dataset, load_metric
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+ import re
+
+ test_dataset = load_dataset("common_voice", "lg", split="test")
+ wer = load_metric("wer")
+
+ processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-luganda")
+ model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-luganda")
+ model.to("cuda")
+
+ chars_to_ignore = [",", "?", ".", "!", "-", ";", ":", "%", "'", '"', "�", "‘", "’"]
+ # re.escape keeps characters such as "-" from being read as a regex range
+ chars_to_ignore_regex = f'[{re.escape("".join(chars_to_ignore))}]'
+
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
+
+ # Preprocessing the datasets.
+ # We need to read the audio files as arrays
+ def speech_file_to_array_fn(batch):
+     batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
+     if "audio" in batch:
+         # ensure float32 for the resampler and the model
+         speech_array = torch.tensor(batch["audio"]["array"], dtype=torch.float32)
+     else:
+         speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+
+ # Run batched inference on the GPU, then greedy CTC decoding
+ def evaluate(batch):
+     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+     with torch.no_grad():
+         logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_strings"] = processor.batch_decode(pred_ids)
+     return batch
+
+ result = test_dataset.map(evaluate, batched=True, batch_size=8)
+
+ print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
+ ```
+
+ WER without KenLM: 15.38 %
+
+ WER with KenLM:
+
+ **Test Result**: 7.53 %
+
+ ## Training
+
+ The Common Voice `train`, `validation`, and ... datasets were used for training as well as ... and ... # TODO
+
+ The script used for training can be found [here](https://github.com/indonesian-nlp/luganda-asr).
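+
+ A minimal sketch of loading the training splits with the `datasets` library (only `train` and `validation` are shown, since the remaining splits are still marked TODO above; the gated Common Voice 7 dataset requires an access token):
+
+ ```python
+ from datasets import load_dataset
+
+ # "train+validation" concatenates the two named splits
+ train_data = load_dataset(
+     "mozilla-foundation/common_voice_7_0", "lg",
+     split="train+validation", use_auth_token=True,
+ )
+ ```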
preprocessor_config.json CHANGED
@@ -5,5 +5,6 @@
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
+ "processor_class": "Wav2Vec2ProcessorWithLM",
  "sampling_rate": 16000
  }
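
The added `processor_class` key tells `AutoProcessor` which processor class to instantiate, so the LM-aware processor loads without naming it explicitly. A minimal sketch:

```python
from transformers import AutoProcessor

# Resolves to Wav2Vec2ProcessorWithLM via preprocessor_config.json
processor = AutoProcessor.from_pretrained("indonesian-nlp/wav2vec2-luganda")
```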