Commit afb3566 by comodoro (0 parents)

Initial model

.gitattributes ADDED
pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
Fine_Tune_XLS_R_on_Common_Voice_sr_300m_CV8.ipynb ADDED
The diff for this file is too large to render.
README.md ADDED
---
language:
- sr
license: apache-2.0
tags:
- automatic-speech-recognition
- mozilla-foundation/common_voice_8_0
- generated_from_trainer
- robust-speech-event
- xlsr-fine-tuning-week
datasets:
- common_voice
model-index:
- name: Serbian comodoro Wav2Vec2 XLSR 300M CV8
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      args: sr
    metrics:
    - name: Test WER
      type: wer
      value: 48.3
    - name: Test CER
      type: cer
      value: 18.5
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# Serbian wav2vec2-xls-r-300m-sr-cv8

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the common_voice dataset.
It achieves the following results on the evaluation set:
- Loss: 1.7302
- Wer: 0.4825
- Cer: 0.1847

## Model description

More information needed

## Intended uses & limitations

More information needed
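
A minimal inference sketch (the repository id below is assumed from the card title and may differ):

```python
# Hedged example: transcribe Serbian speech with the fine-tuned checkpoint.
# The model id is an assumption based on the card title; adjust as needed.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="comodoro/wav2vec2-xls-r-300m-sr-cv8")

# any 16 kHz mono recording of Serbian speech
print(asr("sample.wav")["text"])
```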

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list):
- learning_rate: 0.0001
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 300
- num_epochs: 800
- mixed_precision_training: Native AMP
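
These settings correspond roughly to the following `TrainingArguments` sketch (the output directory is a placeholder; the original training script is not part of this commit):

```python
# Hedged mapping of the listed hyperparameters onto transformers.TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-sr-cv8",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=300,
    num_train_epochs=800,
    fp16=True,  # "Native AMP" mixed precision
)
```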

### Training results

| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
|:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
| 5.6536 | 15.0 | 1200 | 2.9744 | 1.0 | 1.0 |
| 2.7935 | 30.0 | 2400 | 1.6613 | 0.8998 | 0.4670 |
| 1.6538 | 45.0 | 3600 | 0.9248 | 0.6918 | 0.2699 |
| 1.2446 | 60.0 | 4800 | 0.9151 | 0.6452 | 0.2398 |
| 1.0766 | 75.0 | 6000 | 0.9110 | 0.5995 | 0.2207 |
| 0.9548 | 90.0 | 7200 | 1.0273 | 0.5921 | 0.2149 |
| 0.8919 | 105.0 | 8400 | 0.9929 | 0.5646 | 0.2117 |
| 0.8185 | 120.0 | 9600 | 1.0850 | 0.5483 | 0.2069 |
| 0.7692 | 135.0 | 10800 | 1.1001 | 0.5394 | 0.2055 |
| 0.7249 | 150.0 | 12000 | 1.1018 | 0.5380 | 0.1958 |
| 0.6786 | 165.0 | 13200 | 1.1344 | 0.5114 | 0.1941 |
| 0.6432 | 180.0 | 14400 | 1.1516 | 0.5054 | 0.1905 |
| 0.6009 | 195.0 | 15600 | 1.3149 | 0.5324 | 0.1991 |
| 0.5773 | 210.0 | 16800 | 1.2468 | 0.5124 | 0.1903 |
| 0.559 | 225.0 | 18000 | 1.2186 | 0.4956 | 0.1922 |
| 0.5298 | 240.0 | 19200 | 1.4483 | 0.5333 | 0.2085 |
| 0.5136 | 255.0 | 20400 | 1.2871 | 0.4802 | 0.1846 |
| 0.4824 | 270.0 | 21600 | 1.2891 | 0.4974 | 0.1885 |
| 0.4669 | 285.0 | 22800 | 1.3283 | 0.4942 | 0.1878 |
| 0.4511 | 300.0 | 24000 | 1.4502 | 0.5002 | 0.1994 |
| 0.4337 | 315.0 | 25200 | 1.4714 | 0.5035 | 0.1911 |
| 0.4221 | 330.0 | 26400 | 1.4971 | 0.5124 | 0.1962 |
| 0.3994 | 345.0 | 27600 | 1.4473 | 0.5007 | 0.1920 |
| 0.3892 | 360.0 | 28800 | 1.3904 | 0.4937 | 0.1887 |
| 0.373 | 375.0 | 30000 | 1.4971 | 0.4946 | 0.1902 |
| 0.3657 | 390.0 | 31200 | 1.4208 | 0.4900 | 0.1821 |
| 0.3559 | 405.0 | 32400 | 1.4648 | 0.4895 | 0.1835 |
| 0.3476 | 420.0 | 33600 | 1.4848 | 0.4946 | 0.1829 |
| 0.3276 | 435.0 | 34800 | 1.5597 | 0.4979 | 0.1873 |
| 0.3193 | 450.0 | 36000 | 1.7329 | 0.5040 | 0.1980 |
| 0.3078 | 465.0 | 37200 | 1.6379 | 0.4937 | 0.1882 |
| 0.3058 | 480.0 | 38400 | 1.5878 | 0.4942 | 0.1921 |
| 0.2987 | 495.0 | 39600 | 1.5590 | 0.4811 | 0.1846 |
| 0.2931 | 510.0 | 40800 | 1.6001 | 0.4825 | 0.1849 |
| 0.276 | 525.0 | 42000 | 1.7388 | 0.4942 | 0.1918 |
| 0.2702 | 540.0 | 43200 | 1.7037 | 0.4839 | 0.1866 |
| 0.2619 | 555.0 | 44400 | 1.6704 | 0.4755 | 0.1840 |
| 0.262 | 570.0 | 45600 | 1.6042 | 0.4751 | 0.1865 |
| 0.2528 | 585.0 | 46800 | 1.6402 | 0.4821 | 0.1865 |
| 0.2442 | 600.0 | 48000 | 1.6693 | 0.4886 | 0.1862 |
| 0.244 | 615.0 | 49200 | 1.6203 | 0.4765 | 0.1792 |
| 0.2388 | 630.0 | 50400 | 1.6829 | 0.4830 | 0.1828 |
| 0.2362 | 645.0 | 51600 | 1.8100 | 0.4928 | 0.1888 |
| 0.2224 | 660.0 | 52800 | 1.7746 | 0.4932 | 0.1899 |
| 0.2218 | 675.0 | 54000 | 1.7752 | 0.4946 | 0.1901 |
| 0.2201 | 690.0 | 55200 | 1.6775 | 0.4788 | 0.1844 |
| 0.2147 | 705.0 | 56400 | 1.7085 | 0.4844 | 0.1851 |
| 0.2103 | 720.0 | 57600 | 1.7624 | 0.4848 | 0.1864 |
| 0.2101 | 735.0 | 58800 | 1.7213 | 0.4783 | 0.1835 |
| 0.1983 | 750.0 | 60000 | 1.7452 | 0.4848 | 0.1856 |
| 0.2015 | 765.0 | 61200 | 1.7525 | 0.4872 | 0.1869 |
| 0.1969 | 780.0 | 62400 | 1.7443 | 0.4844 | 0.1852 |
| 0.2043 | 795.0 | 63600 | 1.7302 | 0.4825 | 0.1847 |


### Framework versions

- Transformers 4.16.2
- PyTorch 1.10.1+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0
added_tokens.json ADDED
{"<s>": 33, "</s>": 34}
config.json ADDED
{
  "_name_or_path": "facebook/wav2vec2-xls-r-300m",
  "activation_dropout": 0.0,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": ["Wav2Vec2ForCTC"],
  "attention_dropout": 0.2,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 768,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": true,
  "conv_dim": [512, 512, 512, 512, 512, 512, 512],
  "conv_kernel": [10, 3, 3, 3, 3, 2, 2],
  "conv_stride": [5, 2, 2, 2, 2, 2, 2],
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": true,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "layer",
  "feat_proj_dropout": 0.1,
  "feat_quantizer_dropout": 0.0,
  "final_dropout": 0.0,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout": 0.2,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.4,
  "mask_feature_length": 10,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.5,
  "model_type": "wav2vec2",
  "num_adapter_layers": 3,
  "num_attention_heads": 16,
  "num_codevector_groups": 2,
  "num_codevectors_per_group": 320,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 24,
  "num_negatives": 100,
  "output_hidden_size": 1024,
  "pad_token_id": 32,
  "proj_codevector_dim": 768,
  "tdnn_dilation": [1, 2, 3, 1, 1],
  "tdnn_dim": [512, 512, 512, 512, 1500],
  "tdnn_kernel": [5, 3, 3, 1, 1],
  "torch_dtype": "float32",
  "transformers_version": "4.16.2",
  "use_weighted_layer_sum": false,
  "vocab_size": 35,
  "xvector_output_dim": 512
}
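
As a quick sanity check, the configuration can be loaded and compared against the vocabulary files; a sketch, assuming the repository files sit in the current directory:

```python
# Sketch: load config.json and confirm it is consistent with the tokenizer files.
from transformers import Wav2Vec2Config

config = Wav2Vec2Config.from_pretrained(".")
# 33 entries in vocab.json plus the added tokens <s> and </s> give 35
assert config.vocab_size == 35
print(config.model_type, config.num_hidden_layers, config.hidden_size)  # wav2vec2 24 1024
```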
eval.py ADDED
#!/usr/bin/env python3
from datasets import load_dataset, load_metric, Audio, Dataset
from transformers import pipeline, AutoFeatureExtractor, AutoTokenizer, Wav2Vec2ForCTC
import re
import argparse
import unicodedata
from typing import Dict


def log_results(result: Dataset, args: Dict[str, str]):
    """DO NOT CHANGE. This function computes and logs the result metrics."""

    log_outputs = args.log_outputs
    dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])

    # load metrics
    wer = load_metric("wer")
    cer = load_metric("cer")

    # compute metrics
    wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
    cer_result = cer.compute(references=result["target"], predictions=result["prediction"])

    # print & log results
    result_str = f"WER: {wer_result}\nCER: {cer_result}"
    print(result_str)

    with open(f"{dataset_id}_eval_results.txt", "w") as f:
        f.write(result_str)

    # log all results in text files, possibly interesting for analysis
    if log_outputs:
        pred_file = f"log_{dataset_id}_predictions.txt"
        target_file = f"log_{dataset_id}_targets.txt"

        with open(pred_file, "w") as p, open(target_file, "w") as t:

            # mapping function to write output
            def write_to_file(batch, i):
                p.write(f"{i}\n")
                p.write(batch["prediction"] + "\n")
                t.write(f"{i}\n")
                t.write(batch["target"] + "\n")

            result.map(write_to_file, with_indices=True)


def normalize_text(text: str) -> str:
    """DO ADAPT FOR YOUR USE CASE. This function normalizes the target text."""

    # map characters that fall outside the model vocabulary to rough equivalents
    CHARS = {
        'ü': 'ue',
        'ö': 'oe',
        'ï': 'i',
        'ë': 'e',
        'ä': 'ae',
        'ã': 'a',
        'à': 'á',
        'ø': 'o',
        'è': 'é',
        'ê': 'é',
        'å': 'ó',
        'î': 'i',
        'ñ': 'ň',
        'ç': 's',
        'ľ': 'l',
        'ż': 'ž',
        'ł': 'w',
        'ć': 'č',
        'þ': 't',
        'ß': 'ss',
        'ę': 'en',
        'ą': 'an',
        'æ': 'ae',
    }

    def replace_chars(sentence):
        return ''.join(CHARS.get(ch, ch) for ch in sentence)

    chars_to_ignore_regex = '[\,\?\.\!\-\;\:\/\"\“\„\%\”\�\–\'\`\«\»\—\’\…]'

    text = text.lower()
    # normalize non-standard (stylized) unicode characters
    text = unicodedata.normalize('NFKC', text)
    # remove punctuation
    text = re.sub(chars_to_ignore_regex, "", text)
    text = replace_chars(text)

    # collapse all kinds of newlines, tabs and repeated spaces
    text = " ".join(text.split())

    return text


def main(args):
    # load dataset
    dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True)

    # optionally process only a subset, e.g. for a quick test
    if args.limit:
        dataset = dataset.select(range(args.limit))

    if not args.model_id and not args.path:
        raise RuntimeError('No model given!')

    if not args.model_id:
        # build the pipeline from a local model directory
        model = Wav2Vec2ForCTC.from_pretrained(args.path)
        tokenizer = AutoTokenizer.from_pretrained(args.path)
        feature_extractor = AutoFeatureExtractor.from_pretrained(args.path)
        asr = pipeline("automatic-speech-recognition", model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)
    else:
        # load the model from the Hub by id
        feature_extractor = AutoFeatureExtractor.from_pretrained(args.model_id)
        asr = pipeline("automatic-speech-recognition", model=args.model_id)

    # map function to decode audio
    def map_to_pred(batch):
        prediction = asr(batch["audio"]["array"], chunk_length_s=args.chunk_length_s, stride_length_s=args.stride_length_s)
        batch["prediction"] = prediction["text"]
        batch["target"] = normalize_text(batch["sentence"])
        return batch

    # resample audio to the rate the feature extractor expects
    sampling_rate = feature_extractor.sampling_rate
    dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))

    # run inference on all examples
    result = dataset.map(map_to_pred, remove_columns=dataset.column_names)

    # compute and log results; do not change the function below
    log_results(result, args)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--model_id", type=str, default='', help="Model identifier. Should be loadable with 🤗 Transformers"
    )
    parser.add_argument(
        "--dataset", type=str, required=True, help="Dataset name to evaluate the model on. Should be loadable with 🤗 Datasets"
    )
    parser.add_argument(
        "--config", type=str, required=True, help="Config of the dataset. *E.g.* `'en'` for Common Voice"
    )
    parser.add_argument(
        "--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`"
    )
    parser.add_argument(
        "--chunk_length_s", type=float, default=None, help="Chunk length in seconds. Defaults to None. For long audio files a good value is 5.0 seconds."
    )
    parser.add_argument(
        "--stride_length_s", type=float, default=None, help="Stride of the audio chunks in seconds. Defaults to None. For long audio files a good value is 1.0 seconds."
    )
    parser.add_argument(
        "--log_outputs", action='store_true', help="If set, write outputs to log files for analysis."
    )
    parser.add_argument(
        "--path", type=str, default='', help="If set and model_id is not set, use a local model from this path."
    )
    parser.add_argument(
        "--limit", type=int, default=0, help="If greater than zero, evaluate only on a subset of this size."
    )
    args = parser.parse_args()

    main(args)
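
To illustrate the normalization step, a small sketch (assuming it runs from the directory containing eval.py):

```python
# Hypothetical demo of normalize_text from eval.py above.
from eval import normalize_text

# lowercasing, punctuation removal and whitespace collapsing
print(normalize_text("Добар дан,  свете!"))  # -> "добар дан свете"
```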
preprocessor_config.json ADDED
{
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "Wav2Vec2Processor",
  "return_attention_mask": true,
  "sampling_rate": 16000
}
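
A sketch of how this configuration is consumed (assuming the repository files are in the current directory):

```python
# Sketch: the processor pairs this feature extractor with the CTC tokenizer
# and expects 16 kHz mono float audio, as declared above.
import numpy as np
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained(".")
speech = np.random.randn(16000).astype(np.float32)  # one second of audio at 16 kHz
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
print(inputs.input_values.shape)  # torch.Size([1, 16000])
```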
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:4837db0336473532f001e0c66f905b7e4e446b9e5c4fe5f19dbbd0f2e8184002
size 1262067185
special_tokens_map.json ADDED
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "[UNK]", "pad_token": "[PAD]", "additional_special_tokens": [{"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}]}
tokenizer_config.json ADDED
{"unk_token": "[UNK]", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "[PAD]", "do_lower_case": false, "word_delimiter_token": "|", "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "./", "tokenizer_class": "Wav2Vec2CTCTokenizer", "processor_class": "Wav2Vec2Processor"}
vocab.json ADDED
{"а": 1, "б": 2, "в": 3, "г": 4, "д": 5, "е": 6, "ж": 7, "з": 8, "и": 9, "к": 10, "л": 11, "м": 12, "н": 13, "о": 14, "п": 15, "р": 16, "с": 17, "т": 18, "у": 19, "ф": 20, "х": 21, "ц": 22, "ч": 23, "ш": 24, "ђ": 25, "ј": 26, "љ": 27, "њ": 28, "ћ": 29, "џ": 30, "|": 0, "[UNK]": 31, "[PAD]": 32}