Plim committed on
Commit
e2f6d01
1 Parent(s): 89ae304

Training in progress, step 14000

Files changed (32)
  1. .gitattributes +1 -0
  2. .ipynb_checkpoints/README-checkpoint.md +105 -0
  3. .ipynb_checkpoints/create_lm-checkpoint.ipynb +309 -0
  4. .ipynb_checkpoints/log_mozilla-foundation_common_voice_8_0_fr_test_predictions-checkpoint.txt +0 -0
  5. .ipynb_checkpoints/log_mozilla-foundation_common_voice_8_0_fr_test_targets-checkpoint.txt +0 -0
  6. .ipynb_checkpoints/log_speech-recognition-community-v2_dev_data_fr_validation_predictions-checkpoint.txt +0 -0
  7. .ipynb_checkpoints/log_speech-recognition-community-v2_dev_data_fr_validation_targets-checkpoint.txt +0 -0
  8. .ipynb_checkpoints/mozilla-foundation_common_voice_8_0_fr_test_eval_results-checkpoint.txt +2 -0
  9. .ipynb_checkpoints/preprocessor_config-checkpoint.json +10 -0
  10. .ipynb_checkpoints/run-checkpoint.sh +2 -2
  11. alphabet.json +1 -0
  12. config.json +1 -1
  13. create_lm.ipynb +344 -0
  14. keep_model/pytorch_model.bin +3 -0
  15. langague_model/5gram.bin +3 -0
  16. langague_model/attrs.json +1 -0
  17. langague_model/unigrams.txt +0 -0
  18. pytorch_model.bin +1 -1
  19. run.sh +2 -2
  20. training_args.bin +1 -1
  21. wandb/debug-internal.log +1 -1
  22. wandb/debug.log +1 -1
  23. wandb/latest-run +1 -1
  24. wandb/run-20220206_201634-uhiy9e2t/files/conda-environment.yaml +0 -0
  25. wandb/run-20220206_201634-uhiy9e2t/files/config.yaml +0 -0
  26. wandb/run-20220206_201634-uhiy9e2t/files/output.log +1491 -0
  27. wandb/run-20220206_201634-uhiy9e2t/files/requirements.txt +183 -0
  28. wandb/run-20220206_201634-uhiy9e2t/files/wandb-metadata.json +61 -0
  29. wandb/run-20220206_201634-uhiy9e2t/files/wandb-summary.json +0 -0
  30. wandb/run-20220206_201634-uhiy9e2t/logs/debug-internal.log +0 -0
  31. wandb/run-20220206_201634-uhiy9e2t/logs/debug.log +26 -0
  32. wandb/run-20220206_201634-uhiy9e2t/run-uhiy9e2t.wandb +3 -0
.gitattributes CHANGED
@@ -26,3 +26,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zstandard filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  wandb/run-20220203_170643-2fkfdtzb/run-2fkfdtzb.wandb filter=lfs diff=lfs merge=lfs -text
+ wandb/run-20220206_201634-uhiy9e2t/run-uhiy9e2t.wandb filter=lfs diff=lfs merge=lfs -text
.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,105 @@
+ ---
+ language:
+ - fr
+ license: apache-2.0
+ tags:
+ - automatic-speech-recognition
+ - mozilla-foundation/common_voice_8_0
+ - generated_from_trainer
+ - robust-speech-event
+ model-index:
+ - name: XLS-R-1B - French
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice 8
+       type: mozilla-foundation/common_voice_8_0
+       args: fr
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 18.33
+     - name: Test CER
+       type: cer
+       value: 5.60
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Robust Speech Event - Dev Data
+       type: speech-recognition-community-v2/dev_data
+       args: fr
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 60.25
+     - name: Test CER
+       type: cer
+       value: 15.68
+ ---
+
+ ## Model description
+
+ This model is a fine-tuned version of [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - FR dataset.
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 7.5e-05
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - seed: 42
+ - gradient_accumulation_steps: 8
+ - total_train_batch_size: 128 (16 per device × 8 accumulation steps)
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 2000
+ - num_epochs: 4.0
+ - mixed_precision_training: Native AMP
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Wer |
+ |:-------------:|:-----:|:-----:|:---------------:|:------:|
+ | 0.9827 | 0.29 | 1000 | inf | 0.2937 |
+ | 1.0203 | 0.57 | 2000 | inf | 0.2711 |
+ | 1.0048 | 0.86 | 3000 | inf | 0.2620 |
+ | 0.9858 | 1.15 | 4000 | inf | 0.2522 |
+ | 0.9709 | 1.43 | 5000 | inf | 0.2365 |
+ | 0.9347 | 1.72 | 6000 | inf | 0.2332 |
+ | 0.9256 | 2.01 | 7000 | inf | 0.2261 |
+ | 0.8936 | 2.29 | 8000 | inf | 0.2203 |
+ | 0.877 | 2.58 | 9000 | inf | 0.2096 |
+ | 0.8393 | 2.87 | 10000 | inf | 0.2017 |
+ | 0.8156 | 3.15 | 11000 | inf | 0.1936 |
+ | 0.8015 | 3.44 | 12000 | inf | 0.1880 |
+ | 0.774 | 3.73 | 13000 | inf | 0.1834 |
+
+ It achieves its best result on the validation set at step 13000:
+ - Wer: 0.1834
+
+ An issue occurred when computing the validation loss, which is why it is reported as `inf` in the table above.
+
+ ### Framework versions
+
+ - Transformers 4.17.0.dev0
+ - Pytorch 1.10.2+cu102
+ - Datasets 1.18.3.dev0
+ - Tokenizers 0.11.0
+
+ ### Evaluation Commands
+
+ 1. To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`:
+
+ ```bash
+ python eval.py --model_id Plim/xls-r-1b-cv_8-fr --dataset mozilla-foundation/common_voice_8_0 --config fr --split test
+ ```
+
+ 2. To evaluate on `speech-recognition-community-v2/dev_data`:
+
+ ```bash
+ python eval.py --model_id Plim/xls-r-1b-cv_8-fr --dataset speech-recognition-community-v2/dev_data --config fr --split validation --chunk_length_s 5.0 --stride_length_s 1.0
+ ```
.ipynb_checkpoints/create_lm-checkpoint.ipynb ADDED
@@ -0,0 +1,309 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "7b5f7142",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import transformers\n",
+ "from datasets import load_dataset\n",
+ "import re"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "4ad6422f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "username = \"Plim\" # change to your username\n",
+ "target_lang = \"fr\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "37b2c1d6",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f230feb459c441a9a11e53b867e8914a",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/2.60k [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "a4a8fa35d48f4a6db8072baed6b2389b",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/29.6k [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Using custom data configuration en-fr-lang1=en,lang2=fr\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Downloading and preparing dataset europarl_bilingual/en-fr (download: 278.07 MiB, generated: 643.66 MiB, post-processed: Unknown size, total: 921.72 MiB) to /workspace/.cache/huggingface/datasets/europarl_bilingual/en-fr-lang1=en,lang2=fr/8.0.0/2ab0200e7729616bfd4a4df6bfb29b31746ceb5a59f8c75c02ca35e1ebead950...\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "aa00b5d6dc154449861dddcf9f0d2fc8",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/142M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "563096fc78454333b5ae23e87a7e3469",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/140M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "eba05e5151b34505b9a43e383cb6cfe0",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/9.30M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "0 examples [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dataset europarl_bilingual downloaded and prepared to /workspace/.cache/huggingface/datasets/europarl_bilingual/en-fr-lang1=en,lang2=fr/8.0.0/2ab0200e7729616bfd4a4df6bfb29b31746ceb5a59f8c75c02ca35e1ebead950. Subsequent calls will reuse this data.\n"
+ ]
+ }
+ ],
+ "source": [
+ "dataset = load_dataset(\"europarl_bilingual\", lang1=\"en\", lang2=target_lang, split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "81259294",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def extract_text(batch):\n",
+ " target_lang = \"fr\"\n",
+ " chars_to_ignore_regex = '[^a-zàâäçéèêëîïôöùûüÿ\\'’ ]'\n",
+ " text = batch[\"translation\"][target_lang]\n",
+ " batch[\"text\"] = re.sub(chars_to_ignore_regex, \"\", text.lower()).replace('’', \"'\")\n",
+ " return batch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "2dec7b80",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "00d998de52544f6c8750c53bc0c85d66",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "0ex [00:00, ?ex/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "dataset = dataset.map(extract_text, remove_columns=dataset.column_names)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "c6feaf74",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "461a219cdb6d42b2b890ec028c336e7f",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00<?, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "dataset.push_to_hub(f\"{target_lang}_corpora_parliament_processed\", split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "b0e6ae25",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open(\"text.txt\", \"w\") as file:\n",
+ " file.write(\" \".join(dataset[\"text\"]))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "f95596a5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open(\"5gram.arpa\", \"r\") as read_file, open(\"5gram_correct.arpa\", \"w\") as write_file:\n",
+ " has_added_eos = False\n",
+ " for line in read_file:\n",
+ " if not has_added_eos and \"ngram 1=\" in line:\n",
+ " count=line.strip().split(\"=\")[-1]\n",
+ " write_file.write(line.replace(f\"{count}\", f\"{int(count)+1}\"))\n",
+ " elif not has_added_eos and \"<s>\" in line:\n",
+ " write_file.write(line)\n",
+ " write_file.write(line.replace(\"<s>\", \"</s>\"))\n",
+ " has_added_eos = True\n",
+ " else:\n",
+ " write_file.write(line)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "f6489f25",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "file ./config.json not found\n"
+ ]
+ },
+ {
+ "ename": "OSError",
+ "evalue": "Can't load config for './'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './' is the correct path to a directory containing a config.json file",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mOSError\u001b[0m Traceback (most recent call last)",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py:585\u001b[0m, in \u001b[0;36mPretrainedConfig._get_config_dict\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 583\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 584\u001b[0m \u001b[38;5;66;03m# Load from URL or cache if already cached\u001b[39;00m\n\u001b[0;32m--> 585\u001b[0m resolved_config_file \u001b[38;5;241m=\u001b[39m \u001b[43mcached_path\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 586\u001b[0m \u001b[43m \u001b[49m\u001b[43mconfig_file\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 587\u001b[0m \u001b[43m \u001b[49m\u001b[43mcache_dir\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcache_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 588\u001b[0m \u001b[43m \u001b[49m\u001b[43mforce_download\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mforce_download\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 589\u001b[0m \u001b[43m \u001b[49m\u001b[43mproxies\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mproxies\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 590\u001b[0m \u001b[43m \u001b[49m\u001b[43mresume_download\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mresume_download\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 591\u001b[0m \u001b[43m \u001b[49m\u001b[43mlocal_files_only\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlocal_files_only\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 592\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_auth_token\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muse_auth_token\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 593\u001b[0m \u001b[43m \u001b[49m\u001b[43muser_agent\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muser_agent\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 594\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 596\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m RepositoryNotFoundError \u001b[38;5;28;01mas\u001b[39;00m err:\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py:1861\u001b[0m, in \u001b[0;36mcached_path\u001b[0;34m(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)\u001b[0m\n\u001b[1;32m 1859\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m urlparse(url_or_filename)\u001b[38;5;241m.\u001b[39mscheme \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[1;32m 1860\u001b[0m \u001b[38;5;66;03m# File, but it doesn't exist.\u001b[39;00m\n\u001b[0;32m-> 1861\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mEnvironmentError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfile \u001b[39m\u001b[38;5;132;01m{\u001b[39;00murl_or_filename\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m not found\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 1862\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 1863\u001b[0m \u001b[38;5;66;03m# Something unknown\u001b[39;00m\n",
+ "\u001b[0;31mOSError\u001b[0m: file ./config.json not found",
+ "\nDuring handling of the above exception, another exception occurred:\n",
+ "\u001b[0;31mOSError\u001b[0m Traceback (most recent call last)",
+ "Input \u001b[0;32mIn [1]\u001b[0m, in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mtransformers\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m AutoProcessor\n\u001b[0;32m----> 3\u001b[0m processor \u001b[38;5;241m=\u001b[39m \u001b[43mAutoProcessor\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_pretrained\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m./\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/models/auto/processing_auto.py:178\u001b[0m, in \u001b[0;36mAutoProcessor.from_pretrained\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 176\u001b[0m \u001b[38;5;66;03m# Otherwise, load config, if it can be loaded.\u001b[39;00m\n\u001b[1;32m 177\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(config, PretrainedConfig):\n\u001b[0;32m--> 178\u001b[0m config \u001b[38;5;241m=\u001b[39m \u001b[43mAutoConfig\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_pretrained\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpretrained_model_name_or_path\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 180\u001b[0m model_type \u001b[38;5;241m=\u001b[39m config_class_to_model_type(\u001b[38;5;28mtype\u001b[39m(config)\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__name__\u001b[39m)\n\u001b[1;32m 182\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mgetattr\u001b[39m(config, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mprocessor_class\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m) \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py:617\u001b[0m, in \u001b[0;36mAutoConfig.from_pretrained\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 615\u001b[0m kwargs[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mname_or_path\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m pretrained_model_name_or_path\n\u001b[1;32m 616\u001b[0m trust_remote_code \u001b[38;5;241m=\u001b[39m kwargs\u001b[38;5;241m.\u001b[39mpop(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtrust_remote_code\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[0;32m--> 617\u001b[0m config_dict, _ \u001b[38;5;241m=\u001b[39m \u001b[43mPretrainedConfig\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_config_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpretrained_model_name_or_path\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 618\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mauto_map\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m config_dict \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAutoConfig\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m config_dict[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mauto_map\u001b[39m\u001b[38;5;124m\"\u001b[39m]:\n\u001b[1;32m 619\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m trust_remote_code:\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py:537\u001b[0m, in \u001b[0;36mPretrainedConfig.get_config_dict\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 535\u001b[0m original_kwargs \u001b[38;5;241m=\u001b[39m copy\u001b[38;5;241m.\u001b[39mdeepcopy(kwargs)\n\u001b[1;32m 536\u001b[0m \u001b[38;5;66;03m# Get config dict associated with the base config file\u001b[39;00m\n\u001b[0;32m--> 537\u001b[0m config_dict, kwargs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mcls\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_get_config_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpretrained_model_name_or_path\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 539\u001b[0m \u001b[38;5;66;03m# That config file may point us toward another config file to use.\u001b[39;00m\n\u001b[1;32m 540\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mconfiguration_files\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m config_dict:\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py:626\u001b[0m, in \u001b[0;36mPretrainedConfig._get_config_dict\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 624\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mEnvironmentError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[1;32m 625\u001b[0m logger\u001b[38;5;241m.\u001b[39merror(err)\n\u001b[0;32m--> 626\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mEnvironmentError\u001b[39;00m(\n\u001b[1;32m 627\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCan\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt load config for \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mpretrained_model_name_or_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m. If you were trying to load it from \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 628\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mhttps://huggingface.co/models\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m, make sure you don\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt have a local directory with the same name. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 629\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mOtherwise, make sure \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mpretrained_model_name_or_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m is the correct path to a directory \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 630\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcontaining a \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mconfiguration_file\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m file\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 631\u001b[0m )\n\u001b[1;32m 633\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 634\u001b[0m \u001b[38;5;66;03m# Load config dict\u001b[39;00m\n\u001b[1;32m 635\u001b[0m config_dict \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mcls\u001b[39m\u001b[38;5;241m.\u001b[39m_dict_from_json_file(resolved_config_file)\n",
+ "\u001b[0;31mOSError\u001b[0m: Can't load config for './'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './' is the correct path to a directory containing a config.json file"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import AutoProcessor\n",
+ "\n",
+ "processor = AutoProcessor.from_pretrained(\"./\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ab24f645",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
.ipynb_checkpoints/log_mozilla-foundation_common_voice_8_0_fr_test_predictions-checkpoint.txt ADDED
The diff for this file is too large to render. See raw diff
 
.ipynb_checkpoints/log_mozilla-foundation_common_voice_8_0_fr_test_targets-checkpoint.txt ADDED
The diff for this file is too large to render. See raw diff
 
.ipynb_checkpoints/log_speech-recognition-community-v2_dev_data_fr_validation_predictions-checkpoint.txt ADDED
The diff for this file is too large to render. See raw diff
 
.ipynb_checkpoints/log_speech-recognition-community-v2_dev_data_fr_validation_targets-checkpoint.txt ADDED
The diff for this file is too large to render. See raw diff
 
.ipynb_checkpoints/mozilla-foundation_common_voice_8_0_fr_test_eval_results-checkpoint.txt ADDED
@@ -0,0 +1,2 @@
+ WER: 0.18333515105245937
+ CER: 0.05606368028384753
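These figures match the scores in the README. For reference, WER and CER of this form can be recomputed from the prediction/target logs committed alongside this file; a sketch using the `jiwer` library (the log file names are assumed to match the eval logs in this commit):

```python
import jiwer

# Hypothetical re-computation from the committed eval log files (names assumed).
with open("log_mozilla-foundation_common_voice_8_0_fr_test_predictions.txt") as f:
    predictions = f.read().splitlines()
with open("log_mozilla-foundation_common_voice_8_0_fr_test_targets.txt") as f:
    targets = f.read().splitlines()

print("WER:", jiwer.wer(targets, predictions))  # word error rate
print("CER:", jiwer.cer(targets, predictions))  # character error rate
```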
.ipynb_checkpoints/preprocessor_config-checkpoint.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "do_normalize": true,
+ "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+ "feature_size": 1,
+ "padding_side": "right",
+ "padding_value": 0,
+ "processor_class": "Wav2Vec2ProcessorWithLM",
+ "return_attention_mask": true,
+ "sampling_rate": 16000
+ }
.ipynb_checkpoints/run-checkpoint.sh CHANGED
@@ -20,8 +20,8 @@ python run_speech_recognition_ctc.py \
  --mask_feature_prob="0.25" \
  --mask_time_length="10" \
  --mask_time_prob="0.75" \
- --model_name_or_path="facebook/wav2vec2-xls-r-1b" \
- --num_train_epochs="4.0" \
+ --model_name_or_path="./checkpoint-13000" \
+ --num_train_epochs="6.0" \
  --output_dir="./" \
  --overwrite_output_dir \
  --per_device_train_batch_size="16" \
alphabet.json ADDED
@@ -0,0 +1 @@
+ {"labels": [" ", "'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "\u00e0", "\u00e2", "\u00e4", "\u00e7", "\u00e8", "\u00e9", "\u00ea", "\u00eb", "\u00ee", "\u00ef", "\u00f4", "\u00f6", "\u00f9", "\u00fb", "\u00fc", "\u00ff", "\u2047", ""], "is_bpe": false}
config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "facebook/wav2vec2-xls-r-1b",
+ "_name_or_path": "./checkpoint-13000",
  "activation_dropout": 0.1,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
create_lm.ipynb ADDED
@@ -0,0 +1,344 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "d354f2ac",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import transformers\n",
+ "from datasets import load_dataset\n",
+ "import re"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "fe33d468",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "username = \"Plim\" # change to your username\n",
+ "target_lang = \"fr\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "f84ba325",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f230feb459c441a9a11e53b867e8914a",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/2.60k [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "a4a8fa35d48f4a6db8072baed6b2389b",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/29.6k [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Using custom data configuration en-fr-lang1=en,lang2=fr\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Downloading and preparing dataset europarl_bilingual/en-fr (download: 278.07 MiB, generated: 643.66 MiB, post-processed: Unknown size, total: 921.72 MiB) to /workspace/.cache/huggingface/datasets/europarl_bilingual/en-fr-lang1=en,lang2=fr/8.0.0/2ab0200e7729616bfd4a4df6bfb29b31746ceb5a59f8c75c02ca35e1ebead950...\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "aa00b5d6dc154449861dddcf9f0d2fc8",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/142M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "563096fc78454333b5ae23e87a7e3469",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/140M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "eba05e5151b34505b9a43e383cb6cfe0",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/9.30M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "0 examples [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dataset europarl_bilingual downloaded and prepared to /workspace/.cache/huggingface/datasets/europarl_bilingual/en-fr-lang1=en,lang2=fr/8.0.0/2ab0200e7729616bfd4a4df6bfb29b31746ceb5a59f8c75c02ca35e1ebead950. Subsequent calls will reuse this data.\n"
+ ]
+ }
+ ],
+ "source": [
+ "dataset = load_dataset(\"europarl_bilingual\", lang1=\"en\", lang2=target_lang, split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "c26261e9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def extract_text(batch):\n",
+ " target_lang = \"fr\"\n",
+ " chars_to_ignore_regex = '[^a-zàâäçéèêëîïôöùûüÿ\\'’ ]'\n",
+ " text = batch[\"translation\"][target_lang]\n",
+ " batch[\"text\"] = re.sub(chars_to_ignore_regex, \"\", text.lower()).replace('’', \"'\")\n",
+ " return batch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "5434c0b7",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "00d998de52544f6c8750c53bc0c85d66",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "0ex [00:00, ?ex/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "dataset = dataset.map(extract_text, remove_columns=dataset.column_names)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "e1c780b8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "461a219cdb6d42b2b890ec028c336e7f",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00<?, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "dataset.push_to_hub(f\"{target_lang}_corpora_parliament_processed\", split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "41c0ab30",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open(\"text.txt\", \"w\") as file:\n",
+ " file.write(\" \".join(dataset[\"text\"]))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "4d6bfb67",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open(\"language_model/5gram.arpa\", \"r\") as read_file, open(\"language_model/5gram_correct.arpa\", \"w\") as write_file:\n",
+ " has_added_eos = False\n",
+ " for line in read_file:\n",
+ " if not has_added_eos and \"ngram 1=\" in line:\n",
+ " count=line.strip().split(\"=\")[-1]\n",
+ " write_file.write(line.replace(f\"{count}\", f\"{int(count)+1}\"))\n",
+ " elif not has_added_eos and \"<s>\" in line:\n",
+ " write_file.write(line)\n",
+ " write_file.write(line.replace(\"<s>\", \"</s>\"))\n",
+ " has_added_eos = True\n",
+ " else:\n",
+ " write_file.write(line)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "3407085c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import AutoProcessor\n",
+ "\n",
+ "processor = AutoProcessor.from_pretrained(\"./\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "5a60df92",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vocab_dict = processor.tokenizer.get_vocab()\n",
+ "sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "cd1a94ea",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Loading the LM will be faster if you build a binary file.\n",
+ "Reading /workspace/xls-r-1b-cv_8-fr/language_model/5gram_correct.arpa\n",
+ "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
+ "****************************************************************************************************\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pyctcdecode import build_ctcdecoder\n",
+ "\n",
+ "decoder = build_ctcdecoder(\n",
+ " labels=list(sorted_vocab_dict.keys()),\n",
+ " kenlm_model_path=\"language_model/5gram_correct.arpa\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "e627079a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import Wav2Vec2ProcessorWithLM\n",
+ "\n",
+ "processor_with_lm = Wav2Vec2ProcessorWithLM(\n",
+ " feature_extractor=processor.feature_extractor,\n",
+ " tokenizer=processor.tokenizer,\n",
+ " decoder=decoder\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "bc665f62",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "processor_with_lm.save_pretrained(\"Plim/xls-r-1b-cv_8-fr\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7bcbb30b",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
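The notebook builds and saves the LM-boosted processor but stops short of using it; a hedged decoding sketch (a 16 kHz float array `audio` is assumed to exist):

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

# Sketch only: load the published checkpoint and decode with the 5-gram LM.
processor = Wav2Vec2ProcessorWithLM.from_pretrained("Plim/xls-r-1b-cv_8-fr")
model = Wav2Vec2ForCTC.from_pretrained("Plim/xls-r-1b-cv_8-fr")

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs a pyctcdecode beam search over the CTC logits
print(processor.batch_decode(logits.numpy()).text)
```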
keep_model/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a7ac9a4075231a9b1f2ef054fe1161fdf7235b6c7bd018f7505d44da3332960
+ size 3850548401
langague_model/5gram.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:726c0eaeadf24aa621faaddc6640ddc431a65f45e1b16ff0e6a9af565facd09f
+ size 2075344331
langague_model/attrs.json ADDED
@@ -0,0 +1 @@
+ {"alpha": 0.5, "beta": 1.5, "unk_score_offset": -10.0, "score_boundary": true}
langague_model/unigrams.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4a7ac9a4075231a9b1f2ef054fe1161fdf7235b6c7bd018f7505d44da3332960
+ oid sha256:581f4c8322c68fc308f22b68839669bf48755ac5706016bfe60c0234ba26e947
  size 3850548401
run.sh CHANGED
@@ -20,8 +20,8 @@ python run_speech_recognition_ctc.py \
  --mask_feature_prob="0.25" \
  --mask_time_length="10" \
  --mask_time_prob="0.75" \
- --model_name_or_path="facebook/wav2vec2-xls-r-1b" \
- --num_train_epochs="4.0" \
+ --model_name_or_path="./checkpoint-13000" \
+ --num_train_epochs="6.0" \
  --output_dir="./" \
  --overwrite_output_dir \
  --per_device_train_batch_size="16" \
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:af168053e52049b75214ffe031549b7a5ed7c0e774b3f727dc7f6c7d61dd0f9c
+ oid sha256:e48a427c7cb614f5fbcb1a4088c5c4abce496c5c8b09e2e526d99a2586ab2194
  size 2991
wandb/debug-internal.log CHANGED
@@ -1 +1 @@
- run-20220203_170643-2fkfdtzb/logs/debug-internal.log
+ run-20220206_201634-uhiy9e2t/logs/debug-internal.log
wandb/debug.log CHANGED
@@ -1 +1 @@
- run-20220203_170643-2fkfdtzb/logs/debug.log
+ run-20220206_201634-uhiy9e2t/logs/debug.log
wandb/latest-run CHANGED
@@ -1 +1 @@
- run-20220203_170643-2fkfdtzb
+ run-20220206_201634-uhiy9e2t
wandb/run-20220206_201634-uhiy9e2t/files/conda-environment.yaml ADDED
File without changes
wandb/run-20220206_201634-uhiy9e2t/files/config.yaml ADDED
The diff for this file is too large to render. See raw diff
 
wandb/run-20220206_201634-uhiy9e2t/files/output.log ADDED
@@ -0,0 +1,1491 @@