Merge branch 'main' of https://huggingface.co/KBLab/wav2vec2-large-voxrex-swedish into main


Files changed (4)

README.md +16 -9
chart_1.svg +0 -0
comparison.png +0 -0
tokenizer_config.json +9 -1

README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-language: sv-SE
 datasets:
 - common_voice
 - NST Swedish ASR Database
@@ -10,11 +10,11 @@ tags:
 - audio
 - automatic-speech-recognition
 - speech
-license: cc0
 model-index:
 - name: Wav2vec 2.0 large VoxRex Swedish
   results:
-  - task:
       name: Speech Recognition
       type: automatic-speech-recognition
     dataset:
@@ -22,18 +22,25 @@ model-index:
       type: common_voice
       args: sv-SE
     metrics:
-       - name: Test WER
-         type: wer
-         value: 9.914
 ---
-# Wav2vec 2.0 large VoxRex Swedish
 Fine-tuned version of KB's [VoxRex large](https://huggingface.co/KBLab/wav2vec2-large-voxrex) model using Swedish radio broadcasts, NST and Common Voice data. Evaluation without a language model gives the following: WER for NST + Common Voice test set (2% of total sentences) is **3.617%**. WER for Common Voice test set is **9.914%** directly and **7.77%** with a 4-gram language model.
 When using this model, make sure that your speech input is sampled at 16kHz.
 ## Training
-This model has additionally been pretrained on 3500h of a mix of Swedish local radio broadcasts, audio books and other audio sources. It has been fine-tuned for 120000 updates on NST + CommonVoice and then for an additional 20000 updates on CommonVoice only. The additional fine-tuning on CommonVoice hurts performance on the NST+CommonVoice test set somewhat and, unsurprisingly, improves it on the CommonVoice test set. It seems to perform generally better though [citation needed].
 ![WER during training](chart_1.svg "WER")
@@ -46,7 +53,7 @@ from datasets import load_dataset
 from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
 test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")
 processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
-model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-voxpopuli-sv-swedish")
 resampler = torchaudio.transforms.Resample(48_000, 16_000)
 # Preprocessing the datasets.
 # We need to read the audio files as arrays

 ---
+language: sv
 datasets:
 - common_voice
 - NST Swedish ASR Database
 - audio
 - automatic-speech-recognition
 - speech
+license: cc0-1.0
 model-index:
 - name: Wav2vec 2.0 large VoxRex Swedish
   results:
+  - task:
       name: Speech Recognition
       type: automatic-speech-recognition
     dataset:
       type: common_voice
       args: sv-SE
     metrics:
+    - name: Test WER
+      type: wer
+      value: 9.914
 ---
+# Wav2vec 2.0 large VoxRex Swedish (B)
+**Disclaimer:** This is a work in progress. See [VoxRex](https://huggingface.co/KBLab/wav2vec2-large-voxrex) for more details.
 Fine-tuned version of KB's [VoxRex large](https://huggingface.co/KBLab/wav2vec2-large-voxrex) model using Swedish radio broadcasts, NST and Common Voice data. Evaluation without a language model gives the following: WER for NST + Common Voice test set (2% of total sentences) is **3.617%**. WER for Common Voice test set is **9.914%** directly and **7.77%** with a 4-gram language model.
 When using this model, make sure that your speech input is sampled at 16kHz.
+# Performance\*
+![Comparison](comparison.png "Comparison")
+<center>*<i>Chart shows performance without the additional 20k steps of Common Voice fine-tuning</i></center>
 ## Training
+This model has been fine-tuned for 120000 updates on NST + CommonVoice and then for an additional 20000 updates on CommonVoice only. The additional fine-tuning on CommonVoice hurts performance on the NST+CommonVoice test set somewhat and, unsurprisingly, improves it on the CommonVoice test set. It seems to perform generally better though [citation needed].
 ![WER during training](chart_1.svg "WER")
 from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
 test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")
 processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
+model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
 resampler = torchaudio.transforms.Resample(48_000, 16_000)
 # Preprocessing the datasets.
 # We need to read the audio files as arrays

chart_1.svg CHANGED

comparison.png ADDED

tokenizer_config.json CHANGED

	@@ -1 +1,9 @@
-{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "<pad>", "do_lower_case": true, "word_delimiter_token": "|", "tokenizer_class": "Wav2Vec2CTCTokenizer"}

+{
+   "bos_token" : "<s>",
+   "do_lower_case" : true,
+   "eos_token" : "</s>",
+   "pad_token" : "<pad>",
+   "tokenizer_class" : "Wav2Vec2CTCTokenizer",
+   "unk_token" : "<unk>",
+   "word_delimiter_token" : "|"
+}
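
The `tokenizer_config.json` change above appears to be a pure reformat: the same seven keys, sorted and pretty-printed. A minimal sketch of how that could be checked is shown below. It assumes the `\|` seen in the one-line version is a rendering escape for a plain `|`, and the variable names `old_config`, `new_config` and `configs_match` are illustrative only.

```python
import json

# One-line form removed by the commit.
# Assumption: the "\|" shown in the diff view stands for a plain "|".
old_config = json.loads(
    '{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", '
    '"pad_token": "<pad>", "do_lower_case": true, '
    '"word_delimiter_token": "|", "tokenizer_class": "Wav2Vec2CTCTokenizer"}'
)

# Pretty-printed form added by the commit.
new_config = json.loads("""
{
   "bos_token" : "<s>",
   "do_lower_case" : true,
   "eos_token" : "</s>",
   "pad_token" : "<pad>",
   "tokenizer_class" : "Wav2Vec2CTCTokenizer",
   "unk_token" : "<unk>",
   "word_delimiter_token" : "|"
}
""")

# Dict comparison ignores key order and whitespace, so this is True
# only when both forms carry the same keys and values.
configs_match = old_config == new_config
print(configs_match)
```

If the comparison holds, the commit should not change tokenizer behavior, only the file layout.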