PereLluis13 commited on
Commit
5d10cba
1 Parent(s): 62f6bdb

Update with pitched up data checkpoint

Browse files
Files changed (2) hide show
  1. README.md +8 -9
  2. pytorch_model.bin +1 -1
README.md CHANGED
@@ -23,7 +23,7 @@ model-index:
23
  metrics:
24
  - name: Test WER
25
  type: wer
26
- value: {wer_result_on_test} #TODO (IMPORTANT): replace {wer_result_on_test} with the WER error rate you achieved on the common_voice test set. It should be in the format XX.XX (don't add the % sign here). **Please** remember to fill out this value after you evaluated your model, so that your model appears on the leaderboard. If you fill out this model card before evaluating your model, please remember to edit the model card afterward to fill in your value
27
  ---
28
 
29
  # Wav2Vec2-Large-XLSR-53-ca
@@ -51,15 +51,15 @@ resampler = torchaudio.transforms.Resample(48_000, 16_000)
51
  # Preprocessing the datasets.
52
  # We need to read the aduio files as arrays
53
  def speech_file_to_array_fn(batch):
54
- \tspeech_array, sampling_rate = torchaudio.load(batch["path"])
55
- \tbatch["speech"] = resampler(speech_array).squeeze().numpy()
56
- \treturn batch
57
 
58
  test_dataset = test_dataset.map(speech_file_to_array_fn)
59
  inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
60
 
61
  with torch.no_grad():
62
- \tlogits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
63
 
64
  predicted_ids = torch.argmax(logits, dim=-1)
65
 
@@ -70,8 +70,7 @@ print("Reference:", test_dataset["sentence"][:2])
70
 
71
  ## Evaluation
72
 
73
- The model can be evaluated as follows on the {language} test data of Common Voice. # TODO: replace #TODO: replace language with your {language}, *e.g.* French
74
-
75
 
76
  ```python
77
  import torch
@@ -134,10 +133,10 @@ def chunked_wer(targets, predictions, chunk_size=None):
134
  print("WER: {:2f}".format(100 * chunked_wer(result["sentence"], result["pred_strings"], chunk_size=4000)))
135
  ```
136
 
137
- **Test Result**: 8.93 %
138
 
139
  ## Training
140
 
141
- The Common Voice `train`, `validation` datasets were used for training. At the second epoch training was halted due to a memory issue, and was continued with lower batch size, but acc. gradient steps were scaled to keep it at 32 batch size during all training.
142
 
143
  The script used for training can be found [here](https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_common_voice.py). Slight modifications were done in order to speed up the ordering by length during training, which can be found [here](https://discuss.huggingface.co/t/spanish-asr-fine-tuning-wav2vec2/4586/6). Another version trained for catalan can be found [here](https://huggingface.co/ccoreilly/wav2vec2-large-xlsr-catala), which may be better than this one since it was trained with extra data and for longer time. Whoever, since it used different splits that include part of the Common Voice test set, this version can be used to get a baseline on the Common Voice dataset.
23
  metrics:
24
  - name: Test WER
25
  type: wer
26
+ value: 8.11
27
  ---
28
 
29
  # Wav2Vec2-Large-XLSR-53-ca
51
  # Preprocessing the datasets.
52
  # We need to read the aduio files as arrays
53
  def speech_file_to_array_fn(batch):
54
+ speech_array, sampling_rate = torchaudio.load(batch["path"])
55
+ batch["speech"] = resampler(speech_array).squeeze().numpy()
56
+ return batch
57
 
58
  test_dataset = test_dataset.map(speech_file_to_array_fn)
59
  inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
60
 
61
  with torch.no_grad():
62
+ logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
63
 
64
  predicted_ids = torch.argmax(logits, dim=-1)
65
 
70
 
71
  ## Evaluation
72
 
73
+ The model can be evaluated as follows on the catalan test data of Common Voice.
 
74
 
75
  ```python
76
  import torch
133
  print("WER: {:2f}".format(100 * chunked_wer(result["sentence"], result["pred_strings"], chunk_size=4000)))
134
  ```
135
 
136
+ **Test Result**: 8.11 %
137
 
138
  ## Training
139
 
140
+ The Common Voice `train`, `validation` datasets were used for training. At the second epoch training was halted due to a memory issue, and was continued with lower batch size, but acc. gradient steps were scaled to keep it at 32 batch size during all training. Then the model was trained for an additional 10 epochs where half the male samples were pitched up.
141
 
142
  The script used for training can be found [here](https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_common_voice.py). Slight modifications were done in order to speed up the ordering by length during training, which can be found [here](https://discuss.huggingface.co/t/spanish-asr-fine-tuning-wav2vec2/4586/6). Another version trained for catalan can be found [here](https://huggingface.co/ccoreilly/wav2vec2-large-xlsr-catala), which may be better than this one since it was trained with extra data and for longer time. Whoever, since it used different splits that include part of the Common Voice test set, this version can be used to get a baseline on the Common Voice dataset.
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:33f4126e70a408874e4e09d62a4486fc877b8c94deb344a34c0b1d70f796d428
3
  size 1262282327
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6e454d81120b942ef1ca0f9df4e79e448100112871615441691673a737c0e7c0
3
  size 1262282327