malmarz committed on
Commit
9db44a0
1 Parent(s): c57e109

update model card README.md

Files changed (1)
  1. README.md +48 -87
README.md CHANGED
@@ -1,104 +1,65 @@
- # whisper_sprint

- ## Training

- ```bash
- git clone https://github.com/ARBML/whisper_sprint
- cd whisper_sprint
- ```

- Then set up the environment:

- ```bash
- bash setup_env.sh
- ```

- Then set up the libraries; this installs transformers and the other dependencies, and creates a directory on the Hub for training the model:

- ```bash
- bash setup_libs.sh HF_USER_NAME MODEL_NAME
- ```

- After that, you can run training with:

- ```bash
- cd MODEL_NAME
- bash run_mgb2.sh
- ```

- You can also run with DeepSpeed, which allows running whisper-large-v2 with batch size 32 on an A100:

- ```bash
- bash run_mgb2_deepspeed.sh
- ```
- ## Evaluation

- ### Evaluation on Fleurs

- ```bash
- bash run_eval_fleurs.sh MODEL_NAME
- ```

- ### Evaluation on Common Voice 11

- ```bash
- bash run_eval_cv_11.sh MODEL_NAME
- ```

- To evaluate a model by its full Hub path, evaluate on Common Voice 11 with:

- ```bash
- bash run_eval_cv_11.sh HF_USER_NAME/MODEL_NAME
- ```

- or evaluate on Fleurs with:

- ```bash
- bash run_eval_fleurs.sh HF_USER_NAME/MODEL_NAME
- ```
- ## Preparing the MGB2 data

- While the MGB2 dataset contains richly transcribed speech, the wav files were too lengthy to be used to train the Whisper model. Therefore, we had to split the wav files while still maintaining the correct correspondence with the transcribed text.

- MGB2 provides an XML file for every wav file, containing the transcribed sentences and the start and end time of each sentence in the recording. Using `split_xml_mgb2.py`, we start from the XML file and split the lengthy wav files into smaller ones that are shorter than 30 seconds, as required to fine-tune Whisper. This operation produced over 370K sentences with their corresponding wav files.
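The splitting step described above can be sketched with the standard library alone. This is an illustrative sketch, not the actual `split_xml_mgb2.py`: the `<segment>` element name and its `start`/`end` attributes are assumptions about the XML layout, and over-long segments are simply skipped here.

```python
import wave
import xml.etree.ElementTree as ET

def split_wav_by_segments(wav_path, xml_path, out_prefix):
    """Split one long recording into per-sentence clips.

    Assumes each <segment> element carries start/end times in seconds
    (the real MGB2 element and attribute names may differ).
    """
    tree = ET.parse(xml_path)
    clips = []
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, seg in enumerate(tree.iter("segment")):
            start = float(seg.get("start"))
            end = float(seg.get("end"))
            if end - start > 30.0:   # Whisper fine-tuning needs <= 30 s clips
                continue             # skip (or further split) over-long segments
            src.setpos(int(start * rate))
            frames = src.readframes(int((end - start) * rate))
            out_path = f"{out_prefix}_{i:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            clips.append((out_path, seg.text))  # clip plus its transcript
    return clips
```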
- ## Hosting on HuggingFace (Privately)

- To host MGB2 on HF, at least three things need to happen:

- 1. Create the dataset repository on HF. This was created privately at arbml/mgb2_speech.
- 2. The data must be hosted somewhere or uploaded to the HF repo.
- 3. An HF loading script must be written so the data can be integrated into the HF Hub.
- ### Uploading the data

- The dataset was >100 GB in size. HF uses git lfs to host large files, but git lfs has a limit of 5 GB per file. Uploading over 370K individual files was also not feasible and caused issues with git.
- Therefore, the solution was to archive groups of wav files together into sequentially numbered archive files, such that no archive file is bigger than 5 GB. To achieve that, the wav files were grouped based on the first two characters of the file name. The naming scheme appears to use hexadecimal characters, so each character is 0 to 9 or A to F. The files were grouped as follows:

- | First 2 Letters | Archive Number |
- |:-:|:-:|
- | 00-05 | 0 |
- | 06-09 | 1 |
- | 0A-0F | 2 |
- | 10-15 | 3 |
- | 16-19 | 4 |
- | 1A-1F | 5 |
- | ... | ... |
- | F0-F5 | 45 |
- | F6-F9 | 46 |
- | FA-FF | 47 |

- Only the training data was split using this scheme; the test and validation data were smaller than 5 GB when archived.
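The grouping rule in the table amounts to three buckets per leading hex digit (second character in 0-5, 6-9, or A-F), so the archive index is `3 * first_digit + bucket`. A minimal sketch of that rule, not the actual upload script:

```python
def archive_number(filename: str) -> int:
    """Map a wav file to its archive index using the first two hex
    characters of its name: three buckets per leading digit
    (second char in 0-5, 6-9, or A-F)."""
    first = int(filename[0], 16)          # leading hex digit, 0..15
    second = filename[1].upper()
    if second in "012345":
        bucket = 0
    elif second in "6789":
        bucket = 1
    else:                                 # A-F
        bucket = 2
    return first * 3 + bucket
```

For example, a file named `F6...` lands in archive 46, matching the table above.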
- ### HF Data Loading Script

- The loading script determines the features of the data based on the split and the selected configuration. We had test, dev, and train splits with a single language configuration. Through its `_generate_examples` function, the script is used by HF to correctly pair the associated transcripts and wav files. The function works as follows:

- 1. Go through all the entries in the archive containing the text transcripts and create a map where the (hex-encoded) file name is the key and the transcript is the value.
- 2. Iterate through all the wav files in all the archives; for every wav file, look up the corresponding transcript in the map from the previous step (using the file name) and yield the wav file, the transcript, and the path to the wav file.
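The two steps above can be sketched as a simplified generator. This is not the repository's actual loading script: the real one streams tar archives via the `datasets` download manager, and the `"<file_id> <text>"` transcript-line format assumed here is hypothetical.

```python
def generate_examples(transcript_lines, wav_paths):
    """Simplified sketch of the loading script's _generate_examples.

    Step 1: build a map from encoded file name -> transcript.
    Step 2: for each wav file, look up its transcript by file stem
    and yield one example.
    """
    # Step 1: assume each transcript entry looks like "<file_id> <text>"
    transcripts = {}
    for line in transcript_lines:
        file_id, text = line.split(" ", 1)
        transcripts[file_id] = text
    # Step 2: pair every wav file with its transcript via the file name
    for path in wav_paths:
        file_id = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        if file_id in transcripts:
            yield file_id, {"audio": path,
                            "text": transcripts[file_id],
                            "path": path}
```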
 
+ ---
+ license: apache-2.0
+ tags:
+ - generated_from_trainer
+ metrics:
+ - wer
+ model-index:
+ - name: openai/whisper-small
+   results: []
+ ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # openai/whisper-small

+ This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the None dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.4429
+ - Wer: 52.7568

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - learning_rate: 1e-05
+ - train_batch_size: 64
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 500
+ - training_steps: 5000
+ - mixed_precision_training: Native AMP

+ ### Training results

+ | Training Loss | Epoch | Step | Validation Loss | Wer     |
+ |:-------------:|:-----:|:----:|:---------------:|:-------:|
+ | 0.3629        | 1.03  | 1000 | 0.4917          | 53.1291 |
+ | 0.289         | 2.06  | 2000 | 0.4747          | 61.3855 |
+ | 0.2996        | 3.08  | 3000 | 0.4542          | 55.4692 |
+ | 0.2331        | 4.11  | 4000 | 0.4353          | 51.4917 |
+ | 0.1566        | 5.14  | 5000 | 0.4429          | 52.7568 |

+ ### Framework versions

+ - Transformers 4.26.0.dev0
+ - Pytorch 1.13.0+cu117
+ - Datasets 2.7.1.dev0
+ - Tokenizers 0.13.2