---
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: whisper-large-v2-arabic-5k-steps
  results: []
datasets:
- mozilla-foundation/common_voice_11_0
language:
- ar
---

# whisper-large-v2-arabic-5k-steps

This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) on the Arabic Common Voice (v11) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.3434
- WER: 0.4239

## Model description

This model was fine-tuned for 5,000 steps for research purposes, so its transcriptions may not be fully satisfactory for end users.

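To try the model quickly, a minimal sketch using the `transformers` automatic-speech-recognition pipeline might look like the following (the audio path and `device` value are placeholders; a more detailed example appears in the Transcription section below):

```python
from transformers import pipeline

# Minimal sketch: "audio.mp3" is a placeholder path and device=0 assumes a GPU (use -1 for CPU).
asr = pipeline(
    "automatic-speech-recognition",
    model="clu-ling/whisper-large-v2-arabic-5k-steps",
    device=0,
)
# Force Arabic transcription, mirroring the forced_decoder_ids used in the examples below.
asr.model.config.forced_decoder_ids = asr.tokenizer.get_decoder_prompt_ids(language="ar", task="transcribe")
print(asr("audio.mp3")["text"])
```
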
## Training and evaluation data

- Training data: Common Voice (v11) train split
- Validation data: Common Voice (v11) validation split
- Test data: Common Voice (v11) test split

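For reference, a minimal sketch of loading these splits with the `datasets` library (Common Voice 11 is a gated dataset, so you may need to accept its terms on the Hub and authenticate first):

```python
from datasets import load_dataset

# Sketch only: accepting the dataset terms on the Hub and logging in
# (e.g. `huggingface-cli login`) may be required before these calls succeed.
train = load_dataset("mozilla-foundation/common_voice_11_0", "ar", split="train")
valid = load_dataset("mozilla-foundation/common_voice_11_0", "ar", split="validation")
test = load_dataset("mozilla-foundation/common_voice_11_0", "ar", split="test")
print(len(train), len(valid), len(test))
```
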
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 50
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 5000
- mixed_precision_training: Native AMP

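The original training script is not included in this card; as a rough guide, the hyperparameters above correspond to a `Seq2SeqTrainingArguments` configuration along these lines (the output directory and evaluation cadence are assumptions, not the actual setup):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: output_dir is a placeholder, and the eval cadence is inferred
# from the 1000-step evaluation intervals in the results table below.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-arabic-5k-steps",
    learning_rate=1e-5,
    per_device_train_batch_size=50,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=5000,
    fp16=True,  # mixed precision (Native AMP)
    evaluation_strategy="steps",
    eval_steps=1000,
    predict_with_generate=True,
)
```
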
### Training results

| Training Loss | Epoch | Step | Validation Loss | WER    |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.1638        | 1.78  | 1000 | 0.2295          | 0.4410 |
| 0.0587        | 3.57  | 2000 | 0.2337          | 0.4272 |
| 0.0125        | 5.35  | 3000 | 0.2745          | 0.4208 |
| 0.004         | 7.13  | 4000 | 0.3124          | 0.4252 |
| 0.0016        | 8.91  | 5000 | 0.3434          | 0.4239 |

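Note that the WER values in this table (and in the evaluation bullets above) are fractions of words in error, while the evaluation script below prints WER as a percentage. As a reminder, word error rate is defined as

$$ \mathrm{WER} = \frac{S + D + I}{N} $$

where S, D, and I are the word-level substitutions, deletions, and insertions against the reference, and N is the number of words in the reference.
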
### Transcription

The following example transcribes a sample from the Arabic Common Voice (v11) validation split:

```python
from datasets import load_dataset, Audio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the model and processor
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-large-v2-arabic-5k-steps")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-large-v2-arabic-5k-steps").to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")

# load the dataset (streaming) and resample the audio to 16 kHz
commonvoice_eval = load_dataset("mozilla-foundation/common_voice_11_0", "ar", split="validation", streaming=True)
commonvoice_eval = commonvoice_eval.cast_column("audio", Audio(sampling_rate=16000))
sample = next(iter(commonvoice_eval))["audio"]

# extract input features and generate token ids
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to(device), forced_decoder_ids=forced_decoder_ids)

# decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print("Transcription:", transcription)
# Example output -> Transcription: عمي هو أخو أبي. ("My uncle is my father's brother.")
```

### Evaluation

The following script evaluates this model on the `mozilla-foundation/common_voice_11_0` test split:

```python
import pyarabic.araby as araby
from datasets import load_dataset, Audio
import evaluate
import torch
import re
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# metric
wer_metric = evaluate.load("wer")

# model and processor
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-large-v2-arabic-5k-steps")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-large-v2-arabic-5k-steps").to(device)

# dataset, with audio resampled to 16 kHz
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "ar", split="test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# for debugging: keep only a tiny shard of the test split
#dataset = dataset.shard(num_shards=10000, index=0)
#print(dataset)

def clean_text(text):
    """Normalizes a transcript: strips punctuation, links, brackets, extra whitespace, and diacritics."""
    text = re.sub(r'[\,\?\.\!\-\;\:\"\“\%\٪\‘\”\�\«\»\،\.\:\؟\؛\*\>\<]', '', text) + " "  # special characters
    text = re.sub(r'http\S+', '', text) + " "            # links
    text = re.sub(r'[\[\]\(\)\-\/\{\}]', '', text) + " " # brackets
    text = re.sub(r'\s+', ' ', text) + " "               # extra whitespace
    text = araby.strip_diacritics(text)                  # remove diacritics
    return text.strip()

def normalize(batch):
    """Normalizes the gold (reference) transcript."""
    batch["gold_text"] = clean_text(batch['sentence'])
    return batch

def map_wer(batch):
    """Transcribes one example and normalizes the prediction."""
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")
    inputs = processor(batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"], return_tensors="pt").input_features
    with torch.no_grad():
        generated_ids = model.generate(inputs=inputs.to(device), forced_decoder_ids=forced_decoder_ids)
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    batch["predicted_text"] = clean_text(transcription)
    return batch

# normalize the gold text
processed_dataset = dataset.map(normalize)
# get predictions
predicted = processed_dataset.map(map_wer)

# word error rate (reported as a percentage)
wer = wer_metric.compute(references=predicted['gold_text'], predictions=predicted['predicted_text'])
wer = round(100 * wer, 2)
print("WER:", wer)
```

### Framework versions

- Transformers 4.26.0.dev0
- PyTorch 1.13.1
- Datasets 2.8.1.dev0
- Tokenizers 0.13.2