MaximilianChen committed on
Commit
0b2fb31
1 Parent(s): 9741850

Upload README.md

Files changed (1)
  1. README.md +151 -0
README.md ADDED
@@ -0,0 +1,151 @@
---
language: ca
datasets:
- CommonVoice
- ParlamentParla
---

# Casper: a Catalan Automatic Speech Recognition Model

## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Limitations](#limitations)
- [Training](#training)
- [Evaluation](#evaluation)
- [How to Get Started With the Model](#how-to-get-started-with-the-model)

## Model Details
- **Model Description:**
Casper is a state-of-the-art automatic speech recognition (ASR) model for Catalan, built by fine-tuning [whisper-small](https://huggingface.co/openai/whisper-small) on Catalan datasets.
- **Developed by:** Yongjian Chen, Xavier Bonet-Casals, Mireia Farrús.
- **Model Type:** Transformer-based encoder-decoder model
- **Language(s):** Catalan
- **Parent Model:** See [whisper-small](https://huggingface.co/openai/whisper-small) for more information about the Whisper model.

## Uses

#### Direct Use

This model can be used to transcribe Catalan speech.

## Limitations
- This model does not produce punctuation or casing.
- This model only supports audio samples of up to 30 seconds, a limit inherited from the parent model. To transcribe longer audio, the samples must first be split into chunks of at most 30 seconds by an additional chunking step.

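The chunking step mentioned in the limitations could look roughly like the sketch below. It is a minimal illustration, not the developers' method: the function name `split_into_chunks` and its parameters are hypothetical, and it cuts at fixed boundaries, so a word that straddles a boundary may be split across two chunks.

```python
from typing import List, Sequence


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int = 16000,
    max_seconds: float = 30.0,
) -> List[Sequence[float]]:
    """Split a mono waveform into consecutive chunks of at most `max_seconds`.

    Hypothetical helper: fixed-size, non-overlapping windows. The last chunk
    may be shorter than the others.
    """
    chunk_size = int(sample_rate * max_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * max_seconds must be at least 1 sample")
    return [
        samples[start:start + chunk_size]
        for start in range(0, len(samples), chunk_size)
    ]


if __name__ == "__main__":
    # 70 seconds of silence at 16 kHz, split into 30 s + 30 s + 10 s
    waveform = [0.0] * (70 * 16000)
    for chunk in split_into_chunks(waveform):
        print(len(chunk) / 16000, "seconds")
```

Each chunk can then be transcribed separately and the texts joined. The `transformers` automatic-speech-recognition pipeline also accepts a `chunk_length_s` argument for long audio, which may be more convenient than chunking by hand.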
## Training
#### Training Data
This model was fine-tuned on the Common Voice and ParlamentParla Catalan datasets.

The [Common Voice](https://labs.mozilla.org/projects/common-voice/) project seeks to provide a platform where everyone can contribute their own voice to an open source multilingual data bank. The model developers used the [mozilla-foundation/common_voice_11_0 ca](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) version for training.

The [ParlamentParla](https://huggingface.co/datasets/projecte-aina/parlament_parla) speech corpus contains audio segments extracted from recordings of the Catalan Parliament plenary sessions held between 2007/07/11 and 2018/07/1.

#### Training Procedure

| Step | Training Loss | Epoch | Validation Loss | Validation WER |
|:------:|:-------------:|:-----:|:---------------:|:--------------:|
| 10000 | 0.11 | 0.43 | 0.14 | 6.49% |
| 20000 | 0.09 | 0.86 | 0.13 | 6.28% |
| 30000 | 0.05 | 1.28 | 0.13 | 5.91% |
| 40000 | 0.06 | 1.71 | 0.12 | 5.90% |
| 50000 | 0.03 | 2.14 | 0.13 | 5.70% |
| 60000 | 0.03 | 2.57 | 0.13 | 5.82% |
| 70000 | 0.03 | 3.00 | 0.13 | 5.56% |
| 80000 | 0.01 | 3.43 | 0.14 | 5.64% |
| 90000 | 0.01 | 3.85 | 0.14 | 5.59% |
| 100000 | 0.01 | 4.28 | 0.14 | 5.50% |
| 110000 | 0.01 | 4.71 | 0.14 | 5.42% |
| 120000 | 0.01 | 5.14 | 0.15 | 5.83% |
| 130000 | 0.01 | 5.57 | 0.15 | 5.65% |
| 140000 | 0.01 | 6.00 | 0.15 | 5.54% |
| 150000 | 0.003 | 6.42 | 0.15 | 5.56% |

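One way to read the training table is to ask which checkpoint scores best on each validation criterion. The sketch below copies the step, validation loss, and validation WER columns from the table; the variable names are illustrative, and the README does not state which checkpoint was released as Casper.

```python
# (step, validation_loss, validation_wer_percent), copied from the training table
checkpoints = [
    (10000, 0.14, 6.49),
    (20000, 0.13, 6.28),
    (30000, 0.13, 5.91),
    (40000, 0.12, 5.90),
    (50000, 0.13, 5.70),
    (60000, 0.13, 5.82),
    (70000, 0.13, 5.56),
    (80000, 0.14, 5.64),
    (90000, 0.14, 5.59),
    (100000, 0.14, 5.50),
    (110000, 0.14, 5.42),
    (120000, 0.15, 5.83),
    (130000, 0.15, 5.65),
    (140000, 0.15, 5.54),
    (150000, 0.15, 5.56),
]

# Checkpoint with the lowest validation WER
best_by_wer = min(checkpoints, key=lambda row: row[2])
# Checkpoint with the lowest validation loss
best_by_loss = min(checkpoints, key=lambda row: row[1])

print("lowest validation WER :", best_by_wer)
print("lowest validation loss:", best_by_loss)
```

With these numbers the lowest validation WER (5.42%) is at step 110000, while validation loss is lowest at step 40000 (0.12) and drifts upward afterwards, so the two criteria would select different checkpoints.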
## Evaluation

#### Evaluation Data
The evaluation dataset was created by the developer Xavier Bonet-Casals from University of Barcelona webinars and is mostly domain-specific, covering topics in linguistics and language policy.

The evaluation set is distributed across register, accent, and gender as follows:

| Specification | Category | Count | Percentage |
| ------------- | -------- | ----- | ---------- |
| Register | Formal | 88 | 57.14% |
| | Informal | 66 | 42.86% |
| Accent | Central | 33 | 21.43% |
| | Balearic | 44 | 28.57% |
| | Valencian| 44 | 28.57% |
| | Western | 33 | 21.43% |
| Gender | Male | 66 | 42.86% |
| | Female | 88 | 57.14% |

#### Evaluation Metrics
The model developers evaluated Casper using two metrics: Word Error Rate (WER) for transcription, and BLEU for downstream machine translation (MT) from Catalan to Spanish and to English.

Casper outperforms the zero-shot pre-trained whisper-small model in every category of the evaluation dataset, reducing the overall WER from 40.12% to 19.50%. The size of the gain varies widely, from a drop of 77.28% to 16.15% for Valencian down to a drop of 29.76% to 29.68% for Balearic. These improvements carry over to the downstream MT task, where BLEU rises for both target languages.

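For readers unfamiliar with the metric, WER is conventionally computed as (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, and N is the number of reference words. The sketch below is a minimal illustration that assumes whitespace tokenization and no text normalization; the README does not say which tool or normalization the developers used, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words
    print(word_error_rate("què té d especial", "què té especial"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.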
##### WER

| Specification | Category | Whisper-small | Fine-tuned Whisper-small |
|:-------------:|:--------:|:-------------:|:------------------------:|
| Register | Formal | 31.21% | 17.71% |
| | Informal | 53.10% | 22.10% |
| Accent | Central | 16.38% | 14.39% |
| | Balearic | 29.76% | 29.68% |
| | Valencian| 77.28% | 16.15% |
| | Western | 21.10% | 17.48% |
| Gender | Male | 57.49% | 15.14% |
| | Female | 24.60% | 23.39% |
| Total | All | 40.12% | 19.50% |

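To compare categories on an equal footing, the absolute WER figures in the table can be turned into relative reductions. The sketch below copies the values from the WER table; the dictionary layout and the `relative_reduction` helper are illustrative and not part of the original evaluation.

```python
# category -> (Whisper-small WER %, fine-tuned Whisper-small WER %), from the WER table
wer = {
    "Formal": (31.21, 17.71),
    "Informal": (53.10, 22.10),
    "Central": (16.38, 14.39),
    "Balearic": (29.76, 29.68),
    "Valencian": (77.28, 16.15),
    "Western": (21.10, 17.48),
    "Male": (57.49, 15.14),
    "Female": (24.60, 23.39),
    "Total": (40.12, 19.50),
}


def relative_reduction(baseline: float, finetuned: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    return (baseline - finetuned) / baseline * 100


for category, (baseline, finetuned) in wer.items():
    print(f"{category:<10} {relative_reduction(baseline, finetuned):5.1f}%")
```

With the table's figures, the overall relative reduction is about 51%, ranging from about 0.3% for Balearic to about 79% for Valencian.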
##### BLEU

| Language | Target | Correct Transcript | Whisper-small | Fine-tuned Whisper-small |
|:--------:|:-------------------------:|:------------------:|:-------------:|:------------------------:|
| Spanish | Human Translation | 83.5498 | 54.0836 | 63.7367 |
| | Machine-assisted Translation | 84.219 | 54.5868 | 63.9436 |
| English | Human Translation | 32.7 | 29.5 | 30.8 |
| | Machine-assisted Translation | 33.5 | 30.3 | 31.6 |

## How to Get Started With the Model

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load Casper and its processor
processor = WhisperProcessor.from_pretrained("maximilianchen/casper")
model = WhisperForConditionalGeneration.from_pretrained("maximilianchen/casper")

# Load an audio sample
## The model expects 16 kHz mono audio of at most 30 seconds
filename = "sample.wav"  # placeholder: replace with the path to your own audio file
sa, sr = torchaudio.load(filename)
sa = sa.mean(dim=0)  # average the channels to get a 1-D mono waveform
if sr != 16000:
    # Resample here in case the file was not already resampled to 16 kHz
    sa = torchaudio.functional.resample(sa, orig_freq=sr, new_freq=16000)
    sr = 16000

# Convert input audio sample into features
inputs = processor(sa.numpy(), sampling_rate=sr, return_tensors="pt").input_features

# Generate token ids
with torch.no_grad():
    generated_ids = model.generate(inputs)

# Decode token ids to text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription)
# Example output from the original README:
# ['miraré de destacar molt breument què té d específic i essencial la coordinació aquesta estructura aparentment trivial on diem que coordinem dues categories aparentment iguals què té d especial què té de específic perquè és complicat si té raó o eies per això es parla d equivalència sintàctica i semàntica i llavors el repte és veure exactament què què té de sintàctica què té de semàntica']
```