yasserTII commited on
Commit
bd3afe7
1 Parent(s): c769a3e

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +28 -2
README.md CHANGED
@@ -8,7 +8,33 @@ language:
8
  - ar
9
  - cz
10
  inference: false
11
- license: unknown
12
  metrics:
13
  - wer
14
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8
  - ar
9
  - cz
10
  inference: false
11
+ license: mit
12
  metrics:
13
  - wer
14
+ ---
15
+
16
+
17
+ # 🚀 ViSPer
18
+
19
+ ViSPer is a model for audio visual speech recognition (VSR/AVSR). Trained on 5500 hours of labelled video data.
20
+
21
+ # Training details:
22
+
23
+ We use our proposed dataset to train a encoder-decoder model in a fully-supervised manner under a multi-lingual setting. While the encoder size is 12 layers, the decoder size is 6 layers. The hidden size, MLP and number of heads are set to 768, 3072 and 12, respectively. The unigram tokenizers are learned for all languages combined and have a vocabulary size of 21k.
24
+ The models are trained for 150 epochs on 64 Nvidia A100 GPUs (40GB) using AdamW optimizer with max LR of 1e-3 and a weight decay of 0.03. A cosine scheduler with a warm-up of 5 epochs is used for training. The maximum batch size per GPU is set to 2400 video frames.
25
+
26
+ # Performance:
27
+
28
+ We provide the results of the model on our proposed benchmarks in this table:
29
+
30
+ | Language | VSR (WER/CER) | AVSR (WER/CER) |
31
+ |----------|---------------|----------------|
32
+ | French | 29.8 | 5.7 |
33
+ | Spanish | 39.4 | 4.4 |
34
+ | Arabic | 47.8 | 8.4 |
35
+ | Chinese | 51.3 (CER) | 15.4 (CER) |
36
+ | English | 49.1 | 8.1 |
37
+
38
+ # Broader impact:
39
+
40
+ In essence, while we hope that ViSPer will open the doors for new research questions and opportunities, and should only be used for this purpose. There are also potential dual use concerns that come with releasing ViSPer (dataset and models), trained on a substantial corpus of multilingual video data. While the technology behind ViSPer offers significant advances in multiomodal speech recognition, its deployment raises important considerations in terms of societal impacts and potential misuse.