SanathNarayan committed
Commit 3ce76ad
1 Parent(s): 1a7d37d

Update README.md

Files changed (1)
  1. README.md +10 -5
README.md CHANGED
@@ -24,7 +24,6 @@ We use our proposed dataset to train an encoder-decoder model in a fully-supervis
  The models are trained for 150 epochs on 64 Nvidia A100 GPUs (40GB) using the AdamW optimizer with a max LR of 1e-3 and a weight decay of 0.1. A cosine scheduler with a warm-up of 5 epochs is used for training. The maximum batch size per GPU is set to 1800 video frames.
 
  # Performance:
-
  We provide the results of the model on our proposed benchmarks in this table:
 
  | Language | VSR (WER/CER) | AVSR (WER/CER) |
@@ -36,14 +35,20 @@ We provide the results of the model on our proposed benchmarks in this table:
  | English | 49.1 | 8.1 |
 
  # Broader impact:
- In essence, while we hope that ViSPer will open the doors for new research questions and opportunities, and should only be used for this purpose. There are also potential dual use concerns that come with releasing ViSPer (dataset and models), trained on a substantial corpus of multilingual video data. While the technology behind ViSPer offers significant advances in multimodal speech recognition, it should only be used for research purposes.
-
- ## ViSpeR paper coming soon
+ In essence, while we hope that ViSpeR will open the door to new research questions and opportunities, it should only be used for this purpose. There are also potential dual-use concerns that come with releasing ViSpeR (dataset and models), trained on a substantial corpus of multilingual video data. While the technology behind ViSpeR offers significant advances in multimodal speech recognition, it should only be used for research purposes.
 
+ ## ViSpeR paper
+ ```bash
+ @article{narayan2024visper,
+ title={ViSpeR: Multilingual Audio-Visual Speech Recognition},
+ author={Narayan, Sanath and Djilali, Yasser Abdelaziz Dahou and Singh, Ankit and Bihan, Eustache Le and Hacid, Hakim},
+ journal={arXiv preprint arXiv:2406.00038},
+ year={2024}
+ }
+ ```
 
  ## Check our VSR related works
  ```bash
-
  @inproceedings{djilali2023lip2vec,
  title={Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Boussaid, Haithem and Almazrouei, Ebtessam and Debbah, Merouane},
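For readers who want a concrete picture of the training recipe stated in the README (AdamW, max LR of 1e-3, weight decay 0.1, cosine schedule with a 5-epoch warm-up over 150 epochs), the sketch below shows one way to set that up in plain PyTorch. It is not the released training code: `model`, `steps_per_epoch`, and the linear warm-up shape are placeholder assumptions.

```python
# Minimal sketch (assumed, not the released code) of the described recipe:
# AdamW, max LR 1e-3, weight decay 0.1, cosine decay with 5 warm-up epochs
# over 150 training epochs.
import math
import torch

model = torch.nn.Linear(512, 512)   # placeholder for the encoder-decoder model
steps_per_epoch = 1000               # placeholder; depends on dataset and sharding
total_epochs, warmup_epochs = 150, 5
total_steps = total_epochs * steps_per_epoch
warmup_steps = warmup_epochs * steps_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def lr_lambda(step: int) -> float:
    # Linear warm-up to the max LR, then cosine decay towards zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per-step usage inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```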