yasserTII committed on
Commit 8372129
1 Parent(s): e5b2e12

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -21,7 +21,7 @@ ViSPer is a model for audio visual speech recognition (VSR/AVSR). Trained on 550
 # Training details:
 
 We use our proposed dataset to train an encoder-decoder model in a fully-supervised manner under a multi-lingual setting. The encoder has 12 layers and the decoder has 6. The hidden size, MLP size and number of heads are set to 768, 3072 and 12, respectively. The unigram tokenizers are learned for all languages combined and have a vocabulary size of 21k.
 
-The models are trained for 150 epochs on 64 Nvidia A100 GPUs (40GB) using the AdamW optimizer with a max LR of 1e-3 and a weight decay of 0.03. A cosine scheduler with a warm-up of 5 epochs is used for training. The maximum batch size per GPU is set to 2400 video frames.
+The models are trained for 150 epochs on 64 Nvidia A100 GPUs (40GB) using the AdamW optimizer with a max LR of 1e-3 and a weight decay of 0.1. A cosine scheduler with a warm-up of 5 epochs is used for training. The maximum batch size per GPU is set to 1800 video frames.
 
 # Performance:
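
The hyperparameters touched by this diff, together with the surrounding context, can be collected into a small config sketch. The names below (`TrainConfig`, `lr_at`) are illustrative and not from the ViSPer codebase; the cosine-with-warmup schedule is one common interpretation of "cosine scheduler with a warm-up of 5 epochs".

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Architecture, as described in the README context lines.
    encoder_layers: int = 12
    decoder_layers: int = 6
    hidden_size: int = 768
    mlp_size: int = 3072
    num_heads: int = 12
    vocab_size: int = 21_000
    # Optimization; weight_decay and max_frames_per_gpu are the
    # values updated by this commit (0.03 -> 0.1, 2400 -> 1800).
    epochs: int = 150
    num_gpus: int = 64
    max_lr: float = 1e-3
    weight_decay: float = 0.1
    warmup_epochs: int = 5
    max_frames_per_gpu: int = 1800


def lr_at(epoch: float, cfg: TrainConfig) -> float:
    """Learning rate at a given (fractional) epoch: linear warm-up
    to max_lr, then cosine decay toward zero at the final epoch."""
    if epoch < cfg.warmup_epochs:
        return cfg.max_lr * epoch / cfg.warmup_epochs
    progress = (epoch - cfg.warmup_epochs) / (cfg.epochs - cfg.warmup_epochs)
    return 0.5 * cfg.max_lr * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at(5, TrainConfig())` returns the peak LR of 1e-3 at the end of warm-up, and the rate then decays smoothly over the remaining 145 epochs.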