Commit 346b014 by vumichien
1 parent: 85cf60e

Create README.md

  1. README.md +23 -0
README.md ADDED
@@ -0,0 +1,23 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ pipeline_tag: audio-visual-to-text
+ datasets:
+ - LRS3
+
+ tags:
+ - Audio Visual to Text
+ ---
+
+ ## Model Description
+
+ These are model weights originally provided by the authors of the paper [Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction](https://arxiv.org/pdf/2201.02184.pdf).
+
+ Video recordings of speech contain correlated audio and visual information, providing a strong signal for learning speech representations from the speaker’s lip
+ movements and the produced sound. Audio-Visual Hidden Unit BERT (AV-HuBERT) is a self-supervised representation learning framework for
+ audio-visual speech that masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT
+ learns powerful audio-visual speech representations that benefit both lip-reading and automatic speech recognition.
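The masked-prediction idea described above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the function names, the toy sizes, and the random stand-in cluster targets are all assumptions made for illustration, and the objective in the paper may also include a term for unmasked frames. The sketch only shows a cross-entropy loss computed over the masked frames, where each frame's target is a discrete hidden-unit (cluster) ID.

```python
import math
import random


def log_softmax(row):
    """Numerically stable log-softmax over one frame's logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only (hypothetical helper).

    logits:  T rows of K scores, one row per frame
    targets: T cluster IDs (the "hidden units") in range(K)
    mask:    T booleans, True where the input frame was masked
    """
    losses = [
        -log_softmax(row)[target]
        for row, target, is_masked in zip(logits, targets, mask)
        if is_masked
    ]
    return sum(losses) / len(losses)


random.seed(0)
T, K = 8, 4  # toy sizes (assumed): number of frames, number of clusters

# Stand-in targets: in a real pipeline these would come from clustering
# features, not from a random generator.
targets = [random.randrange(K) for _ in range(T)]

# Mask a contiguous span of frames, here frames 2, 3 and 4.
mask = [2 <= t < 5 for t in range(T)]

# A model with no information predicts uniformly over the K clusters.
uniform_logits = [[0.0] * K for _ in range(T)]
uniform_loss = masked_prediction_loss(uniform_logits, targets, mask)

# A model that recovers the masked units from context scores them highly.
confident_logits = [
    [5.0 if k == targets[t] else 0.0 for k in range(K)] for t in range(T)
]
confident_loss = masked_prediction_loss(confident_logits, targets, mask)
```

In this toy setup the uniform predictor's loss equals `math.log(K)`, and a predictor that assigns high scores to the correct clusters on the masked frames gets a lower loss, which is the signal that drives representation learning in a masked-prediction setup.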
+
+ ## Datasets
+ The authors trained the model on LRS3, a lip-reading benchmark dataset (433 hours).