Harveenchadha committed on
Commit
026bd5f
1 Parent(s): bf80ac0

Create README.md

Files changed (1)
  1. README.md +42 -0
README.md ADDED
@@ -0,0 +1,42 @@
+ ## Overview
+
+ We present CLSRIL-23 (Cross Lingual Speech Representations on Indic Languages), a self-supervised audio pre-trained model that learns cross-lingual
+ speech representations from raw audio across **23 Indic languages**. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task
+ over masked latent speech representations and jointly learns a quantization of the latents shared across all languages.
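
The contrastive task mentioned above can be illustrated with a minimal sketch. This is not the CLSRIL-23 or fairseq training code: the function names, the toy vectors, and the temperature of 0.1 are assumptions chosen for illustration, and the sketch leaves out the quantizer and any auxiliary loss terms. For one masked time step, the model's context vector should be more similar to the true quantized latent than to a set of distractors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style loss for a single masked time step.

    context     -- context vector produced for the masked position
    positive    -- the true (quantized) latent for that position
    distractors -- latents sampled from other positions
    temperature -- illustrative value, an assumption of this sketch
    """
    candidates = [positive] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    # Negative log-probability of picking the true latent (index 0).
    return log_denominator - logits[0]


# Toy example (hypothetical vectors): the context is close to the positive.
context = [0.9, 0.1, 0.0]
positive = [1.0, 0.0, 0.0]
distractors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

loss_matched = contrastive_loss(context, positive, distractors)
# Swapping in a distractor as the "true" latent should give a higher loss.
loss_mismatched = contrastive_loss(context, distractors[0], [positive, distractors[1]])
print(f"matched: {loss_matched:.4f}, mismatched: {loss_mismatched:.4f}")
```

The loss is small when the context vector points toward the true latent and grows when a distractor is a better match, which is the signal that drives the pretraining.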
+
+ [arXiv link](https://arxiv.org/pdf/2107.07402.pdf)
+
+ The [original repo](https://github.com/Open-Speech-EkStep/vakyansh-models) contains the models in fairseq format.
+
+ ## Languages in the pretraining dataset
+
+ | Language  | Data (hours) |
+ |-----------|--------------|
+ | Assamese  | 254.9        |
+ | Bengali   | 331.3        |
+ | Bodo      | 26.9         |
+ | Dogri     | 17.1         |
+ | English   | 819.7        |
+ | Gujarati  | 336.7        |
+ | Hindi     | 4563.7       |
+ | Kannada   | 451.8        |
+ | Kashmiri  | 67.8         |
+ | Konkani   | 36.8         |
+ | Maithili  | 113.8        |
+ | Malayalam | 297.7        |
+ | Manipuri  | 171.9        |
+ | Marathi   | 458.2        |
+ | Nepali    | 31.6         |
+ | Odia      | 131.4        |
+ | Punjabi   | 486.05       |
+ | Sanskrit  | 58.8         |
+ | Santali   | 6.56         |
+ | Sindhi    | 16           |
+ | Tamil     | 542.6        |
+ | Telugu    | 302.8        |
+ | Urdu      | 259.68       |
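
To get a feel for how unevenly the pretraining data is distributed, the table can be summarized with a few lines of Python. This is only a convenience sketch: the dictionary copies the hours from the table above, while the variable names and the choice to print the five largest languages are illustrative assumptions.

```python
# Hours of pretraining audio per language, copied from the table above.
HOURS_BY_LANGUAGE = {
    "Assamese": 254.9, "Bengali": 331.3, "Bodo": 26.9, "Dogri": 17.1,
    "English": 819.7, "Gujarati": 336.7, "Hindi": 4563.7, "Kannada": 451.8,
    "Kashmiri": 67.8, "Konkani": 36.8, "Maithili": 113.8, "Malayalam": 297.7,
    "Manipuri": 171.9, "Marathi": 458.2, "Nepali": 31.6, "Odia": 131.4,
    "Punjabi": 486.05, "Sanskrit": 58.8, "Santali": 6.56, "Sindhi": 16.0,
    "Tamil": 542.6, "Telugu": 302.8, "Urdu": 259.68,
}

total_hours = sum(HOURS_BY_LANGUAGE.values())

# Fraction of the whole dataset contributed by each language.
share_by_language = {
    language: hours / total_hours
    for language, hours in HOURS_BY_LANGUAGE.items()
}

# Languages ordered from most to least data.
ranked = sorted(HOURS_BY_LANGUAGE.items(), key=lambda item: item[1], reverse=True)

print(f"Languages: {len(HOURS_BY_LANGUAGE)}, total hours: {total_hours:.2f}")
for language, hours in ranked[:5]:
    print(f"{language:<10} {hours:>8.2f} h  {share_by_language[language]:6.1%}")
```

The hours sum to roughly 9,784, and Hindi alone contributes a little under half of that, so the lower-resource languages such as Santali, Sindhi, and Dogri each make up well under one percent of the audio.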
+
+ ## Repo for training
+
+ The [Experimentation](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation) platform, built on top of fairseq, was used for training.