Commit 026bd5f by Harveenchadha (1 parent: bf80ac0)
Create README.md

README.md ADDED
## Overview

We present CLSRIL-23 (Cross Lingual Speech Representations on Indic Languages), a self-supervised audio pre-trained model that learns cross-lingual speech representations from raw audio across **23 Indic languages**. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across all languages.

[Arxiv Link](https://arxiv.org/pdf/2107.07402.pdf)

The [Original Repo](https://github.com/Open-Speech-EkStep/vakyansh-models) contains the models in fairseq format.
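
To make the contrastive task from the overview concrete, here is a minimal pure-Python sketch of a wav2vec 2.0 style contrastive loss for a single masked time step. It is an illustration under stated assumptions, not the training code from the linked repositories: the function names, the toy 2-dimensional vectors, and the temperature value are all chosen here for the example, and masking, quantization, and the other loss terms of the full objective are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among all candidates.

    context:     context-network output at a masked time step (hypothetical input)
    target:      the true quantized latent for that step
    distractors: quantized latents sampled from other masked steps
    temperature: illustrative value, treated here as an assumption
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Toy example: the context vector points the same way as the true target.
context = [1.0, 0.0]
target = [1.0, 0.0]
distractors = [[0.0, 1.0], [-1.0, 0.0]]

aligned_loss = contrastive_loss(context, target, distractors)
# Swap the roles: the "target" is now orthogonal and a distractor matches.
misaligned_loss = contrastive_loss(context, [0.0, 1.0], [[1.0, 0.0]])
```

In this toy setup `aligned_loss` is close to zero, while `misaligned_loss` is large, which is the behaviour the contrastive task rewards: the context representation should be most similar to the true latent and dissimilar to the distractors.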
## Languages in the pretraining dataset

| Language  | Data (hours) |
|-----------|--------------|
| Assamese  | 254.9        |
| Bengali   | 331.3        |
| Bodo      | 26.9         |
| Dogri     | 17.1         |
| English   | 819.7        |
| Gujarati  | 336.7        |
| Hindi     | 4563.7       |
| Kannada   | 451.8        |
| Kashmiri  | 67.8         |
| Konkani   | 36.8         |
| Maithili  | 113.8        |
| Malayalam | 297.7        |
| Manipuri  | 171.9        |
| Marathi   | 458.2        |
| Nepali    | 31.6         |
| Odia      | 131.4        |
| Punjabi   | 486.05       |
| Sanskrit  | 58.8         |
| Santali   | 6.56         |
| Sindhi    | 16           |
| Tamil     | 542.6        |
| Telugu    | 302.8        |
| Urdu      | 259.68       |
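
The table does not state totals, so the short sketch below derives them from the per-language figures. The dictionary simply copies the numbers from the table, and the variable names are chosen for this example only.

```python
# Hours of pretraining audio per language, copied from the table above.
HOURS = {
    "Assamese": 254.9, "Bengali": 331.3, "Bodo": 26.9, "Dogri": 17.1,
    "English": 819.7, "Gujarati": 336.7, "Hindi": 4563.7, "Kannada": 451.8,
    "Kashmiri": 67.8, "Konkani": 36.8, "Maithili": 113.8, "Malayalam": 297.7,
    "Manipuri": 171.9, "Marathi": 458.2, "Nepali": 31.6, "Odia": 131.4,
    "Punjabi": 486.05, "Sanskrit": 58.8, "Santali": 6.56, "Sindhi": 16,
    "Tamil": 542.6, "Telugu": 302.8, "Urdu": 259.68,
}

total_hours = sum(HOURS.values())
hindi_share = HOURS["Hindi"] / total_hours

print(f"{len(HOURS)} languages, {total_hours:.2f} hours in total")
# 23 languages, 9783.79 hours in total
print(f"Hindi share: {hindi_share:.1%}")
# Hindi share: 46.6%
```

The data is heavily skewed: Hindi alone accounts for a little under half of the audio, while several languages contribute fewer than 40 hours each.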

## Repo for training

The [Experimentation](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation) platform used for training is built on top of fairseq.