ibraheemmoosa committed
Commit: fff120b
Parent(s): 51c76d2
Update README.md
README.md CHANGED
@@ -31,7 +31,6 @@ tags:
 - transliteration
 widget:
 - text : 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.'
-- example_title : 'Rabindranath Tagore'
 
 co2_eq_emissions:
   emissions: "28.53 in grams of CO2"
@@ -77,7 +76,17 @@ These are the 14 languages we pretrain this model on:
 - Oriya
 - Panjabi
 - Sanskrit
-- Sinhala
+- Sinhala
+
+## Transliteration
+
+The unique component of this model is that it takes in ISO-15919 transliterated text.
+
+The motivation is as follows. When two languages share vocabulary, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for the model to make the connection. Thus, if we can write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
+
+For many of the scripts currently in use, there are standard transliteration schemes to convert to the Latin script. In particular, for the Indic scripts, the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts into the Latin script.
+
+This model has been trained on ISO-15919 transliterated text of various Indo-Aryan languages.
 
 ## Training procedure
 
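The ISO-15919 preprocessing this new section describes can be done with an off-the-shelf transliteration library before text reaches the model. Below is a minimal sketch assuming the third-party aksharamukha package and its transliterate.process(source, target, text) API; both the package choice and the example sentence are assumptions, not part of this commit.

```python
# Sketch: convert Indic-script text to ISO-15919 before feeding it to the
# model. The `aksharamukha` package and its API are assumptions
# (pip install aksharamukha); verify against that library's documentation.
from aksharamukha import transliterate

# Bengali: "Rabindranath Tagore was a leading Bengali poet."
bengali_text = "রবীন্দ্রনাথ ঠাকুর ছিলেন অগ্রণী বাঙালি কবি।"

# Transliterate from the Bengali script to the ISO-15919 Latin scheme.
latin_text = transliterate.process("Bengali", "ISO", bengali_text)
print(latin_text)  # expected along the lines of: rabīndranātha ṭhākura chilēna agraṇī bāṅāli kabi.
```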
@@ -110,6 +119,8 @@ We evaluated this model on the [IndicGLUE](https://huggingface.co/datasets/indic
 
 ## Intended uses & limitations
 
+This model is pretrained on Indo-Aryan languages. Thus it is intended to be used for downstream tasks on these languages. However, since Dravidian languages such as Malayalam, Telugu, Kannada, etc. share a lot of vocabulary with the Indo-Aryan languages, this model can potentially be used on those languages too (after transliterating the text to ISO-15919).
+
 You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
 be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for
 fine-tuned versions on a task that interests you.
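The masked-language-modeling usage mentioned in this hunk looks roughly as follows with the transformers fill-mask pipeline. The checkpoint id is an assumed placeholder; browse the model hub link above for the actual XLMIndic checkpoints.

```python
# Sketch of the fill-mask usage described above. The checkpoint id is an
# assumed placeholder -- see https://huggingface.co/models?filter=xlmindic
# for the real checkpoints and fine-tuned versions.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="ibraheemmoosa/xlmindic-base-uniscript")

# Input must already be ISO-15919 transliterated, as explained above.
# "Rabindranath Tagore was a leading Bengali [MASK]."
predictions = unmasker("rabīndranātha ṭhākura chilēna agraṇī bāṅāli [MASK].")
for p in predictions:
    print(p["token_str"], p["score"])
```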
@@ -160,7 +171,7 @@ Then you can use this model directly with a pipeline for masked language modelin
 
 ### Limitations and bias
 
-Even though
+Even though we pretrain on a comparatively large multilingual corpus, the model may exhibit harmful gender, ethnic and political bias. If you fine-tune this model on a task where these issues are important, you should take special care when relying on the model to make decisions.
 
 ### BibTeX entry and citation info
 