ibraheemmoosa committed on
Commit fff120b
1 Parent(s): 51c76d2

Update README.md

Files changed (1)
  1. README.md +14 -3
README.md CHANGED
@@ -31,7 +31,6 @@ tags:
 - transliteration
 widget:
 - text : 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.'
- - example_title : 'Rabindranath Tagore'
 
 co2_eq_emissions:
   emissions: "28.53 in grams of CO2"
@@ -77,7 +76,17 @@ These are the 14 languages we pretrain this model on:
 - Oriya
 - Panjabi
 - Sanskrit
- - Sinhala.
+ - Sinhala
+
+ ## Transliteration
+
+ The unique component of this model is that it takes in ISO-15919 transliterated text.
+
+ The motivation behind this is as follows. When two languages share vocabularies, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for a model to make the connection. Thus, if we can write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
+
+ For many of the scripts currently in use, there are standard transliteration schemes to convert them to the Latin script. In particular, for the Indic scripts the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts into the Latin script.
+
+ This model has been trained on ISO-15919 transliterated text of various Indo-Aryan languages.
 
 ## Training procedure
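To make the transliteration step added above concrete, here is a minimal sketch using the third-party `aksharamukha` package (an assumption on our part; any ISO-15919 transliterator would do):

```python
# Minimal sketch: convert Bengali-script text to ISO-15919 before feeding it
# to the model. Assumes the `aksharamukha` package (pip install aksharamukha).
from aksharamukha import transliterate

bengali_text = "রবীন্দ্রনাথ ঠাকুর ছিলেন অগ্রণী বাঙালি কবি।"
# 'Bengali' is the source script; 'ISO' selects the ISO-15919 scheme.
iso_text = transliterate.process('Bengali', 'ISO', bengali_text)
print(iso_text)  # roughly: rabīndranātha ṭhākura chilēna agraṇī bāṅāli kabi.
```

Swapping in a different source script ('Devanagari', 'Oriya', ...) maps any supported Indic script onto the same Latin representation, which is what lets the model share vocabulary across languages.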
 
@@ -110,6 +119,8 @@ We evaluated this model on the [IndicGLUE](https://huggingface.co/datasets/indic
 
 ## Intended uses & limitations
 
+ This model is pretrained on Indo-Aryan languages. Thus it is intended to be used for downstream tasks on these languages. However, since Dravidian languages such as Malayalam, Telugu and Kannada share a lot of vocabulary with the Indo-Aryan languages, this model can potentially be used on those languages too (after transliterating the text to ISO-15919).
+
 You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
 be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for
 fine-tuned versions on a task that interests you.
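For the masked-language-modeling use just described, a minimal fill-mask sketch follows; the checkpoint id is a placeholder assumption, so substitute this repository's actual id from the model hub:

```python
# Minimal fill-mask sketch. The model id below is a placeholder assumption;
# replace it with this repository's actual checkpoint id.
from transformers import pipeline

unmasker = pipeline('fill-mask', model='ibraheemmoosa/xlmindic-base-uniscript')
# Input must already be ISO-15919 transliterated, as in the widget example.
print(unmasker('rabīndranātha ṭhākura chilēna agraṇī bāṅāli [MASK].'))
```

Remember that raw Indic-script text must first go through the ISO-15919 transliteration step sketched above.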
@@ -160,7 +171,7 @@ Then you can use this model directly with a pipeline for masked language modelin
 
 ### Limitations and bias
 
- Even though train on a comparatively large multilingual corpus the model may exhibit harmful Gender, Ethnic and Political bias. If you fine-tune this model on a task where these issues are important you should take special care when relying on this model.
+ Even though we pretrain on a comparatively large multilingual corpus, the model may exhibit harmful gender, ethnic and political bias. If you fine-tune this model on a task where these issues are important, you should take special care when relying on the model to make decisions.
 
 ### BibTeX entry and citation info
 