ibraheemmoosa committed on
Commit
7a62925
1 Parent(s): 3c45a9a

Add model description and usage examples.

Files changed (1)
  1. README.md +131 -4
README.md CHANGED
@@ -23,16 +23,143 @@ tags:
  - masked-language-modeling
  - sentence-order-prediction
  - fill-mask
  - nlp
  ---

  # XLMIndic Base Uniscript

- Pretrained ALBERT model on the OSCAR corpus on the languages Assamese, Bengali, Bihari, Bishnupriya Manipuri,
- Goan Konkani, Gujarati, Hindi, Maithili, Marathi, Nepali, Oriya, Panjabi, Sanskrit and Sinhala.
- Like ALBERT it was pretrained using as masked language modeling (MLM) and a sentence order prediction (SOP)
- objective. This model was pretrained after transliterating the text to ISO-15919 format using the Aksharamukha
  library. A demo of Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
  where you can transliterate your text and use it on our model on the inference widget.

  - masked-language-modeling
  - sentence-order-prediction
  - fill-mask
+ - xlmindic
+ - exbert
  - nlp
+ widget:
+ - text : 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.'
+
+ co2_eq_emissions:
+   emissions: "28.53 grams of CO2"
+   source: "calculated using this website https://mlco2.github.io/impact/#compute"
+   training_type: "pretraining"
+   geographical_location: "NA"
+   hardware_used: "TPUv3-8 for about 180 hours or 7.5 days"
  ---

  # XLMIndic Base Uniscript

+ Pretrained [ALBERT](https://arxiv.org/abs/1909.11942) model on the [OSCAR](https://huggingface.co/datasets/oscar) corpus covering 14 Indo-Aryan languages. This model was pretrained after transliterating the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/)
  library. A demo of Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
  where you can transliterate your text and use it on our model on the inference widget.

+ ## Model description
+
+ This model has the same configuration as the [ALBERT Base v2 model](https://huggingface.co/albert-base-v2/). Specifically, it has:
+
+ - 12 repeating layers
+ - 128 embedding dimension
+ - 768 hidden dimension
+ - 12 attention heads
+ - 11M parameters
+
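+ As a quick check, these hyperparameters can be read off the configuration stored with the checkpoint. The snippet below is a minimal sketch (it assumes the `transformers` library is installed); the tuple shown simply restates the values listed above:
+
+ ```python
+ >>> from transformers import AutoConfig
+ >>> # Load the ALBERT configuration that ships with this checkpoint on the Hub
+ >>> config = AutoConfig.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')
+ >>> (config.num_hidden_layers, config.embedding_size, config.hidden_size, config.num_attention_heads)
+ (12, 128, 768, 12)
+ ```
+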
+ ## Training data
+
+ This model was pretrained on the [OSCAR](https://huggingface.co/datasets/oscar) dataset, which is a medium-sized multilingual corpus containing text from 163 languages. We select a subset of 14 languages based on the following criteria (a loading sketch for one of these subsets follows the language list below):
+ - Belongs to the [Indo-Aryan language family](https://en.wikipedia.org/wiki/Indo-Aryan_languages).
+ - Uses a [Brahmic script](https://en.wikipedia.org/wiki/Brahmic_scripts).
+
+ These are the 14 languages we pretrain this model on:
+ - Assamese
+ - Bangla
+ - Bihari
+ - Bishnupriya Manipuri
+ - Goan Konkani
+ - Gujarati
+ - Hindi
+ - Maithili
+ - Marathi
+ - Nepali
+ - Oriya
+ - Panjabi
+ - Sanskrit
+ - Sinhala
+
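+ The per-language OSCAR subsets can be inspected with the `datasets` library. The snippet below is only an illustrative sketch, not the exact pretraining data pipeline, and the `unshuffled_deduplicated_bn` config name is an assumption for the Bangla subset:
+
+ ```python
+ >>> from datasets import load_dataset
+ >>> # Bangla portion of OSCAR (config name assumed); other languages have their own configs
+ >>> oscar_bn = load_dataset('oscar', 'unshuffled_deduplicated_bn', split='train')
+ >>> oscar_bn[0]['text'][:100]  # peek at the first document
+ ```
+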
+ ## Training procedure
+
+ ### Preprocessing
+
+ The texts are transliterated to ISO-15919 format using the Aksharamukha library. Then they are tokenized using SentencePiece with a vocabulary size of 50,000. The inputs of the model are
+ then of the form:
+ ```
+ [CLS] Sentence A [SEP] Sentence B [SEP]
+ ```
+
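+ To see this input format concretely, the following minimal sketch (assuming `transformers` is installed) encodes a pair of already transliterated sentences with the model's tokenizer and decodes them back, which should expose the `[CLS]`/`[SEP]` layout described above:
+
+ ```python
+ >>> from transformers import AutoTokenizer
+ >>> tokenizer = AutoTokenizer.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')
+ >>> # Two ISO-15919 transliterated sentences encoded as a sentence pair
+ >>> encoded = tokenizer('rabīndranātha ṭhākura chilēna kabi.', 'tini nōbēla puraskāra lābha karēna.')
+ >>> tokenizer.decode(encoded['input_ids'])  # shows the [CLS] ... [SEP] ... [SEP] structure
+ ```
+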
+ ### Training
+
+ The training objective is the same as for the original ALBERT: masked language modeling (MLM) together with sentence order prediction (SOP).
+ The details of the masking procedure for each sentence are the following (a toy sketch of both procedures follows the lists below):
+ - 15% of the tokens are masked.
+ - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
+ - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
+ - In the remaining 10% of cases, the masked tokens are left as is.
+
+ The details of the sentence order prediction example generation procedure for each sentence are the following:
+ - Split the sentence into two parts A and B at a random index.
+ - With 50% probability, swap the two parts.
+
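+ The following is a toy Python sketch of the two procedures above. It is illustrative only: actual pretraining operates on token ids inside the data pipeline, and the helper names here are hypothetical, not part of any library:
+
+ ```python
+ # Toy illustration of the masking and SOP example generation described above
+ import random
+
+ def mask_tokens(tokens, vocab, mask_prob=0.15):
+     """Apply the 80/10/10 masking rule to a list of string tokens."""
+     output = list(tokens)
+     for i in range(len(output)):
+         if random.random() < mask_prob:  # 15% of tokens are selected
+             roll = random.random()
+             if roll < 0.8:               # 80%: replace with [MASK]
+                 output[i] = '[MASK]'
+             elif roll < 0.9:             # 10%: replace with a different random token
+                 output[i] = random.choice([t for t in vocab if t != output[i]])
+             # remaining 10%: leave the token unchanged
+     return output
+
+ def make_sop_example(tokens):
+     """Split a sentence at a random index and swap the halves half of the time."""
+     split = random.randint(1, len(tokens) - 1)
+     part_a, part_b = tokens[:split], tokens[split:]
+     swapped = random.random() < 0.5      # label 1 means the parts were swapped
+     if swapped:
+         part_a, part_b = part_b, part_a
+     return part_a, part_b, int(swapped)
+ ```
+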
+ The model was pretrained on a TPUv3-8 for 1M steps. We have checkpoints available every 10k steps. We will upload these in the future.
+
+ ## Evaluation results
+
+ We evaluated this model on the [IndicGLUE](https://huggingface.co/datasets/indic_glue) benchmark dataset.
+
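+ For reference, IndicGLUE tasks can be loaded with the `datasets` library and used as fine-tuning and evaluation data. This is only a sketch; the `'wnli.hi'` task name below is an assumption about one of the available IndicGLUE configurations, not a task we report numbers for here:
+
+ ```python
+ >>> from datasets import load_dataset
+ >>> # One IndicGLUE task (config name assumed); see the dataset card for the full list
+ >>> task = load_dataset('indic_glue', 'wnli.hi')
+ >>> task['train'][0]
+ ```
+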
+ ## Intended uses & limitations
+
+ You can use the raw model for either masked language modeling or sentence order prediction, but it's mostly intended to
+ be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for
+ fine-tuned versions on a task that interests you.
+
+ Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
+ to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
+ generation you should look at models like GPT-2.
+
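+ As a starting point for such fine-tuning, the checkpoint can be loaded with a task-specific head. This is a minimal sketch (assuming `transformers` is installed; `num_labels=2` is an arbitrary example, and inputs must be ISO-15919 transliterated first, as shown in the usage example below):
+
+ ```python
+ >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ >>> tokenizer = AutoTokenizer.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')
+ >>> # Adds a freshly initialized classification head on top of the pretrained encoder
+ >>> model = AutoModelForSequenceClassification.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript', num_labels=2)
+ ```
+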
+ ### How to use
+
+ To use this model you will first need to install the [Aksharamukha](https://pypi.org/project/aksharamukha/) library.
+
+ ```bash
+ pip install aksharamukha
+ ```
+
+ Then you can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> from aksharamukha import transliterate
+ >>> unmasker = pipeline('fill-mask', model='ibraheemmoosa/xlmindic-base-uniscript')
+ >>> text = "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি [MASK], ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।"
+ >>> transliterated_text = transliterate.process('Bengali', 'ISO', text)
+ >>> transliterated_text
+ 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.'
+ >>> unmasker(transliterated_text)
+ [{'score': 0.39705055952072144,
+ 'token': 1500,
+ 'token_str': 'abhinētā',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli abhinētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
+ {'score': 0.20499080419540405,
+ 'token': 3585,
+ 'token_str': 'kabi',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
+ {'score': 0.1314290314912796,
+ 'token': 15402,
+ 'token_str': 'rājanētā',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli rājanētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
+ {'score': 0.060830358415842056,
+ 'token': 3212,
+ 'token_str': 'kalākāra',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kalākāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
+ {'score': 0.035522934049367905,
+ 'token': 11586,
+ 'token_str': 'sāhityakāra',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli sāhityakāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'}]
+ ```
+
+ ### Limitations and bias
+
+ Even though it is trained on a comparatively large multilingual corpus, the model may exhibit harmful gender, ethnic and political bias. If you fine-tune this model on a task where these issues are important you should take special care when relying on this model.
+
+ ### BibTeX entry and citation info
+
+ Coming soon!
+