ibraheemmoosa
/

xlmindic-base-multiscript

@@ -23,11 +23,133 @@ tags:
 - masked-language-modeling
 - sentence-order-prediction
 - fill-mask
 - nlp
 ---
 # XLMIndic Base Multiscript
-Pretrained ALBERT model on the OSCAR corpus on the languages Assamese, Bengali, Bihari, Bishnupriya Manipuri,
-Goan Konkani, Gujarati, Hindi, Maithili, Marathi, Nepali, Oriya, Panjabi, Sanskrit and Sinhala.
-Like ALBERT it was pretrained using as masked language modeling (MLM) and a sentence order prediction (SOP) objective.

 - masked-language-modeling
 - sentence-order-prediction
 - fill-mask
+- xlmindic
 - nlp
+- indoaryan
+- indicnlp
+- iso15919
+widget:
+- text : 'রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি [MASK], ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।'
+co2_eq_emissions:
+      emissions: "28.53 in grams of CO2"
+      source: "calculated using this webstie https://mlco2.github.io/impact/#compute"
+      training_type: "pretraining"
+      geographical_location: "NA"
+      hardware_used: "TPUv3-8 for about 180 hours or 7.5 days"
 ---
 # XLMIndic Base Multiscript
+This model is identical in all aspects to [this model](https://huggingface.co/ibraheemmoosa/xlmindic-base-uniscript) except that we do not perform the ISO-15919 transliteration. Thus it is intended to serve as an ablation model for our study. See [this](https://huggingface.co/ibraheemmoosa/xlmindic-base-uniscript) to understand the details.
+## Model description
+This model has the same configuration as the [ALBERT Base v2 model](https://huggingface.co/albert-base-v2/). Specifically, this model has the following configuration:
+- 12 repeating layers
+- 128 embedding dimension
+- 768 hidden dimension
+- 12 attention heads
+- 11M parameters
+- 512 sequence length
+## Training data
+This model was pretrained on the [OSCAR](https://huggingface.co/datasets/oscar) dataset which is a medium sized multilingual corpus containing text from 163 languages. We select a subset of 14 languages based on the following criteria:
+ - Belongs to the [Indo-Aryan language family](https://en.wikipedia.org/wiki/Indo-Aryan_languages).
+ - Uses a [Brahmic script](https://en.wikipedia.org/wiki/Brahmic_scripts).
+These are the 14 languages we pretrain this model on:
+- Assamese
+- Bangla
+- Bihari
+- Bishnupriya Manipuri
+- Goan Konkani
+- Gujarati
+- Hindi
+- Maithili
+- Marathi
+- Nepali
+- Oriya
+- Panjabi
+- Sanskrit
+- Sinhala
+## Training procedure
+### Preprocessing
+The texts are  tokenized using SentencePiece and a vocabulary size of 50,000. The inputs of the model are
+then of the form:
+```
+[CLS] Sentence A [SEP] Sentence B [SEP]
+```
+### Training
+Training objective is the same as the original ALBERT.
+.
+The details of the masking procedure for each sentence are the following:
+- 15% of the tokens are masked.
+- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
+- In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
+- In the 10% remaining cases, the masked tokens are left as is.
+The details of the sentence order prediction example generation procedure for each sentence are the following:
+- Split the sentence into two parts A and B at a random index.
+- With 50% probability swap the two parts.
+The model was pretrained on TPUv3-8 for 1M steps. We have checkpoints available every 10k steps. We will upload these in the future.
+## Evaluation results
+We evaluated this model on the Indo-Aryan subset of languages (Panjabi, Oriya, Assamese, Bangla, Hindi, Marathi, Gujarati) from the [IndicGLUE](https://huggingface.co/datasets/indic_glue) benchmark dataset. We report the mean and standard deviation of nine fine-tuning runs for this model.
+### IndicGLUE
+Task | mBERT | XLM-R | IndicBERT-Base | XLMIndic-Base-Uniscript | XLMIndic-Base-Multiscript (This Model)
+-----| ----- | ----- | ------ | ------- | --------
+Wikipedia Section Title Prediction| 71.90 | 65.45 | 69.40 | **81.78 ± 0.60** | 77.17 ± 0.76
+Article Genre Classification | 88.64 | 96.61 | 97.72 | **98.70 ± 0.29** | 98.30 ± 0.26
+Named Entity Recognition (F1-score) | 71.29 | 62.18 | 56.69 | **89.85 ± 1.14** |  83.19 ± 1.58
+BBC Hindi News Article Classification | 60.55 | 75.52 | 74.60 | **79.14 ± 0.60** | 77.28 ± 1.50
+Soham Bangla News Article Classification | 80.23 | 87.6 | 78.45 | **93.89 ± 0.48** | 93.22 ± 0.49
+INLTK Gujarati Headlines | - | - | **92.91** | 90.73 ± 0.75 | 90.41 ± 0.69
+INLTK Marathi Headlines | - | - | **94.30** | 92.04 ± 0.47 | 92.21 ± 0.23
+IITP Hindi Product Reviews Sentiment Classification | 74.57 | **78.97** | 71.32 | 77.18 ± 0.77 | 76.33 ± 0.84
+IITP Hindi Movie Reviews Sentiment Classification | 56.77 | 61.61 | 59.03 | **66.34 ± 0.16** | 65.91 ± 2.20
+MIDAS Hindi Discourse Classification | 71.20 | **79.94** | 78.44 | 78.54 ± 0.91 | 78.39 ± 0.33
+Cloze Style Question Answering | - | - | 37.16 | **41.54** | 38.21
+## Intended uses & limitations
+This model is pretrained on Indo-Aryan languages. Thus it is intended to be used for downstream tasks on these languages.
+You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
+be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for
+fine-tuned versions on a task that interests you.
+Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
+to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
+generation you should look at model like GPT2.
+### How to use
+Then you can use this model directly with a pipeline for masked language modeling:
+```python
+>>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='ibraheemmoosa/xlmindic-base-uniscript')
+>>> text = "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি [MASK], ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।"
+>>> unmasker(text)
+[{'score': 0.34163928031921387,
+  'token': 5399,
+  'token_str': 'কবি',
+  'sequence': 'রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, পন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।'},
+ {'score': 0.30519795417785645,
+  'token': 33436,
+  'token_str': 'people',
+  'sequence': 'রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি people, পন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।'},
+ {'score': 0.29130080342292786,
+  'token': 30476,
+  'token_str': 'সাহিত্যিক',
+  'sequence': 'রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি সাহিত্যিক, পন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য ত��নি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।'},
+ {'score': 0.031051287427544594,
+  'token': 6139,
+  'token_str': 'লেখক',
+  'sequence': 'রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি লেখক, পন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।'},
+ {'score': 0.002705035964027047,
+  'token': 38443,
+  'token_str': 'শিল্পীরা',
+  'sequence': 'রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি শিল্পীরা, পন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।'}]
+```
+### Limitations and bias
+Even though we pretrain on a comparatively large multilingual corpus the model may exhibit harmful gender, ethnic and political bias. If you fine-tune this model on a task where these issues are important you should take special care when relying on the model to make decisions.
+### BibTeX entry and citation info
+Coming soon!