ibraheemmoosa committed on
Commit
7a62925
1 Parent(s): 3c45a9a

Add model description and usage examples.

Files changed (1)
  1. README.md +131 -4
README.md CHANGED
@@ -23,16 +23,143 @@ tags:
  - masked-language-modeling
  - sentence-order-prediction
  - fill-mask
  - nlp
  ---

  # XLMIndic Base Uniscript

- Pretrained ALBERT model on the OSCAR corpus on the languages Assamese, Bengali, Bihari, Bishnupriya Manipuri,
- Goan Konkani, Gujarati, Hindi, Maithili, Marathi, Nepali, Oriya, Panjabi, Sanskrit and Sinhala.
- Like ALBERT it was pretrained using as masked language modeling (MLM) and a sentence order prediction (SOP)
- objective. This model was pretrained after transliterating the text to ISO-15919 format using the Aksharamukha
  library. A demo of Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
  where you can transliterate your text and use it on our model on the inference widget.

  - masked-language-modeling
  - sentence-order-prediction
  - fill-mask
+ - xlmindic
+ - exbert
  - nlp
+ widget:
+ - text : 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.'
+
+ co2_eq_emissions:
+   emissions: "28.53 grams of CO2"
+   source: "calculated using this website https://mlco2.github.io/impact/#compute"
+   training_type: "pretraining"
+   geographical_location: "NA"
+   hardware_used: "TPUv3-8 for about 180 hours or 7.5 days"
  ---

  # XLMIndic Base Uniscript

+ Pretrained [ALBERT](https://arxiv.org/abs/1909.11942) model on the [OSCAR](https://huggingface.co/datasets/oscar) corpus covering 14 Indo-Aryan languages. This model was pretrained after transliterating the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/)
  library. A demo of Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
  where you can transliterate your text and use it on our model on the inference widget.

+ ## Model description
+
+ This model has the same configuration as the [ALBERT Base v2 model](https://huggingface.co/albert-base-v2/). Specifically, it has:
+
+ - 12 repeating layers
+ - 128 embedding dimension
+ - 768 hidden dimension
+ - 12 attention heads
+ - 11M parameters
+
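+ As a quick check, these hyperparameters can be read off the configuration stored with the checkpoint. The snippet below is a minimal sketch (it assumes the `transformers` library is installed); the tuple shown simply restates the values listed above:
+
+ ```python
+ >>> from transformers import AutoConfig
+ >>> # Load the ALBERT configuration that ships with this checkpoint on the Hub
+ >>> config = AutoConfig.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')
+ >>> (config.num_hidden_layers, config.embedding_size, config.hidden_size, config.num_attention_heads)
+ (12, 128, 768, 12)
+ ```
+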
+ ## Training data
+
+ This model was pretrained on the [OSCAR](https://huggingface.co/datasets/oscar) dataset, which is a medium-sized multilingual corpus containing text from 163 languages. We select a subset of 14 languages based on the following criteria (a loading sketch for one of these subsets follows the language list below):
+ - Belongs to the [Indo-Aryan language family](https://en.wikipedia.org/wiki/Indo-Aryan_languages).
+ - Uses a [Brahmic script](https://en.wikipedia.org/wiki/Brahmic_scripts).
+
+ These are the 14 languages we pretrain this model on:
+ - Assamese
+ - Bangla
+ - Bihari
+ - Bishnupriya Manipuri
+ - Goan Konkani
+ - Gujarati
+ - Hindi
+ - Maithili
+ - Marathi
+ - Nepali
+ - Oriya
+ - Panjabi
+ - Sanskrit
+ - Sinhala
+
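+ The per-language OSCAR subsets can be inspected with the `datasets` library. The snippet below is only an illustrative sketch, not the exact pretraining data pipeline, and the `unshuffled_deduplicated_bn` config name is an assumption for the Bangla subset:
+
+ ```python
+ >>> from datasets import load_dataset
+ >>> # Bangla portion of OSCAR (config name assumed); other languages have their own configs
+ >>> oscar_bn = load_dataset('oscar', 'unshuffled_deduplicated_bn', split='train')
+ >>> oscar_bn[0]['text'][:100]  # peek at the first document
+ ```
+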
+ ## Training procedure
+
+ ### Preprocessing
+
+ The texts are transliterated to ISO-15919 format using the Aksharamukha library. Then they are tokenized using SentencePiece with a vocabulary size of 50,000. The inputs of the model are
+ then of the form:
+ ```
+ [CLS] Sentence A [SEP] Sentence B [SEP]
+ ```
+
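+ To see this input format concretely, the following minimal sketch (assuming `transformers` is installed) encodes a pair of already transliterated sentences with the model's tokenizer and decodes them back, which should expose the `[CLS]`/`[SEP]` layout described above:
+
+ ```python
+ >>> from transformers import AutoTokenizer
+ >>> tokenizer = AutoTokenizer.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')
+ >>> # Two ISO-15919 transliterated sentences encoded as a sentence pair
+ >>> encoded = tokenizer('rabīndranātha ṭhākura chilēna kabi.', 'tini nōbēla puraskāra lābha karēna.')
+ >>> tokenizer.decode(encoded['input_ids'])  # shows the [CLS] ... [SEP] ... [SEP] structure
+ ```
+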
+ ### Training
+
+ The training objective is the same as for the original ALBERT: masked language modeling (MLM) together with sentence order prediction (SOP).
+ The details of the masking procedure for each sentence are the following (a toy sketch of both procedures follows the lists below):
+ - 15% of the tokens are masked.
+ - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
+ - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
+ - In the remaining 10% of cases, the masked tokens are left as is.
+
+ The details of the sentence order prediction example generation procedure for each sentence are the following:
+ - Split the sentence into two parts A and B at a random index.
+ - With 50% probability, swap the two parts.
+
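+ The following is a toy Python sketch of the two procedures above. It is illustrative only: actual pretraining operates on token ids inside the data pipeline, and the helper names here are hypothetical, not part of any library:
+
+ ```python
+ # Toy illustration of the masking and SOP example generation described above
+ import random
+
+ def mask_tokens(tokens, vocab, mask_prob=0.15):
+     """Apply the 80/10/10 masking rule to a list of string tokens."""
+     output = list(tokens)
+     for i in range(len(output)):
+         if random.random() < mask_prob:  # 15% of tokens are selected
+             roll = random.random()
+             if roll < 0.8:               # 80%: replace with [MASK]
+                 output[i] = '[MASK]'
+             elif roll < 0.9:             # 10%: replace with a different random token
+                 output[i] = random.choice([t for t in vocab if t != output[i]])
+             # remaining 10%: leave the token unchanged
+     return output
+
+ def make_sop_example(tokens):
+     """Split a sentence at a random index and swap the halves half of the time."""
+     split = random.randint(1, len(tokens) - 1)
+     part_a, part_b = tokens[:split], tokens[split:]
+     swapped = random.random() < 0.5      # label 1 means the parts were swapped
+     if swapped:
+         part_a, part_b = part_b, part_a
+     return part_a, part_b, int(swapped)
+ ```
+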
+ The model was pretrained on a TPUv3-8 for 1M steps. We have checkpoints available every 10k steps. We will upload these in the future.
+
+ ## Evaluation results
+
+ We evaluated this model on the [IndicGLUE](https://huggingface.co/datasets/indic_glue) benchmark dataset.
+
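+ For reference, IndicGLUE tasks can be loaded with the `datasets` library and used as fine-tuning and evaluation data. This is only a sketch; the `'wnli.hi'` task name below is an assumption about one of the available IndicGLUE configurations, not a task we report numbers for here:
+
+ ```python
+ >>> from datasets import load_dataset
+ >>> # One IndicGLUE task (config name assumed); see the dataset card for the full list
+ >>> task = load_dataset('indic_glue', 'wnli.hi')
+ >>> task['train'][0]
+ ```
+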
+ ## Intended uses & limitations
+
+ You can use the raw model for either masked language modeling or sentence order prediction, but it's mostly intended to
+ be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for
+ fine-tuned versions on a task that interests you.
+
+ Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
+ to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
+ generation you should look at models like GPT-2.
+
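+ As a starting point for such fine-tuning, the checkpoint can be loaded with a task-specific head. This is a minimal sketch (assuming `transformers` is installed; `num_labels=2` is an arbitrary example, and inputs must be ISO-15919 transliterated first, as shown in the usage example below):
+
+ ```python
+ >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ >>> tokenizer = AutoTokenizer.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')
+ >>> # Adds a freshly initialized classification head on top of the pretrained encoder
+ >>> model = AutoModelForSequenceClassification.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript', num_labels=2)
+ ```
+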
+ ### How to use
+
+ To use this model you will first need to install the [Aksharamukha](https://pypi.org/project/aksharamukha/) library.
+
+ ```bash
+ pip install aksharamukha
+ ```
+
+ Then you can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> from aksharamukha import transliterate
+ >>> unmasker = pipeline('fill-mask', model='ibraheemmoosa/xlmindic-base-uniscript')
+ >>> text = "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি [MASK], ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।"
+ >>> transliterated_text = transliterate.process('Bengali', 'ISO', text)
+ >>> transliterated_text
+ 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.'
+ >>> unmasker(transliterated_text)
+ [{'score': 0.39705055952072144,
+ 'token': 1500,
+ 'token_str': 'abhinētā',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli abhinētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
+ {'score': 0.20499080419540405,
+ 'token': 3585,
+ 'token_str': 'kabi',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
+ {'score': 0.1314290314912796,
+ 'token': 15402,
+ 'token_str': 'rājanētā',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli rājanētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
+ {'score': 0.060830358415842056,
+ 'token': 3212,
+ 'token_str': 'kalākāra',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kalākāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
+ {'score': 0.035522934049367905,
+ 'token': 11586,
+ 'token_str': 'sāhityakāra',
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli sāhityakāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'}]
+ ```
+
+ ### Limitations and bias
+
+ Even though it is trained on a comparatively large multilingual corpus, the model may exhibit harmful gender, ethnic and political bias. If you fine-tune this model on a task where these issues are important you should take special care when relying on this model.
+
+ ### BibTeX entry and citation info
+
+ Coming soon!
+