ibraheemmoosa committed on
Commit
c1feafb
·
1 Parent(s): 4a84c48

Soham model for seed 109

README.md ADDED
@@ -0,0 +1,184 @@
1
+ ---
2
+ language:
3
+ - as
4
+ - bn
5
+ - gu
6
+ - hi
7
+ - mr
8
+ - ne
9
+ - or
10
+ - pa
11
+ - si
12
+ - sa
13
+ - bpy
14
+ - mai
15
+ - bh
16
+ - gom
17
+ license: apache-2.0
18
+ datasets:
19
+ - oscar
20
+ tags:
21
+ - multilingual
22
+ - albert
23
+ - xlmindic
24
+ - nlp
25
+ - indoaryan
26
+ - indicnlp
27
+ - iso15919
28
+ - transliteration
29
+ - text-classification
30
+ widget:
31
+ - text : 'cīnēra madhyāñcalē āraō ēkaṭi śaharēra bāsindārā ābāra gharabandī haẏē paṛēchēna. āja maṅgalabāra natuna karē lakaḍāuna–saṁkrānta bidhiniṣēdha jāri haōẏāra para gharē āṭakā paṛēchēna tām̐rā. karōnāra ati saṁkrāmaka natuna dharana amikranēra bistāra ṭhēkātē ēmana padakṣēpa niẏēchē kartr̥pakṣa. khabara bārtā saṁsthā ēēphapira.'
32
+
33
+ co2_eq_emissions:
34
+ emissions: "0.21 in grams of CO2"
35
+ source: "calculated using this website https://mlco2.github.io/impact/#compute"
36
+ training_type: "fine-tuning"
37
+ geographical_location: "NA"
38
+ hardware_used: "P100 for about 1.5 hours"
39
+ ---
40
+
41
+ # XLMIndic Base Uniscript
42
+
43
+ This model is fine-tuned from [xlmindic-base-uniscript](https://huggingface.co/ibraheemmoosa/xlmindic-base-uniscript) on the Soham Bangla News Classification task, which is part of the IndicGLUE benchmark. **Before pretraining this model, we transliterated the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/)
44
+ library.** A demo of the Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
45
+ where you can transliterate your text and then try it on our model in the inference widget.
46
+
47
+ ## Model description
48
+
49
+ This model has the same configuration as the [ALBERT Base v2 model](https://huggingface.co/albert-base-v2/). Specifically, this model has the following configuration:
50
+
51
+ - 12 repeating layers
52
+ - 128 embedding dimension
53
+ - 768 hidden dimension
54
+ - 12 attention heads
55
+ - 11M parameters
56
+ - 512 sequence length
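
As a rough cross-check of the configuration listed above, the parameter count can be estimated from the layer sizes. The sketch below is an illustration only: it assumes the usual ALBERT layout (factorized embeddings, one shared transformer layer reused 12 times, a pooler and a 6-way classifier) and takes `vocab_size`, `intermediate_size` and the number of labels from the `config.json` in this commit. The helper name `albert_param_count` is hypothetical. Under these assumptions, the 50,000-token vocabulary gives an estimate of about 14.2M parameters, while a 30,000-token vocabulary, as in ALBERT Base v2, gives a figure close to the 11M quoted above.

```python
# Rough parameter estimate for an ALBERT-style classifier.
# Assumption: standard ALBERT layout with a single shared transformer layer,
# so the 12 repeating layers contribute their weights only once.

def albert_param_count(vocab=50000, emb=128, hidden=768, inter=3072,
                       max_pos=512, type_vocab=2, num_labels=6):
    def linear(n_in, n_out):
        # weight matrix plus bias vector
        return n_in * n_out + n_out

    def layer_norm(n):
        # scale and shift vectors
        return 2 * n

    # word, position and token-type embeddings, plus their layer norm
    embeddings = (vocab + max_pos + type_vocab) * emb + layer_norm(emb)
    # projection from the 128-dim embedding space to the 768-dim hidden space
    projection = linear(emb, hidden)
    # query, key, value and output projections, plus layer norm
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
    # two feed-forward projections, plus layer norm
    feed_forward = linear(hidden, inter) + linear(inter, hidden) + layer_norm(hidden)
    pooler = linear(hidden, hidden)
    classifier = linear(hidden, num_labels)
    return embeddings + projection + attention + feed_forward + pooler + classifier


total = albert_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Most of the difference between the two vocabulary sizes comes from the word-embedding matrix, which grows linearly with the vocabulary.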
57
+
58
+ ## Training data
59
+ This model was fine-tuned on the Soham dataset, which is part of the IndicGLUE benchmark.
60
+
61
+ ## Transliteration
62
+
63
+ *The unique component of this model is that it takes in ISO-15919 transliterated text.*
64
+
65
+ The motivation is as follows. When two languages share vocabulary, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for a model to make the connection. Thus, if we can write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
66
+
67
+ For many of the scripts currently in use, there are standard transliteration schemes to convert to the Latin script. In particular, for the Indic scripts the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts to the Latin script.
68
+
69
+ An example of ISO-15919 transliteration for a piece of **Bangla** text is the following:
70
+
71
+ **Original:** "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক।"
72
+
73
+ **Transliterated:** 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika.'
74
+
75
+ Another example for a piece of **Hindi** text is the following:
76
+
77
+ **Original:** "चूंकि मानव परिवार के सभी सदस्यों के जन्मजात गौरव और समान तथा अविच्छिन्न अधिकार की स्वीकृति ही विश्व-शान्ति, न्याय और स्वतन्त्रता की बुनियाद है"
78
+
79
+ **Transliterated:** "cūṁki mānava parivāra kē sabhī sadasyōṁ kē janmajāta gaurava aura samāna tathā avicchinna adhikāra kī svīkr̥ti hī viśva-śānti, nyāya aura svatantratā kī buniyāda hai"
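
The motivation above can be made concrete with a small sketch. It compares the Hindi word "अधिकार" from the example above with the Bangla spelling of the same word. The ISO-15919 strings are hard-coded rather than produced by Aksharamukha: the Hindi form is copied from the transliterated example, and the Bangla form is an assumption about what the transliteration yields. The helper `shared_characters` is hypothetical.

```python
# Illustration: the same word shares no characters across native scripts,
# but has an identical surface form once both are written in ISO-15919.

hindi_word = "अधिकार"    # Devanagari, taken from the Hindi example above
bangla_word = "অধিকার"   # Bengali script spelling of the same word

hindi_iso = "adhikāra"   # from the transliterated Hindi example above
bangla_iso = "adhikāra"  # assumed ISO-15919 form of the Bangla spelling


def shared_characters(a, b):
    """Return the set of characters that occur in both strings."""
    return set(a) & set(b)


print(shared_characters(hindi_word, bangla_word))  # no overlap in native scripts
print(hindi_iso == bangla_iso)                     # identical after transliteration
```

A subword tokenizer trained on the transliterated text can therefore assign the same tokens to both words, which is not possible when each language stays in its own script.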
80
+
81
+
82
+ ## Training procedure
83
+
84
+ ### Preprocessing
85
+
86
+ The texts are transliterated to ISO-15919 format using the Aksharamukha library. They are then tokenized using SentencePiece with a vocabulary size of 50,000.
87
+
88
+ ### Training
89
+
90
+ The model was trained for 8 epochs with a batch size of 16 and a learning rate of *2e-5*.
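
The hyperparameters above could be expressed as follows. This is a minimal sketch, not the authors' training script: it assumes a single device (so the batch size of 16 is the per-device batch size), assumes that the seed 109 mentioned in the commit message is the training seed, and uses hypothetical helper names. The `transformers` import is deferred so that the step calculation runs without the library installed.

```python
import math

# Values from the Training section; the seed comes from the commit message.
HPARAMS = {
    "num_train_epochs": 8,
    "per_device_train_batch_size": 16,  # assumes single-device training
    "learning_rate": 2e-5,
    "seed": 109,
}


def total_optimizer_steps(num_examples, hparams=HPARAMS):
    """Number of optimizer steps for a training set of the given size."""
    steps_per_epoch = math.ceil(num_examples / hparams["per_device_train_batch_size"])
    return steps_per_epoch * hparams["num_train_epochs"]


def build_training_arguments(output_dir):
    """Build Hugging Face TrainingArguments from the values above."""
    from transformers import TrainingArguments  # imported lazily

    return TrainingArguments(output_dir=output_dir, **HPARAMS)


# Example with a made-up dataset size of 1,000 examples.
print(total_optimizer_steps(1000))
```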
91
+
92
+ ## Evaluation results
93
+ See results specific to Soham in the following table.
94
+
95
+ ### IndicGLUE
96
+ Task | mBERT | XLM-R | IndicBERT-Base | XLMIndic-Base-Uniscript (This Model) | XLMIndic-Base-Multiscript (Ablation Model)
97
+ -----| ----- | ----- | ------ | ------- | --------
98
+ Wikipedia Section Title Prediction | 71.90 | 65.45 | 69.40 | **81.78 ± 0.60** | 77.17 ± 0.76
99
+ Article Genre Classification | 88.64 | 96.61 | 97.72 | **98.70 ± 0.29** | 98.30 ± 0.26
100
+ Named Entity Recognition (F1-score) | 71.29 | 62.18 | 56.69 | **89.85 ± 1.14** | 83.19 ± 1.58
101
+ BBC Hindi News Article Classification | 60.55 | 75.52 | 74.60 | **79.14 ± 0.60** | 77.28 ± 1.50
102
+ Soham Bangla News Article Classification | 80.23 | 87.6 | 78.45 | **93.89 ± 0.48** | 93.22 ± 0.49
103
+ INLTK Gujarati Headlines Genre Classification | - | - | **92.91** | 90.73 ± 0.75 | 90.41 ± 0.69
104
+ INLTK Marathi Headlines Genre Classification | - | - | **94.30** | 92.04 ± 0.47 | 92.21 ± 0.23
105
+ IITP Hindi Product Reviews Sentiment Classification | 74.57 | **78.97** | 71.32 | 77.18 ± 0.77 | 76.33 ± 0.84
106
+ IITP Hindi Movie Reviews Sentiment Classification | 56.77 | 61.61 | 59.03 | **66.34 ± 0.16** | 65.91 ± 2.20
107
+ MIDAS Hindi Discourse Type Classification | 71.20 | **79.94** | 78.44 | 78.54 ± 0.91 | 78.39 ± 0.33
108
+ Cloze Style Question Answering (Fill-mask task) | - | - | 37.16 | **41.54** | 38.21
109
+
110
+ ## Intended uses & limitations
111
+
112
+ This model is pretrained on Indo-Aryan languages. Thus it is intended to be used for downstream tasks on these languages. However, since Dravidian languages such as Malayalam, Telugu, Kannada, etc. share a lot of vocabulary with the Indo-Aryan languages, this model can potentially be used on those languages too (after transliterating the text to ISO-15919).
113
+
114
+ You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
115
+ be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for
116
+ fine-tuned versions on a task that interests you.
117
+ Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
118
+ to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
119
+ generation you should look at a model like GPT2.
120
+
121
+ ### How to use
122
+
123
+ To use this model you will need to first install the [Aksharamukha](https://pypi.org/project/aksharamukha/) library.
124
+
125
+ ```bash
126
+ pip install aksharamukha
127
+ ```
128
+
129
+ Using this library you can transliterate any text written in Indic scripts in the following way:
130
+ ```python
131
+ >>> from aksharamukha import transliterate
132
+ >>> text = "चूंकि मानव परिवार के सभी सदस्यों के जन्मजात गौरव और समान तथा अविच्छिन्न अधिकार की स्वीकृति ही विश्व-शान्ति, न्याय और स्वतन्त्रता की बुनियाद है"
133
+ >>> transliterated_text = transliterate.process('autodetect', 'ISO', text)
134
+ >>> transliterated_text
135
+ "cūṁki mānava parivāra kē sabhī sadasyōṁ kē janmajāta gaurava aura samāna tathā avicchinna adhikāra kī svīkr̥ti hī viśva-śānti, nyāya aura svatantratā kī buniyāda hai"
136
+ ```
137
+
138
+ Then you can use this model directly with a pipeline for masked language modeling:
139
+
140
+ ```python
141
+ >>> from transformers import pipeline
142
+ >>> from aksharamukha import transliterate
143
+ >>> unmasker = pipeline('fill-mask', model='ibraheemmoosa/xlmindic-base-uniscript')
144
+ >>> text = "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি [MASK], ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।"
145
+ >>> transliterated_text = transliterate.process('Bengali', 'ISO', text)
146
+ >>> transliterated_text
147
+ 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'
148
+ >>> unmasker(transliterated_text)
149
+ [{'score': 0.39705055952072144,
150
+ 'token': 1500,
151
+ 'token_str': 'abhinētā',
152
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli abhinētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
153
+ {'score': 0.20499080419540405,
154
+ 'token': 3585,
155
+ 'token_str': 'kabi',
156
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
157
+ {'score': 0.1314290314912796,
158
+ 'token': 15402,
159
+ 'token_str': 'rājanētā',
160
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli rājanētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
161
+ {'score': 0.060830358415842056,
162
+ 'token': 3212,
163
+ 'token_str': 'kalākāra',
164
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kalākāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
165
+ {'score': 0.035522934049367905,
166
+ 'token': 11586,
167
+ 'token_str': 'sāhityakāra',
168
+ 'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli sāhityakāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'}]
169
+ ```
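
Since this checkpoint is fine-tuned for news classification, a similar flow can be sketched with the `text-classification` pipeline. The repository id of the fine-tuned model is not stated in this commit, so `model_id` is left as a parameter the caller must supply. The helper names are hypothetical, and the imports are deferred so the selection logic can be tried without downloading a model.

```python
def best_prediction(results):
    """Pick the highest-scoring entry from a list of {'label', 'score'} dicts."""
    return max(results, key=lambda r: r["score"])


def classify_bangla(text, model_id):
    """Transliterate Bangla text to ISO-15919, then classify it.

    model_id is the Hub repository id of the fine-tuned checkpoint; it is
    not given in this commit, so the caller has to provide it.
    """
    from aksharamukha import transliterate  # imported lazily
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    iso_text = transliterate.process("Bengali", "ISO", text)
    return best_prediction(classifier(iso_text))


# The selection step on made-up scores:
print(best_prediction([
    {"label": "Sports", "score": 0.7},
    {"label": "International", "score": 0.2},
]))
```

The pipeline reports label names from the `id2label` mapping in the model's `config.json`, so the returned label should be one of the six news categories listed there.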
170
+
171
+ ### Limitations and bias
172
+
173
+ Even though we pretrain on a comparatively large multilingual corpus, the model may exhibit harmful gender, ethnic and political bias. If you fine-tune this model on a task where these issues are important, you should take special care when relying on the model to make decisions.
174
+
175
+ ## Contact
176
+
177
+ Feel free to contact us if you have any ideas or if you want to know more about our models.
178
+ - Ibraheem Muhammad Moosa (ibraheemmoosa1347@gmail.com)
179
+ - Mahmud Elahi Akhter (mahmud.akhter01@northsouth.edu)
180
+ - Ashfia Binte Habib
181
+
182
+ ## BibTeX entry and citation info
183
+
184
+ Coming soon!
config.json ADDED
@@ -0,0 +1,46 @@
1
+ {
2
+ "_name_or_path": ".",
3
+ "architectures": [
4
+ "AlbertForSequenceClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0,
7
+ "bos_token_id": 2,
8
+ "classifier_dropout_prob": 0.1,
9
+ "embedding_size": 128,
10
+ "eos_token_id": 3,
11
+ "hidden_act": "gelu_new",
12
+ "hidden_dropout_prob": 0,
13
+ "hidden_size": 768,
14
+ "id2label": {
15
+ "0": "India National News",
16
+ "1": "West Bengal State News",
17
+ "2": "Kolkata News",
18
+ "3": "Sports",
19
+ "4": "Entertainment",
20
+ "5": "International"
21
+ },
22
+ "initializer_range": 0.02,
23
+ "inner_group_num": 1,
24
+ "intermediate_size": 3072,
25
+ "label2id": {
26
+ "LABEL_0": 0,
27
+ "LABEL_1": 1,
28
+ "LABEL_2": 2,
29
+ "LABEL_3": 3,
30
+ "LABEL_4": 4,
31
+ "LABEL_5": 5
32
+ },
33
+ "layer_norm_eps": 1e-12,
34
+ "max_position_embeddings": 512,
35
+ "model_type": "albert",
36
+ "num_attention_heads": 12,
37
+ "num_hidden_groups": 1,
38
+ "num_hidden_layers": 12,
39
+ "pad_token_id": 0,
40
+ "position_embedding_type": "absolute",
41
+ "problem_type": "single_label_classification",
42
+ "torch_dtype": "float32",
43
+ "transformers_version": "4.15.0",
44
+ "type_vocab_size": 2,
45
+ "vocab_size": 50000
46
+ }
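
In the configuration above, `id2label` carries the category names while `label2id` still uses the generic `LABEL_n` keys. If a name-to-id lookup is needed, one option is to derive it from `id2label`. The sketch below assumes that is the intended mapping; the helper `invert_id2label` is hypothetical, and the JSON fragment is copied from the config.

```python
import json

# id2label fragment copied from config.json above.
config_fragment = json.loads("""
{
  "id2label": {
    "0": "India National News",
    "1": "West Bengal State News",
    "2": "Kolkata News",
    "3": "Sports",
    "4": "Entertainment",
    "5": "International"
  }
}
""")


def invert_id2label(id2label):
    """Build a name -> integer id mapping from an id -> name mapping."""
    return {name: int(idx) for idx, name in id2label.items()}


label2id = invert_id2label(config_fragment["id2label"])
print(label2id)
```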
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:beafddc5c971cd5fec86a64d12f65b66e4431c4433cb843e728d184db17d0dcf
3
+ size 56993846
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c7b9819708b78ed7c174d1ef0133abc7435f27d5f39f7cda26aac608eecfc942
3
+ size 57007313
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {"bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": {"content": "[MASK]", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
spiece.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:29a45ffb757d0acb1216f70e134c89c194e586ff42d026deb0c018b28d8daa7b
3
+ size 1242045
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8986ef8612d7dedf89b8a5283d715990f745c11b23e00e584c48450ab493424b
3
+ size 57034056
tokenizer.vocab ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1 @@
1
+ {"do_lower_case": false, "remove_space": true, "keep_accents": true, "bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": {"content": "[MASK]", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false, "__type": "AddedToken"}, "sp_model_kwargs": {}, "model_max_length": 512, "tokenizer_class": "AlbertTokenizer"}