---
language:
- en
- hi
- as
- bn
- gu
- kn
- ml
- mr
- or
- pa
- ta
- te
thumbnail: "https://www.kooapp.com/_next/static/media/logoKuSolidOutline.1f4fa971.svg"
license: mit
pipeline_tag: fill-mask
widget:
- text: "I like multilingual [MASK]."
  example_title: "English"
- text: "मुझे बहुभाषी वर्गीकरण [MASK] है |"
  example_title: "Hindi"
- text: "বহুভাষিক শ্ৰেণীবিভাজন [MASK] ভাল লাগে।"
  example_title: "Assamese"
- text: "আমি বহুভাষিক শ্রেণীবিভাগ [MASK] করি।"
  example_title: "Bengali"
- text: "મને બહુભાષી વર્ગીકરણ [MASK] છે."
  example_title: "Gujarati"
- text: "ನಾನು [MASK] ವರ್ಗೀಕರಣವನ್ನು ಇಷ್ಟಪಡುತ್ತೇನೆ."
  example_title: "Kannada"
- text: "എനിക്ക് ബഹുഭാഷാ [MASK] ഇഷ്ടമാണ്."
  example_title: "Malayalam"
- text: "मला बहुभाषिक वर्गीकरण [MASK]."
  example_title: "Marathi"
- text: "ମୁଁ ବହୁଭାଷୀ ବର୍ଗୀକରଣ [MASK] କରେ |"
  example_title: "Oriya"
- text: "ਮੈਨੂੰ ਬਹੁ-ਭਾਸ਼ਾਈ ਵਰਗੀਕਰਨ [MASK] ਹੈ।"
  example_title: "Punjabi"
- text: "நான் [MASK] வகைப்படுத்தலை விரும்புகிறேன்."
  example_title: "Tamil"
- text: "నాకు బహుభాషా వర్గీకరణ [MASK] ఇష్టం."
  example_title: "Telugu"
---

# Model Card for KooBERT

KooBERT is a masked language model trained on data from the multilingual microblogging platform [Koo India](https://www.kooapp.com/). <br>
The model was built in collaboration between Koo India and AI4Bharat.

## Model Details

### Model Description

On the Koo platform, microblogs (Koos) are limited to 400 characters and are available in multiple languages.
The model was trained on a dataset of multilingual Koos posted between January 2020 and November 2022, using the masked language modeling objective.


- **Model type:** BERT-based pretrained model
- **Language(s) (NLP):** Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu
- **License:** KooBERT is released under the MIT License.

## Uses

This model can be used for downstream tasks such as content classification and toxicity detection for the supported Indic languages. <br>
It can also be used with the sentence-transformers library to create multilingual vector embeddings for other applications.
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
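
Since the model is published as a fill-mask checkpoint (see the widget examples above), a quick way to try it is the standard `fill-mask` pipeline from transformers. This is a minimal sketch rather than an official example:

```
from transformers import pipeline

# Minimal masked-token prediction with KooBERT; the input format matches
# the widget examples in this card.
fill_mask = pipeline("fill-mask", model="Koodsml/KooBERT")
print(fill_mask("I like multilingual [MASK]."))
```
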
## Bias, Risks, and Limitations
As with any machine learning model, KooBERT may have limitations and biases. Keep in mind that it was trained on Koo social media data and may not generalize well to other domains. The training data itself may also contain biases that affect the model's predictions. We recommend evaluating the model on your specific use case and data to ensure it is appropriate for your needs.

## How to Get Started with the Model

Use the code below to get started with the model on a general fine-tuning task. Note that this is only a sample fine-tuning script.

```
import numpy as np
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Koodsml/KooBERT")
model = AutoModelForSequenceClassification.from_pretrained("Koodsml/KooBERT", num_labels=2)

# Metric for evaluation
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True, max_length=128)

# Load the CoLA dataset
dataset = load_dataset("glue", "cola")
dataset = dataset.rename_column('sentence', 'text')

dataset_tok = dataset.map(tokenize_function, batched=True)

# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define the training arguments
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tok['train'],
    eval_dataset=dataset_tok['validation'],
    compute_metrics=compute_metrics,
)

# Fine-tune on the CoLA dataset
trainer.train()

# Evaluate on the CoLA validation set
eval_results = trainer.evaluate(eval_dataset=dataset_tok['validation'])
print(eval_results)
```

We can also use KooBERT with the sentence-transformers library to create multilingual vector embeddings. Here is an example:
```
from sentence_transformers import SentenceTransformer

# Load the KooBERT model
koo_model = SentenceTransformer('Koodsml/KooBERT', device="cuda")

# Define the text (Hindi: "This has always been our thinking")
text = "यह हमेशा से हमारी सोच है"

# Get the embedding
embedding = koo_model.encode(text)
print(embedding)
```
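
Note that KooBERT was not pretrained with a sentence-level objective; when a plain transformers checkpoint is loaded this way, sentence-transformers typically falls back to mean pooling over token embeddings, so the resulting vectors may benefit from task-specific fine-tuning for similarity or retrieval use cases.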

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The following table shows the distribution of Koos and tokens across languages:

| Language         | Koos        | Avg Tokens per Koo  | Total Tokens |
|------------------|-------------|---------------------|--------------|
| assamese         | 562,050     | 16.4414198          | 9,240,900    |
| bengali          | 2,110,380   | 12.08918773         | 25,512,780   |
| english          | 17,889,600  | 10.93732057         | 195,664,290  |
| gujarati         | 1,825,770   | 14.33965395         | 26,180,910   |
| hindi            | 35,948,760  | 16.2337502          | 583,583,190  |
| kannada          | 2,653,860   | 12.04577107         | 31,967,790   |
| malayalam        | 71,370      | 10.32744851         | 737,070      |
| marathi          | 1,894,080   | 14.81544602         | 28,061,640   |
| oriya            | 87,930      | 14.1941317          | 1,248,090    |
| punjabi          | 940,260     | 18.59961075         | 17,488,470   |
| tamil            | 1,687,710   | 12.12147822         | 20,457,540   |
| telugu           | 2,471,940   | 10.55735576         | 26,097,150   |


Total Koos = 68,143,710<br>
Total Tokens = 966,239,820 (approximate)

### Training Procedure 

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing
Personally Identifiable Information (PII) was removed from the microblogs before training.
Temperature sampling was used to upsample low-resource languages, with a temperature of 0.7 (see Section 3.1 of https://arxiv.org/pdf/1901.07291.pdf).
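
As an illustration (not the actual training code), the sampling probabilities under this scheme can be computed from the per-language Koo counts in the table above; only the counts and the temperature come from this card, the rest is a sketch:

```
import numpy as np

# Temperature sampling sketch: a language with raw data fraction q_i is
# sampled with probability proportional to q_i ** alpha (alpha = 0.7),
# which upsamples low-resource languages relative to their raw share.
alpha = 0.7
koos = np.array([562_050, 2_110_380, 17_889_600, 1_825_770, 35_948_760,
                 2_653_860, 71_370, 1_894_080, 87_930, 940_260,
                 1_687_710, 2_471_940], dtype=float)  # per-language Koo counts from the table above

q = koos / koos.sum()                # raw language proportions
p = q**alpha / (q**alpha).sum()      # temperature-adjusted sampling probabilities
print(np.round(p, 4))
```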


#### Training Hyperparameters

**Training regime**
+ Training steps - 1M steps
+ Warmup - 10k steps
+ Learning rate - 5e-4
+ Scheduler - linear decay
+ Optimizer - Adam
+ Batch size - 4096 sequences
+ Precision - fp32
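
For reference, the regime above maps roughly onto Hugging Face `TrainingArguments` as sketched below. This is illustrative only and not the original TPU pretraining setup; the per-device batch size and gradient accumulation split are assumptions chosen to reach 4096 sequences per step.

```
from transformers import TrainingArguments

# Illustrative mapping of the reported regime onto TrainingArguments;
# not the original TPU pretraining configuration.
pretrain_args = TrainingArguments(
    output_dir="koobert-pretrain",
    max_steps=1_000_000,              # 1M training steps
    warmup_steps=10_000,              # 10k warmup steps
    learning_rate=5e-4,
    lr_scheduler_type="linear",       # linear decay
    optim="adamw_torch",              # Adam-family optimizer (AdamW here)
    per_device_train_batch_size=32,   # assumption: 32 x 128 accumulation = 4096 sequences/step
    gradient_accumulation_steps=128,
)
```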


## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->
The model has not been benchmarked yet. We will release benchmark results in a future update.


## Model Examination

<!-- Relevant interpretability work for the model goes here -->

### Model Architecture and Objective

KooBERT is pretrained with the BERT architecture on the masked language modeling objective, using a vocabulary size of 128k and a maximum sequence length of 128 tokens.
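
These details can be checked against the hosted checkpoint; a minimal sketch, assuming the standard BERT config fields:

```
from transformers import AutoConfig

# Inspect the architecture details reported above (the values in the
# comments are what this card reports, not guaranteed checkpoint values).
config = AutoConfig.from_pretrained("Koodsml/KooBERT")
print(config.model_type)               # expected: "bert"
print(config.vocab_size)               # reported vocabulary size: ~128k
print(config.max_position_embeddings)  # the card reports a max sequence length of 128
```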

### Compute Infrastructure

KooBERT was trained on a TPU v3 with 128 cores; training took over 5 days.


## Contributors

Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in)) - IITM, AI4Bharat<br>
Sumanth Doddapaneni ([dsumanth17@gmail.com](mailto:dsumanth17@gmail.com)) - IITM, AI4Bharat<br>
Smiral Rashinkar ([smiral.rashinkar@kooapp.com](mailto:smiral.rashinkar@kooapp.com)) - Koo India