---
language:
- en
- hi
- as
- bn
- gu
- kn
- ml
- mr
- or
- pa
- ta
- te
thumbnail: "https://www.kooapp.com/_next/static/media/logoKuSolidOutline.1f4fa971.svg"
license: mit
pipeline_tag: fill-mask
widget:
- text: "I like multilingual [MASK]."
  example_title: "English"
- text: "मुझे बहुभाषी वर्गीकरण [MASK] है |"
  example_title: "Hindi"
- text: "বহুভাষিক শ্ৰেণীবিভাজন [MASK] ভাল লাগে।"
  example_title: "Assamese"
- text: "আমি বহুভাষিক শ্রেণীবিভাগ [MASK] করি।"
  example_title: "Bengali"
- text: "મને બહુભાષી વર્ગીકરણ [MASK] છે."
  example_title: "Gujarati"
- text: "ನಾನು [MASK] ವರ್ಗೀಕರಣವನ್ನು ಇಷ್ಟಪಡುತ್ತೇನೆ."
  example_title: "Kannada"
- text: "എനിക്ക് ബഹുഭാഷാ [MASK] ഇഷ്ടമാണ്."
  example_title: "Malayalam"
- text: "मला बहुभाषिक वर्गीकरण [MASK]."
  example_title: "Marathi"
- text: "ମୁଁ ବହୁଭାଷୀ ବର୍ଗୀକରଣ [MASK] କରେ |"
  example_title: "Oriya"
- text: "ਮੈਨੂੰ ਬਹੁ-ਭਾਸ਼ਾਈ ਵਰਗੀਕਰਨ [MASK] ਹੈ।"
  example_title: "Punjabi"
- text: "நான் [MASK] வகைப்படுத்தலை விரும்புகிறேன்."
  example_title: "Tamil"
- text: "నాకు బహుభాషా వర్గీకరణ [MASK] ఇష్టం."
  example_title: "Telugu"
---

# Model Card for KooBERT

KooBERT is a masked language model trained on data from the multilingual micro-blogging social media platform [Koo India](https://www.kooapp.com/).
This model was built in collaboration with Koo India and AI4Bharat.

## Model Details

### Model Description

On the Koo platform, microblogs (Koos) are limited to 400 characters and are posted in multiple languages. The model was trained on the masked language modeling task using a dataset of multilingual Koos posted between January 2020 and November 2022.

- **Model type:** BERT-based pretrained model
- **Language(s) (NLP):** Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu
- **License:** KooBERT is released under the MIT License.

## Uses

This model can be used for downstream tasks such as content classification and toxicity detection in the supported Indic languages. Because it is pretrained with a masked language modeling objective, it can also be queried directly for masked-token prediction, as sketched below.
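The following is a minimal sketch (not part of the original card) that assumes the checkpoint works with the standard `transformers` fill-mask pipeline and the usual BERT-style `[MASK]` token shown in the widget examples:

```
import torch
from transformers import pipeline

# Load the fill-mask pipeline with the KooBERT checkpoint
# (device=0 uses the first GPU; -1 falls back to CPU)
fill_mask = pipeline(
    "fill-mask",
    model="Koodsml/KooBERT",
    device=0 if torch.cuda.is_available() else -1,
)

# Query the masked language modeling head with one of the widget examples
predictions = fill_mask("I like multilingual [MASK].")

# Each prediction carries the predicted token, its score, and the filled-in sequence
for pred in predictions:
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```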
It can also be used with the sentence-transformers library to create multilingual vector embeddings for other uses.

## Bias, Risks, and Limitations

As with any machine learning model, KooBERT has limitations and may carry biases. Keep in mind that this model was trained on Koo social media data and may not generalize well to other domains. Biases present in the training data may also be reflected in the model's predictions. It is recommended to evaluate the model on your specific use case and data to ensure it is appropriate for your needs.

## How to Get Started with the Model

Use the code below to get started with the model for general fine-tuning tasks. Please note this is just a sample fine-tuning script (here on the GLUE CoLA task).

```
import numpy as np
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Koodsml/KooBERT")
model = AutoModelForSequenceClassification.from_pretrained("Koodsml/KooBERT", num_labels=2)

# Metric for the CoLA task
metric = evaluate.load("glue", "cola")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Load the CoLA dataset
dataset = load_dataset("glue", "cola")
dataset = dataset.rename_column("sentence", "text")
dataset_tok = dataset.map(tokenize_function, batched=True)

# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define the training arguments
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tok["train"],
    eval_dataset=dataset_tok["validation"],
    compute_metrics=compute_metrics,
)

# Fine-tune on the CoLA dataset
trainer.train()

# Evaluate on the CoLA validation split
eval_results = trainer.evaluate(eval_dataset=dataset_tok["validation"])
print(eval_results)
```

We can also use KooBERT with the sentence-transformers library to create multilingual vector embeddings. Here is an example:

```
from sentence_transformers import SentenceTransformer

# Load the KooBERT model
koo_model = SentenceTransformer("Koodsml/KooBERT", device="cuda")

# Define the text (Hindi: "This has always been our thinking")
text = "यह हमेशा से हमारी सोच है"

# Get the embedding
embedding = koo_model.encode(text)
print(embedding)
```

## Training Details

### Training Data

The following table shows the distribution of Koos and tokens over languages:

| Language  | Koos       | Avg tokens per Koo | Total tokens |
|-----------|------------|--------------------|--------------|
| Assamese  | 562,050    | 16.4414198         | 9,240,900    |
| Bengali   | 2,110,380  | 12.08918773        | 25,512,780   |
| English   | 17,889,600 | 10.93732057        | 195,664,290  |
| Gujarati  | 1,825,770  | 14.33965395        | 26,180,910   |
| Hindi     | 35,948,760 | 16.2337502         | 583,583,190  |
| Kannada   | 2,653,860  | 12.04577107        | 31,967,790   |
| Malayalam | 71,370     | 10.32744851        | 737,070      |
| Marathi   | 1,894,080  | 14.81544602        | 28,061,640   |
| Oriya     | 87,930     | 14.1941317         | 1,248,090    |
| Punjabi   | 940,260    | 18.59961075        | 17,488,470   |
| Tamil     | 1,687,710  | 12.12147822        | 20,457,540   |
| Telugu    | 2,471,940  | 10.55735576        | 26,097,150   |

Total Koos = 68,143,710
Total Tokens = 966,239,820 (based on a close approximation)

### Training Procedure

#### Preprocessing

- Personally Identifiable Information (PII) was removed from the microblogs before training.
- Low-resource languages were upsampled with temperature sampling, using a temperature of 0.7 (see Section 3.1 of https://arxiv.org/pdf/1901.07291.pdf); a minimal sketch of this sampling scheme is shown below.
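The sketch below is illustrative only, not the original training pipeline. It assumes the exponent applied to the per-language probabilities equals the quoted temperature of 0.7 (the q_i ∝ p_i^α formulation from Section 3.1 of the XLM paper) and reuses the per-language token counts from the table above; the `token_counts` dictionary and variable names are ours.

```
# Illustrative sketch of temperature sampling, not the actual training code.
# Assumes alpha (the exponent on the per-language probabilities) is the
# temperature of 0.7 reported in this card.
token_counts = {
    "assamese": 9_240_900, "bengali": 25_512_780, "english": 195_664_290,
    "gujarati": 26_180_910, "hindi": 583_583_190, "kannada": 31_967_790,
    "malayalam": 737_070, "marathi": 28_061_640, "oriya": 1_248_090,
    "punjabi": 17_488_470, "tamil": 20_457_540, "telugu": 26_097_150,
}

alpha = 0.7

# p_i: raw share of each language in the corpus
total = sum(token_counts.values())
p = {lang: n / total for lang, n in token_counts.items()}

# q_i ∝ p_i^alpha: flattened distribution used to sample training examples,
# which upsamples low-resource languages relative to their raw share
unnorm = {lang: p_i ** alpha for lang, p_i in p.items()}
z = sum(unnorm.values())
q = {lang: v / z for lang, v in unnorm.items()}

for lang in sorted(q, key=q.get, reverse=True):
    print(f"{lang:10s}  raw={p[lang]:.4f}  sampled={q[lang]:.4f}")
```

With these counts, Hindi's sampling share falls below its raw corpus share, while low-resource languages such as Malayalam and Oriya are sampled more often than their raw share, which is the intended upsampling effect.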
#### Training Hyperparameters

**Training regime**

- Training steps: 1M
- Warmup steps: 10k
- Learning rate: 5e-4
- Scheduler: linear decay
- Optimizer: Adam
- Batch size: 4,096 sequences
- Precision: fp32

## Evaluation

The model has not been benchmarked yet. We will release benchmark data in a future update.

## Model Examination

### Model Architecture and Objective

KooBERT is pretrained with the BERT architecture on the masked language modeling objective, with a vocabulary size of 128k and a maximum sequence length of 128 tokens.

### Compute Infrastructure

KooBERT was trained on a TPU v3 with 128 cores; training took over 5 days.

## Contributors

- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in)) - IITM, AI4Bharat
- Sumanth Doddapaneni ([dsumanth17@gmail.com](mailto:dsumanth17@gmail.com)) - IITM, AI4Bharat
- Smiral Rashinkar ([smiral.rashinkar@kooapp.com](mailto:smiral.rashinkar@kooapp.com)) - Koo India