Model Card for Deepakvictor/tamil_bs_bert
BERT base model
Pretrained model on the Tamil language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository.
Model description
BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. In the same way, this model is trained on Tamil with the objective of predicting a masked word [MASK]. Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
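For illustration only, the 15% random masking described above can be reproduced with the Hugging Face `DataCollatorForLanguageModeling`; this is a minimal sketch, not the exact preprocessing used to train this model.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Deepakvictor/tamil_bs_bert")

# Randomly replace 15% of the tokens with [MASK] (the same rate described above)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoding = tokenizer("தமிழ் மொழியை வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்")
batch = collator([encoding])
print(tokenizer.decode(batch["input_ids"][0]))  # some tokens now appear as [MASK]
```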
Training of this model
This model was trained on the dataset AnanthZeke/tamil_sentences_master_raw. The first 10.6M sentences were used for training with a batch size of 64. The model reached an overall training loss of 0.687 and an evaluation loss of 0.80. The dataset used for evaluation is the same dataset, taking its last 120,000 rows.
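As a rough sketch (assumed, not the author's exact script), the split described above could be reproduced with the `datasets` library; the split name `train` is an assumption.

```python
from datasets import load_dataset

# Assumed split layout: first 10.6M sentences for training,
# last 120,000 rows for evaluation.
ds = load_dataset("AnanthZeke/tamil_sentences_master_raw", split="train")

train_ds = ds.select(range(10_600_000))
eval_ds = ds.select(range(len(ds) - 120_000, len(ds)))
print(len(train_ds), len(eval_ds))
```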
Model variations
BERT was originally released in base and large variations, for cased and uncased input text. Casing is not relevant for this model, since the Tamil script does not distinguish letter case. This model is a base model with 110M parameters.
| Model | #params | Language |
|---|---|---|
| bert-base-uncased | 110M | Tamil |
Intended uses & limitations
You can use this raw model for masked language modeling, or fine-tune it on a downstream task. Since this model does not follow WordPiece tokenization but a different subword tokenization, there is a higher chance that the predicted masked word is only a subword.
How to use
```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='Deepakvictor/tamil_bs_bert')
unmasker("தமிழ் [MASK] வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்")

[{'score': 0.14111991226673126,
  'token': 12540,
  'token_str': 'மொழியை',
  'sequence': 'தமிழ் மொழியை வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
 {'score': 0.0806930884718895,
  'token': 2461,
  'token_str': 'மக்களுக்கு',
  'sequence': 'தமிழ் மக்களுக்கு வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
 {'score': 0.016404788941144943,
  'token': 3461,
  'token_str': 'எழுத',
  'sequence': 'தமிழ் எழுத வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
 {'score': 0.015853099524974823,
  'token': 5849,
  'token_str': 'எழுதி',
  'sequence': 'தமிழ் எழுதி வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
 {'score': 0.015091801062226295,
  'token': 1107,
  'token_str': 'எப்படி',
  'sequence': 'தமிழ் எப்படி வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'}]
```
To use the model directly in PyTorch:
```python
# Load the model and tokenizer
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Deepakvictor/tamil_bs_bert")
model = AutoModelForMaskedLM.from_pretrained("Deepakvictor/tamil_bs_bert")

# Tokenize the input
inp = tokenizer("தமிழ் [MASK] வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்", return_tensors="pt")

# Forward pass
out = model(**inp)

# Decode the most likely token at every position; the [MASK] position is
# replaced by the model's top prediction
tokenizer.decode(out.logits.softmax(-1).argmax(-1).view(-1).tolist(), skip_special_tokens=True)
```
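If you only need the candidates for the masked position rather than a decode of every position, a small extension of the snippet above (continuing from `inp`, `out`, and `tokenizer`) could look like this; it uses only standard torch/transformers calls.

```python
import torch

# Locate the [MASK] position and take the model's top 5 candidates for it
mask_index = (inp["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
mask_probs = out.logits[0, mask_index, :].softmax(-1)
top5 = torch.topk(mask_probs, k=5, dim=-1)

for score, token_id in zip(top5.values[0], top5.indices[0]):
    print(tokenizer.decode([int(token_id)]), float(score))
```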
Limitations and bias
As mentioned above, the model may output a subword for the masked token. Also, since the model is trained in a self-supervised fashion on raw text, its predictions may reflect biases present in the training data.
Training data
This BERT model was pretrained on the tamil-sentence dataset (AnanthZeke/tamil_sentences_master_raw).
Training procedure
Preprocessing
A tokenizer was trained on the same tamil-sentence dataset with a vocabulary size of 29,677. The details of the masking procedure for each sentence are the following: 15% of the tokens are masked.
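The card does not state which subword algorithm the tokenizer uses; purely as an illustration, the sketch below trains a BPE tokenizer with the `tokenizers` library to the stated vocabulary size (the algorithm choice, special tokens, and corpus column name are assumptions).

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative only: BPE is an assumption, not the model's documented algorithm.
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=29677,  # vocabulary size stated above
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Assumed: iterate over the raw sentences of the tamil-sentence dataset
# (column name "sentence" is a guess; adjust to the actual dataset schema).
tamil_sentences = (row["sentence"] for row in train_ds)
tok.train_from_iterator(tamil_sentences, trainer=trainer)
```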
Pretraining
The model was trained on a P100 GPU over ten million sentences with a batch size of 64. The optimizer used is AdamW with a learning rate of 1e-5.
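A hedged sketch of how such a run could be set up with the transformers Trainer is shown below; aside from the batch size of 64, the AdamW optimizer, the 15% masking rate, and the learning rate of 1e-5 stated in this card, all other arguments (model config, epochs, output path) are assumptions, not the author's configuration.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed setup: a bert-base sized config with the trained tokenizer's vocabulary.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tamil_bs_bert",       # assumed output path
    per_device_train_batch_size=64,   # batch size from the card
    learning_rate=1e-5,               # learning rate from the card
    optim="adamw_torch",              # AdamW, as stated above
    num_train_epochs=1,               # assumed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,           # assumed to be tokenized into input_ids
    eval_dataset=eval_ds,             # assumed to be tokenized into input_ids
    data_collator=collator,
)
trainer.train()
```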
Evaluation results
This bert-base model produces an evaluation loss of 0.8 on 120,200 sentences.
BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```