---
language: sa
---

# RoBERTa trained on Sanskrit (SanBERTa)

**Model size** (after training): **340 MB**

### Dataset

- [Wikipedia articles](https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles) (used in [iNLTK](https://github.com/goru001/nlp-for-sanskrit)). It includes an evaluation set.
- [Sanskrit scraps from CLTK](http://cltk.org/)

### Configuration

| Parameter | Value |
|---|---|
| `num_attention_heads` | 12 |
| `num_hidden_layers` | 6 |
| `hidden_size` | 768 |
| `vocab_size` | 29407 |

### Training

- On TPU
- For language modelling
- Iteratively increasing `--block_size` from 128 to 256 over epochs

### Evaluation

| Metric | Value |
|---|---|
| Perplexity (`block_size=256`) | 4.04 |

## Example of usage

### For embeddings

```python
from transformers import AutoTokenizer, RobertaModel

tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")
model = RobertaModel.from_pretrained("surajp/SanBERTa")

# "This language is considered the oldest language not only of India but of the world."
op = tokenizer.encode("इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।", return_tensors="pt")
ps = model(op)
ps[0].shape
```
```
'''
Output:
--------
torch.Size([1, 47, 768])
```

### For `<mask>` prediction

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="surajp/SanBERTa",
    tokenizer="surajp/SanBERTa"
)

## इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।
fill_mask("इयं भाषा न केवल<mask> भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।")
```
```
'''
Output:
--------
[{'score': 0.7516744136810303,
  'sequence': ' इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।',
  'token': 280,
  'token_str': 'à¤Ĥ'},
 {'score': 0.06230105459690094,
  'sequence': ' इयं भाषा न केवली भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।',
  'token': 289,
  'token_str': 'à¥Ģ'},
 {'score': 0.055410224944353104,
  'sequence': ' इयं भाषा न केवला भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।',
  'token': 265,
  'token_str': 'ा'},
 ...]
```

```bibtex
@misc{Parmar2020Sanberta,
  author    = {Parmar, Suraj},
  title     = {SanBERTa - a RoBERTa trained on Sanskrit},
  year      = {2020},
  month     = {Jun},
  publisher = {Hugging Face Model Hub},
  url       = {https://huggingface.co/surajp/SanBERTa}
}
```

### It works!! 🎉 🎉 🎉

> Created by [Suraj Parmar/@parmarsuraj99](https://twitter.com/parmarsuraj99) | [LinkedIn](https://www.linkedin.com/in/parmarsuraj99/)

> Made with ❤ in India
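
The Training section above raises `--block_size` from 128 to 256 across epochs. A minimal sketch of what that parameter controls — cutting a tokenized corpus into fixed-length training examples, dropping the trailing remainder. The helper below is hypothetical, for illustration only; it is not the actual SanBERTa training script.

```python
# Hypothetical illustration of `--block_size` chunking for language modelling.
# Not the actual training code; `chunk_into_blocks` is an invented helper.

def chunk_into_blocks(token_ids, block_size):
    """Split token ids into contiguous blocks of `block_size`, dropping the tail."""
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

corpus = list(range(1000))                  # stand-in for a tokenized corpus
print(len(chunk_into_blocks(corpus, 128)))  # 7 examples early in training
print(len(chunk_into_blocks(corpus, 256)))  # 3 longer examples later on
```

Larger blocks expose the model to longer contexts at higher memory cost per example, which is why growing `block_size` gradually over epochs can be attractive on fixed-memory hardware such as a TPU.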