Model: monsoon-nlp/hindi-bert

monsoon-nlp Nick Doiron
How to use this model directly from the 🤗/transformers library:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/hindi-bert")
model = AutoModel.from_pretrained("monsoon-nlp/hindi-bert")

Hindi-BERT (Discriminator)

This is a first run of a Hindi language model trained with Google Research's ELECTRA. I don't modify ELECTRA itself until the finetuning step.

Tokenization and training CoLab:

Blog post:

Greatly influenced by:



The corpus is two files:

Bonus notes:

  • Adding English wiki text or parallel corpus could help with cross-lingual tasks and training


Bonus notes:

  • Created with HuggingFace Tokenizers; the vocabulary could be longer or shorter, so review ELECTRA's vocab_size param
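
Since the vocabulary length is adjustable, it is worth checking that the number of entries in vocab.txt matches the vocab_size that ELECTRA is given. A minimal sketch, assuming vocab.txt holds one token per line and that the value is passed to ELECTRA as a `vocab_size` entry in its hparams JSON. The helper names and the example path are hypothetical:

```python
import json
from pathlib import Path


def count_vocab_entries(vocab_path):
    """Count tokens in a one-token-per-line vocab file (empty lines ignored)."""
    with open(vocab_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.rstrip("\r\n"))


def build_hparams(vocab_path):
    """Return a JSON string carrying vocab_size, usable as an hparams override."""
    return json.dumps({"vocab_size": count_vocab_entries(vocab_path)})


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever your vocab.txt lives.
    vocab_file = Path("trainer/vocab.txt")
    if vocab_file.exists():
        print(build_hparams(vocab_file))
```

If the printed number differs from the vocab_size used for pretraining, one of the two probably needs to change before training starts.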

The Pretrain TF Records step splits the corpus into training documents.

Set the ELECTRA model size and whether to split the corpus by newlines. This process can take hours on its own.

Bonus notes:

  • I am not sure what the corpus newline split means (what is the alternative?) or, for this corpus, which setting creates the better training docs
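
One way to picture the question is a small pure-Python illustration of the two possible readings. This is not ELECTRA's code. It assumes the setting decides whether blank lines in the corpus mark document boundaries, and the function name `split_documents` is made up for this sketch:

```python
def split_documents(lines, blanks_separate_docs=True):
    """Group corpus lines into documents.

    With blanks_separate_docs=True, a blank line closes the current document.
    With False, blank lines are dropped and everything forms one document.
    """
    docs, current = [], []
    for raw in lines:
        line = raw.strip()
        if line:
            current.append(line)
        elif blanks_separate_docs and current:
            docs.append(current)
            current = []
    if current:
        docs.append(current)
    return docs


corpus = ["पहला वाक्य", "दूसरा वाक्य", "", "नया लेख"]
print(len(split_documents(corpus, True)))   # 2 documents
print(len(split_documents(corpus, False)))  # 1 document
```

Under this reading, splitting keeps a training example from straddling two unrelated articles, while not splitting gives longer continuous documents. Which of the two suits this corpus better remains an open question.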


Structure your files, with data-dir named "trainer" here

- vocab.txt
- pretrain_tfrecords
-- (all .tfrecord... files)
- models
-- modelname
--- checkpoint
--- graph.pbtxt
--- model.*
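
The layout above can be scaffolded and checked with a few lines of Python. This is only a convenience sketch: the helper names are invented here, and `modelname` is the placeholder from the listing, not a real model name.

```python
from pathlib import Path


def scaffold_data_dir(root="trainer", model_name="modelname"):
    """Create the empty folder skeleton; files are added by later steps."""
    root = Path(root)
    (root / "pretrain_tfrecords").mkdir(parents=True, exist_ok=True)
    (root / "models" / model_name).mkdir(parents=True, exist_ok=True)
    return root


def missing_inputs(root="trainer"):
    """List what pretraining would still need: vocab.txt and .tfrecord files."""
    root = Path(root)
    missing = []
    if not (root / "vocab.txt").is_file():
        missing.append("vocab.txt")
    if not any((root / "pretrain_tfrecords").glob("*.tfrecord*")):
        missing.append("pretrain_tfrecords/*.tfrecord*")
    return missing
```

Calling `scaffold_data_dir()` and then `missing_inputs()` before a long pretraining run is a cheap way to catch a misplaced vocab.txt or an empty pretrain_tfrecords folder.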

The CoLab notebook gives examples of GPU vs. TPU setup.


Using this model with Transformers

Sample movie reviews classifier: