Model: monsoon-nlp/hindi-bert

Contributed by monsoon-nlp (Nick Doiron)

How to use this model directly from the 🤗/transformers library:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/hindi-bert")
model = AutoModel.from_pretrained("monsoon-nlp/hindi-bert")
```

Hindi-BERT (Discriminator)

This is a first run of a Hindi language model trained with Google Research's ELECTRA. I don't modify ELECTRA itself until we get to finetuning.
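ELECTRA pretrains a discriminator to spot tokens that a small generator network has swapped in, instead of predicting masked tokens directly. The toy sketch below (the `corrupt` function and the example tokens are hypothetical, not from the training code) illustrates the per-token real/replaced labels the discriminator learns to predict:

```python
import random

def corrupt(tokens, replacements, mask_prob=0.3, seed=0):
    """Toy ELECTRA-style corruption: randomly swap some tokens.

    Returns the corrupted sequence and the per-token labels the
    discriminator is trained to predict (1 = replaced, 0 = original).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            fake = rng.choice(replacements)
            corrupted.append(fake)
            # An accidental exact match still counts as "original".
            labels.append(0 if fake == tok else 1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = ["mai", "ghar", "ja", "raha", "hoon"]
corrupted, labels = corrupt(tokens, replacements=["kitab", "pani", "ghar"])
print(corrupted, labels)
```

In the real model the generator is a small masked language model trained jointly, not a random sampler, but the discriminator's binary per-token objective is the same.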

Tokenization and training CoLab:

Blog post:

Greatly influenced by:



The corpus is two files:

Bonus notes:

  • Adding English wiki text or a parallel corpus could help with cross-lingual tasks and training


Bonus notes:

  • The vocabulary was created with HuggingFace Tokenizers; it could be longer or shorter, so review ELECTRA's vocab_size param
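As a sketch of that step, the snippet below trains a WordPiece vocabulary with the HuggingFace Tokenizers library on a throwaway two-line corpus. The corpus contents and the vocab_size of 1000 are placeholders; whatever size you pick here has to match ELECTRA's vocab_size hyperparameter later.

```python
import tempfile

from tokenizers import BertWordPieceTokenizer

# Throwaway two-line corpus; substitute the real Hindi corpus files.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("main ghar ja raha hoon\nyah kitab acchi hai\n")
    corpus_path = f.name

tokenizer = BertWordPieceTokenizer(lowercase=False)
# vocab_size caps the vocabulary (placeholder value here).
tokenizer.train(files=[corpus_path], vocab_size=1000, min_frequency=1)
vocab = tokenizer.get_vocab()
print(len(vocab))
```

BertWordPieceTokenizer adds the BERT special tokens ([CLS], [SEP], [PAD], [UNK], [MASK]) by default, which ELECTRA also expects.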

The Pretrain TF Records step splits the corpus into training documents.

Set the ELECTRA model size and whether to split the corpus by newlines. This process can take hours on its own.

Bonus notes:

  • I am not sure what the corpus newline split means (what is the alternative?), or which option creates the better training docs for this corpus
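For reference, an invocation of that step might look like the following. The flag names are taken from the google-research/electra repo's build_pretraining_dataset.py as I recall them, and the values are placeholders, so verify against the script's --help before relying on this:

```shell
python3 build_pretraining_dataset.py \
  --corpus-dir trainer/corpus \
  --vocab-file trainer/vocab.txt \
  --output-dir trainer/pretrain_tfrecords \
  --max-seq-length 128 \
  --num-processes 4
```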


Structure your files, with data-dir named "trainer" here:

```
trainer/
├── vocab.txt
├── pretrain_tfrecords/
│   └── (all .tfrecord... files)
└── models/
    └── modelname/
        ├── checkpoint
        ├── graph.pbtxt
        └── model.*
```
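The skeleton above can be created up front with a short script. A minimal sketch using only the standard library, with "trainer" and "modelname" as the placeholder names from the listing (the checkpoint files themselves are written by ELECTRA during pretraining):

```python
from pathlib import Path

def make_data_dir(root="trainer", model_name="modelname"):
    """Create the directory skeleton ELECTRA expects for pretraining."""
    root = Path(root)
    (root / "pretrain_tfrecords").mkdir(parents=True, exist_ok=True)
    (root / "models" / model_name).mkdir(parents=True, exist_ok=True)
    # vocab.txt sits at the top level of the data dir.
    (root / "vocab.txt").touch()
    return root

root = make_data_dir()
print(sorted(p.relative_to(root).as_posix() for p in root.rglob("*")))
```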

The CoLab notebook gives examples of GPU vs. TPU setup.


Using this model with Transformers

Sample movie reviews classifier:
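To sketch the finetuning side: the snippet below builds a tiny, randomly initialized ELECTRA sequence classifier so it runs without downloading anything. In real use you would instead load the checkpoint with `ElectraForSequenceClassification.from_pretrained("monsoon-nlp/hindi-bert", num_labels=2)` and train on the labeled reviews; the config sizes here are arbitrary toy values.

```python
import torch
from transformers import ElectraConfig, ElectraForSequenceClassification

# Toy config so the sketch runs offline; replace with from_pretrained(...)
# on "monsoon-nlp/hindi-bert" for the real classifier.
config = ElectraConfig(
    vocab_size=100,
    embedding_size=16,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,  # positive / negative review
)
model = ElectraForSequenceClassification(config)
model.eval()

input_ids = torch.randint(0, 100, (1, 8))  # stand-in for a tokenized review
with torch.no_grad():
    logits = model(input_ids).logits  # shape: (batch, num_labels)
print(logits.shape)
```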