## Training procedure

### Preprocessing

The texts are tokenized using Byte-Pair Encoding (BPE) with a vocabulary size of 50265.

- The `mc4` and `oscar` datasets were available in the `datasets` library. For the rest of the datasets, we wrote our own loading scripts, [available here](https://github.com/amankhandelia/roberta_hindi/blob/master/test_custom_datasets.py).
- It was somewhat challenging to download the `mc4` dataset (104 GB+), preprocess it, and use it in non-streaming mode. The `datasets` library provides many helper functions that allowed us to merge and shuffle the datasets with ease.
- We cleaned the mC4 and OSCAR datasets by removing all non-Hindi (non-Devanagari) characters, as both corpora are relatively noisy (see the sketch at the end of this section).
- We attempted to correct the evaluation set of the [WikiNER task of IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by manually relabelling instances where the original labels were incorrect, and we modified the downstream evaluation dataset accordingly. The code and the manually labelled file are available in our [GitHub repo](https://github.com/amankhandelia/roberta_hindi).

### Pretraining

The model was trained on a Google Cloud Engine TPU v3-8 machine (with 335 GB of RAM, 1000 GB of disk, and 96 CPU cores). A randomized shuffle of the combined mC4, OSCAR, and other datasets listed above was used to train the model. Downstream training logs are available on [wandb](https://wandb.ai/wandb/hf-flax-roberta-hindi).
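
The snippet below is a minimal sketch of how such a cleaned, combined corpus could be assembled with the `datasets` library. The dataset names and configs (`mc4`/`hi`, `oscar`/`unshuffled_deduplicated_hi`), the Devanagari-only regex, and the shuffle seed are illustrative assumptions; the exact filtering used for the released model may have differed.

```python
import re

from datasets import concatenate_datasets, load_dataset

# Keep only Devanagari characters and whitespace; everything else is stripped.
# The exact filter (e.g. handling of digits and punctuation) is an assumption.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\s]")

def clean_text(example):
    example["text"] = NON_DEVANAGARI.sub(" ", example["text"])
    return example

# Hindi subsets of mC4 and OSCAR, as exposed through the `datasets` hub.
mc4_hi = load_dataset("mc4", "hi", split="train")
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")

# Keep only the text column so the two corpora can be concatenated.
mc4_hi = mc4_hi.remove_columns([c for c in mc4_hi.column_names if c != "text"])
oscar_hi = oscar_hi.remove_columns([c for c in oscar_hi.column_names if c != "text"])

# Remove non-Devanagari characters, then merge and shuffle the corpora.
mc4_hi = mc4_hi.map(clean_text)
oscar_hi = oscar_hi.map(clean_text)
combined = concatenate_datasets([mc4_hi, oscar_hi]).shuffle(seed=42)
```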
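
A cleaned corpus like this can also be used to train the BPE tokenizer mentioned under Preprocessing. The sketch below assumes a byte-level BPE tokenizer trained with the `tokenizers` library, which is the standard RoBERTa setup; the corpus file path, output directory, `min_frequency`, and special tokens are illustrative assumptions rather than the exact configuration used.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the cleaned corpus dumped to a plain-text file.
# "hindi_corpus.txt" is a hypothetical path; the special tokens are the standard RoBERTa ones.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["hindi_corpus.txt"],
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("roberta-hindi-tokenizer", exist_ok=True)
tokenizer.save_model("roberta-hindi-tokenizer")
```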