hassiahk committed on
Commit 9a1f8a7
1 Parent(s): 6ecb4b7

Update About/training_procedure.md

Files changed (1): About/training_procedure.md (+3 -3)

## Training procedure

### Preprocessing

The texts are tokenized using Byte-Pair Encoding (BPE) with a vocabulary size of 50265.
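
As a rough illustration of this step, the snippet below trains a byte-level BPE tokenizer with the Hugging Face `tokenizers` library. It is a minimal sketch: the corpus file path, `min_frequency`, and the special tokens are assumed RoBERTa-style defaults, not settings taken from this document.

```python
from tokenizers import ByteLevelBPETokenizer

# Minimal sketch of training a RoBERTa-style byte-level BPE tokenizer.
# "hindi_corpus.txt" is a placeholder path; min_frequency and the special
# tokens are assumed defaults, not documented settings.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["hindi_corpus.txt"],
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt
```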
- The `mc4` and `oscar` datasets were available in the `datasets` library. For the rest of the datasets, we wrote our own loading scripts, [available here](https://github.com/amankhandelia/roberta_hindi/blob/master/test_custom_datasets.py).
- It was slightly challenging to run the `mc4` dataset (104 GB+), preprocess it, and use it in non-streaming mode. The `datasets` library has many helper functions that allowed us to merge and shuffle the datasets with ease (see the sketch after this list).
- We had to clean up the mC4 and oscar datasets by removing all non-Hindi (non-Devanagari) characters, as these datasets are somewhat noisy.
- We attempted to clean up the evaluation set of the [WikiNER of IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by manually relabelling examples whose labels were incorrect, and modified the downstream evaluation dataset accordingly. The code and the manually labelled file are also present in our [github repo](https://github.com/amankhandelia/roberta_hindi).
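
The sketch below shows how the loading, merging, shuffling, and Devanagari-only cleanup described above could look with the `datasets` library. The OSCAR config name, the shuffle seed, and the cleanup regex are assumptions for illustration; the authors' own loading scripts and cleanup rules live in the linked repository.

```python
import re

from datasets import concatenate_datasets, load_dataset

# Hindi subsets of mC4 and OSCAR; in practice mC4-hi (100 GB+) needs a
# machine with plenty of disk when used in non-streaming mode, as noted above.
mc4_hi = load_dataset("mc4", "hi", split="train")
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")

# Keep only the text column so the schemas match before concatenation.
mc4_hi = mc4_hi.remove_columns([c for c in mc4_hi.column_names if c != "text"])
oscar_hi = oscar_hi.remove_columns([c for c in oscar_hi.column_names if c != "text"])

combined = concatenate_datasets([mc4_hi, oscar_hi]).shuffle(seed=42)

# Drop everything outside the Devanagari block (plus digits, whitespace and
# basic punctuation); the authors' exact cleanup rules may differ.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F0-9\s.,!?]")

def clean(example):
    example["text"] = NON_DEVANAGARI.sub(" ", example["text"])
    return example

combined = combined.map(clean).filter(lambda ex: len(ex["text"].strip()) > 0)
```

The other corpora loaded through the custom scripts linked above would be stripped to a `text` column and concatenated into `combined` in the same way.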

### Pretraining

The model was trained on a Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, and 96 CPU cores). A randomized shuffle of the combined dataset of mC4, oscar, and the other datasets listed above was used to train the model. Downstream training logs are available on [wandb](https://wandb.ai/wandb/hf-flax-roberta-hindi).
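
For completeness, here is a hedged sketch of the model-side setup for Flax/JAX masked-language-model pretraining. A RoBERTa-base layout (12 layers, 12 heads) is an assumption, since the document only fixes the vocabulary size and the TPU hardware.

```python
from transformers import FlaxRobertaForMaskedLM, RobertaConfig

# Sketch of instantiating the model for MLM pretraining in Flax.
# The layer/head counts are assumed RoBERTa-base values, not documented here.
config = RobertaConfig(
    vocab_size=50265,
    max_position_embeddings=514,
    num_hidden_layers=12,
    num_attention_heads=12,
    type_vocab_size=1,
)
model = FlaxRobertaForMaskedLM(config, seed=0)
```

On a TPUv3-8, the training loop would typically shard each batch across the eight cores with `jax.pmap`, as the Hugging Face Flax example scripts do.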