roberta-hindi / About /training_procedure.md

mlkorra

Update About Page

de89287 almost 3 years ago

## Training procedure
### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50265. The inputs of
the model are blocks of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked
with `<s>` and the end of one with `</s>`.
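
The block construction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: `pack_documents` is a hypothetical helper, the tokens are placeholder strings rather than real BPE ids, and dropping the incomplete final block is an assumption.

```python
from typing import Iterable, List

BOS, EOS = "<s>", "</s>"  # document boundary markers named in the text


def pack_documents(docs: Iterable[List[str]], block_size: int = 512) -> List[List[str]]:
    """Concatenate tokenized documents and cut the stream into full blocks."""
    stream: List[str] = []
    for tokens in docs:
        # Wrap each document in boundary markers before appending it,
        # so that a block may span several documents.
        stream.extend([BOS, *tokens, EOS])
    # Keep only complete blocks; the trailing remainder is dropped (assumption).
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Tiny example with a block size of 4 instead of 512, using placeholder tokens.
blocks = pack_documents([["a", "b", "c"], ["d", "e"], ["f"]], block_size=4)
```

With the small block size used here, the second block starts with the `</s>` of the first document and continues into the next one, which is what "may span multiple documents" means in practice.
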

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
- In the remaining 10% of the cases, the masked tokens are left as is.

Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
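
As a rough illustration of the 80/10/10 rule above, the sketch below masks a list of token ids in plain Python. The function name `mask_tokens`, the independent 15% selection per token, the optional skipping of special tokens, and the example ids are all assumptions made for illustration; the actual training code may differ.

```python
import random
from typing import List, Optional, Set, Tuple


def mask_tokens(
    token_ids: List[int],
    vocab_size: int,
    mask_id: int,
    special_ids: Optional[Set[int]] = None,
    rng: Optional[random.Random] = None,
    mlm_probability: float = 0.15,
) -> Tuple[List[int], List[Optional[int]]]:
    """Return (corrupted inputs, labels); labels are None where no loss applies."""
    rng = rng or random.Random()
    special_ids = special_ids or set()
    inputs = list(token_ids)
    labels: List[Optional[int]] = [None] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Select roughly 15% of the non-special tokens for prediction.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            # 80% of selected tokens become the mask token.
            inputs[i] = mask_id
        elif roll < 0.9:
            # 10% become a random token different from the original.
            candidate = rng.randrange(vocab_size)
            while candidate == tok:
                candidate = rng.randrange(vocab_size)
            inputs[i] = candidate
        # The remaining 10% are left unchanged but still predicted.
    return inputs, labels


VOCAB_SIZE = 50265                    # vocabulary size stated above
MASK_ID = 50264                       # hypothetical id for `<mask>`
example = [0, 713, 16, 10, 7728, 2]   # hypothetical token ids
masked_inputs, labels = mask_tokens(
    example, VOCAB_SIZE, MASK_ID, special_ids={0, 2}, rng=random.Random(0)
)
```

Because a function like this would be called each time a batch is built, rather than once during preprocessing, the masked positions change from epoch to epoch.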

### Pretraining
The model was trained on a Google Cloud Engine TPUv3-8 machine (335 GB of RAM, 1000 GB of disk storage, 96 CPU cores, and 8 v3 TPU cores) for 42K steps with a batch size of 128 and a sequence length of 128. The
optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning
rate afterwards.
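
The warmup-then-decay schedule can be written as a small function. This is a sketch under assumptions the text does not state: the rate starts at 0, rises linearly to the 6e-4 peak over the 24,000 warmup steps, and then falls linearly to 0 at step 42,000. `learning_rate` is a hypothetical helper name.

```python
def learning_rate(
    step: int,
    peak_lr: float = 6e-4,
    warmup_steps: int = 24_000,
    total_steps: int = 42_000,
) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (assumed endpoints)."""
    if step < warmup_steps:
        # Warmup phase: scale the peak rate by the fraction of warmup completed.
        return peak_lr * step / warmup_steps
    # Decay phase: scale by the fraction of post-warmup steps still remaining.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Sample the schedule at a few steps to see its shape.
for step in (0, 12_000, 24_000, 33_000, 42_000):
    print(step, learning_rate(step))
```

In an Optax-based setup, the same shape could likely be expressed by joining two `optax.linear_schedule` segments with `optax.join_schedules`.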