Question on the training epoch
#1 by Tomohide - opened
Thank you for releasing this great model.
I have one question.
The "Training" section says that "The model was trained on around 312.5B tokens from Japanese CC-100, Japanese C4, and Japanese Wikipedia.." .
I think the total number of tokens in these corpora is about 180B, and so this statement means the training epoch is 1.73 epochs (= 312.5 / 180)?
Thank you in advance.
@Tomohide
You are welcome.
Since data processing, filtering, and resampling were applied to the training data, the exact token count may not match your assumption.
However, I believe the final dataset's token count is not far from 180B, so your estimate of about 1.73 epochs should be close enough.
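For reference, here is the back-of-the-envelope calculation as a minimal sketch; the 180B corpus figure is the questioner's assumption, not a number from the model card:

```python
# Rough epoch estimate: tokens seen during training / tokens in the training corpus.
trained_tokens = 312.5e9  # from the model card's "Training" section
corpus_tokens = 180e9     # assumed total for Japanese CC-100 + C4 + Wikipedia (question's estimate)

epochs = trained_tokens / corpus_tokens
print(f"Approximate epochs: {epochs:.2f}")  # ~1.74 (the thread rounds this down to 1.73)
```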
Tomohide changed discussion status to closed