Optimization details

#16
by ShaneTian - opened

I noticed that StarCoder2 uses two stages of pre-training, with the stage 2 for long-context training.

Take StarCoder2-15B as an example:

  • In stage 1, rope_theta=1e4, warmup=1000, max_lr=3e-4
  • In stage 2, rope_theta=1e5, max_lr=3e-5

So my questions are:

  • In stage 1, what is min_lr?
  • In stage 2, what are min_lr and warmup?
BigCode org

min_lr is max_lr/10, and the warmup of the second stage is also 1000 steps.
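Putting the reply together with the question, a stage's schedule can be sketched as linear warmup to max_lr followed by decay to min_lr = max_lr/10. This is a minimal sketch, not the actual StarCoder2 training code: the cosine decay shape and the total_steps value are illustrative assumptions, while warmup=1000 and min_lr = max_lr/10 come from the thread.

```python
import math

def lr_at_step(step, max_lr, warmup=1000, total_steps=100_000):
    """Illustrative LR schedule: linear warmup, then cosine decay to min_lr.

    From the thread: warmup = 1000 steps, min_lr = max_lr / 10.
    Assumed for illustration: cosine decay shape, total_steps value.
    """
    min_lr = max_lr / 10
    if step < warmup:
        # Linear warmup from ~0 up to max_lr over the first `warmup` steps.
        return max_lr * (step + 1) / warmup
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

For stage 1 one would call it with max_lr=3e-4, and for stage 2 with max_lr=3e-5; by the reply, both stages use the same warmup of 1000 steps.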

Thanks

ShaneTian changed discussion status to closed
