Optimization details
#16
by ShaneTian - opened
I noticed that StarCoder2 uses two stages of pre-training, with stage 2 used for long-context training. Take StarCoder2-15B as an example:
- In stage 1: `rope_theta=1e4`, `warmup=1000`, `max_lr=3e-4`
- In stage 2: `rope_theta=1e5`, `max_lr=3e-5`
So the questions are:
- In stage 1, what is `min_lr`?
- In stage 2, what are `min_lr` and `warmup`?
`min_lr` is `max_lr/10`, and the warmup of the second stage is also 1000.
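Putting the numbers together, a minimal sketch of such a schedule: linear warmup over 1000 steps to `max_lr`, then decay to `min_lr = max_lr / 10`. The cosine decay shape and the total-step count are assumptions for illustration; the thread only gives the warmup length and the two learning-rate endpoints.

```python
import math

def lr_at(step, max_steps, warmup=1000, max_lr=3e-4, min_lr_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay to min_lr = max_lr * min_lr_ratio.

    Defaults match stage 1 of StarCoder2-15B as described in this thread;
    for stage 2, pass max_lr=3e-5 (warmup stays 1000).
    """
    min_lr = max_lr * min_lr_ratio
    if step < warmup:
        # Linear ramp: reaches max_lr at the end of warmup.
        return max_lr * (step + 1) / warmup
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For stage 2 the same function applies with `max_lr=3e-5`, giving `min_lr=3e-6` by the `max_lr/10` rule above.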
Thanks
ShaneTian changed discussion status to closed