Optimization details

#16
by ShaneTian - opened

I noticed that StarCoder2 uses two stages of pre-training, with the stage 2 for long-context training.

Take StarCoder2-15B as an example:

  • In stage 1, rope_theta=1e4, warmup=1000, max_lr=3e-4
  • In stage 2, rope_theta=1e5, max_lr=3e-5

So my questions are:

  • In stage 1, what is min_lr?
  • In stage 2, what are min_lr and warmup?
BigCode org

min_lr is max_lr/10, and the warmup of the second stage is also 1000 steps.
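Putting the reply together with the question, a stage's schedule can be sketched as linear warmup to max_lr followed by decay to min_lr = max_lr/10. This is a minimal sketch, not the actual StarCoder2 training code: the cosine decay shape and the total_steps value are illustrative assumptions, while warmup=1000 and min_lr = max_lr/10 come from the thread.

```python
import math

def lr_at_step(step, max_lr, warmup=1000, total_steps=100_000):
    """Illustrative LR schedule: linear warmup, then cosine decay to min_lr.

    From the thread: warmup = 1000 steps, min_lr = max_lr / 10.
    Assumed for illustration: cosine decay shape, total_steps value.
    """
    min_lr = max_lr / 10
    if step < warmup:
        # Linear warmup from ~0 up to max_lr over the first `warmup` steps.
        return max_lr * (step + 1) / warmup
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

For stage 1 one would call it with max_lr=3e-4, and for stage 2 with max_lr=3e-5; by the reply, both stages use the same warmup of 1000 steps.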

Thanks

ShaneTian changed discussion status to closed
