How was the rope_theta value determined?

#6
by ddh0 - opened

Hi, I see you're using a rope_theta value of 3M. The original Llama 3 70B model used 500k, so I would normally expect a proportional increase of 32768/8192, or a factor of 4, which would give 2M. Instead you're using a value 6 times the original.

This is intriguing to me because I've recently been experimenting with RoPE theta values and I've currently settled on the formula ((n_ctx_desired/n_ctx_train)^(2^(1/4))) * rope_freq_base_train.
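For concreteness, here is a minimal Python sketch of that formula next to plain proportional scaling. The function names are hypothetical and chosen only for illustration. The inputs (8192 training context, 32768 desired context, 500k base) are the numbers from this thread. This is just my heuristic written out, not the method used to pick the 3M value.

```python
def scaled_rope_theta(n_ctx_desired, n_ctx_train, rope_freq_base_train):
    """Heuristic from this post: (ctx ratio)^(2^(1/4)) * training base."""
    ratio = n_ctx_desired / n_ctx_train
    exponent = 2 ** (1 / 4)  # roughly 1.189
    return (ratio ** exponent) * rope_freq_base_train


def proportional_rope_theta(n_ctx_desired, n_ctx_train, rope_freq_base_train):
    """Plain proportional scaling: ctx ratio * training base."""
    return (n_ctx_desired / n_ctx_train) * rope_freq_base_train


if __name__ == "__main__":
    # Numbers taken from the discussion above.
    n_ctx_train = 8192
    n_ctx_desired = 32768
    base_train = 500_000

    print("proportional:", proportional_rope_theta(n_ctx_desired, n_ctx_train, base_train))
    print("heuristic:   ", scaled_rope_theta(n_ctx_desired, n_ctx_train, base_train))
```

With these inputs the heuristic lands at roughly 2.6M, between the proportional 2M and the 3M used here.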

I'd be interested to know more about why this specific rope_theta value was chosen. Thanks!

It came over from abacusai/Smaug-Llama-3-70B-Instruct-32K and I don't know how they determined that value.

sophosympatheia changed discussion status to closed
