About the `rope_theta` values of `Yi-34B` and `Yi-34B-200k`

#19 opened by horiz94

A fantastic open-source endeavor!

I'm puzzled by a few aspects:

Firstly, why do both Yi-34B and Yi-34B-200k have such large rope_theta values (5,000,000 and 10,000,000 respectively) in their config.json files? Moreover, before the latest update, Yi-34B-200k even shared the same rope_theta value as Yi-34B. Typically, in line with other open-source projects, shouldn't the rope_theta of a base model be around 10,000? I'm also keen to understand the rope_theta values and training seq_len used during the pre-training and window-extrapolation stages of the Yi-34B(-200k) model. Unfortunately, this information isn't provided in the recently released technical report.
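For context (this is not the Yi team's code, just a minimal sketch of the standard Llama-style RoPE formula): rope_theta is the base of the per-dimension rotation frequencies, so a larger base stretches the longest positional wavelength, which is commonly associated with long-context settings. A head dimension of 128 is assumed here for illustration.

```python
import numpy as np

def rope_wavelengths(theta: float, head_dim: int = 128) -> np.ndarray:
    """Per-dimension wavelengths (in tokens) of standard RoPE for a given base theta."""
    # Standard RoPE inverse frequencies: 1 / theta^(i / d) for i = 0, 2, ..., d - 2
    i = np.arange(0, head_dim, 2)
    inv_freq = 1.0 / (theta ** (i / head_dim))
    return 2 * np.pi / inv_freq  # wavelength of each rotary dimension

for theta in (10_000, 5_000_000, 10_000_000):
    # The slowest-rotating dimension sets how far apart two positions can be
    # before their rotary phases wrap around.
    longest = rope_wavelengths(theta)[-1]
    print(f"theta={theta:>10,}: longest wavelength ~ {longest:,.0f} tokens")
```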

Additionally, there might be a typo in your report. You mentioned:

> We continue to pretrain the model on 5B tokens with a 4M batch size, which translates to 100 optimization steps.

However, shouldn't it be 5,000M / 4M, resulting in 1,250 steps?
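For reference, a quick check of that arithmetic, using only the figures from the quoted sentence:

```python
# Step count implied by the quoted numbers
tokens = 5e9          # 5B continued-pretraining tokens
batch_tokens = 4e6    # 4M tokens per optimization step
print(tokens / batch_tokens)  # 1250.0, not 100
```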

Thank you.

@horiz94 Regarding the large rope_theta values, I can only say that this was a decision we made after many trials and careful consideration. Beyond that, I can't share more than what the report provides.

As for the typo, you are right; I will check whether there is anything our team can do about it at this point. Thank you for your support.
