About the `rope_theta` values of `Yi-34B` and `Yi-34B-200k`

#19 opened by horiz94

A fantastic open-source endeavor!

I'm puzzled by a few aspects:

Firstly, why do both Yi-34B and Yi-34B-200k have such large rope_theta values (5,000,000 and 10,000,000 respectively) in their config.json files? Moreover, before the latest update, Yi-34B-200k even shared the same rope_theta value as Yi-34B. Typically, in line with other open-source projects, shouldn't the rope_theta of a base model be around 10,000? I'm also keen to understand the rope_theta values and training seq_len used during the pre-training and window-extrapolation stages of the Yi-34B(-200k) model. Unfortunately, this information isn't provided in the recently released technical report.
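For context (this is not the Yi team's code, just a minimal sketch of the standard Llama-style RoPE formula): rope_theta is the base of the per-dimension rotation frequencies, so a larger base stretches the longest positional wavelength, which is commonly associated with long-context settings. A head dimension of 128 is assumed here for illustration.

```python
import numpy as np

def rope_wavelengths(theta: float, head_dim: int = 128) -> np.ndarray:
    """Per-dimension wavelengths (in tokens) of standard RoPE for a given base theta."""
    # Standard RoPE inverse frequencies: 1 / theta^(i / d) for i = 0, 2, ..., d - 2
    i = np.arange(0, head_dim, 2)
    inv_freq = 1.0 / (theta ** (i / head_dim))
    return 2 * np.pi / inv_freq  # wavelength of each rotary dimension

for theta in (10_000, 5_000_000, 10_000_000):
    # The slowest-rotating dimension sets how far apart two positions can be
    # before their rotary phases wrap around.
    longest = rope_wavelengths(theta)[-1]
    print(f"theta={theta:>10,}: longest wavelength ~ {longest:,.0f} tokens")
```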

Additionally, there might be a typo in your report. You mentioned:

> We continue to pretrain the model on 5B tokens with a 4M batch size, which translates to 100 optimization steps.

However, shouldn't it be 5,000M / 4M, resulting in 1,250 steps?
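For reference, a quick check of that arithmetic, using only the figures from the quoted sentence:

```python
# Step count implied by the quoted numbers
tokens = 5e9          # 5B continued-pretraining tokens
batch_tokens = 4e6    # 4M tokens per optimization step
print(tokens / batch_tokens)  # 1250.0, not 100
```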

Thank you.

@horiz94 Regarding the large rope_theta values, I can only say that this was a decision we made after many trials and careful consideration. Beyond that, I can't share more than what the report provides.

As for the typo, you are right; I will check whether there is anything our team can do about it at this point. Thank you for your support.
