Clarifications on how to use YaRN

#5
by Downtown-Case - opened

I'm trying to implement YaRN for Qwen 2.5 in a long-context framework, and I'm wrapping my head around the transformers implementation here:

https://github.com/huggingface/transformers/blob/2e24ee4dfa39cc0bc264b89edbccc373c8337086/src/transformers/modeling_rope_utils.py#L163

The documentation mentions we are supposed to add this to the config for >32K usage with Qwen 2.5:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

But it appears the transformers implementation doesn't actually read original_max_position_embeddings; it reads max_position_embeddings from the plain model config.

So let's say we want to run Qwen2.5 at 64K context in plain HF transformers... what exactly do I set? Do I change max_position_embeddings to 64K, or leave it at 32K and let the framework "override" it? Because that's what it's going to read when computing the YaRN scaling factors: https://github.com/huggingface/transformers/blob/2e24ee4dfa39cc0bc264b89edbccc373c8337086/src/transformers/modeling_rope_utils.py#L192

And... is this factor somehow dynamic? I don't see any trigger in transformers that makes it recompute the scale.
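For reference, here is a rough way to probe what the "yarn" init function computes from a given config. This is just a sketch, assuming ROPE_INIT_FUNCTIONS can be imported from transformers.modeling_rope_utils as in the file linked above (transformers>=4.45), and the model ID is only an example:

from transformers import AutoConfig
from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
config.rope_scaling = {"type": "yarn", "factor": 4.0}

# The init function returns the scaled inverse frequencies plus an attention
# scaling term; per the line linked above, it reads max_position_embeddings
# from the config it is given.
inv_freq, attention_scaling = ROPE_INIT_FUNCTIONS["yarn"](config, device="cpu")
print(config.max_position_embeddings, inv_freq.shape, attention_scaling)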

I think it's supposed to be like that. Here's an example from an old Llama 2 model:
https://huggingface.co/NousResearch/Yarn-Llama-2-13b-128k/blob/main/config.json

Qwen org

Unfortunately, I don't think it's currently possible to use transformers for 128K, as YaRN is not supported in transformers. Please use vLLM.

It is though; it's right in the code I linked?

I ported the same code to exllama, and it seems to work.

Qwen org

Oh, sorry if I missed that. I didn't remember that we'd implemented YaRN for Qwen2, but thanks to the HF staff, who are so helpful, it is indeed supported now (since transformers>=4.45.0).

That part of the configuration in the README/model card was originally intended for vLLM, which reads original_max_position_embeddings. But I think using it should also be okay for transformers.

Based on the code you linked, the setting should be the following for transformers:

{
  ...,
  "max_position_embeddings": 32768,
  "rope_scaling": {
    "factor": 4.0,
    "type": "yarn"
  }
}

(max_position_embeddings is already 32768 in config.json.)
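For example, a minimal sketch of applying this override at load time instead of editing config.json (the model ID is only illustrative; substitute your checkpoint):

from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-7B-Instruct"  # illustrative
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"type": "yarn", "factor": 4.0}

model = AutoModelForCausalLM.from_pretrained(model_id, config=config, torch_dtype="auto")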

Qwen org

So let's say we want to run Qwen2.5 at 64K context in plain HF transformers... what exactly do I set? Do I change max_position_embeddings to 64K, or leave it at 32K and let the framework "override" it?

The thing that matters for Qwen2 with YaRN is the RoPE scaling factor. We've tested factor=4, and the context length can be extended to 128K, though accuracy at shorter lengths may degrade. For 64K support you may also need to set factor=4, but factor=2 may be okay too, depending on your evaluation results.
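As a rough rule of thumb (my reading of the reply above, not an official formula), the factor scales with the target context over the native 32K window:

# Rough rule of thumb: factor ~ target context / native 32K window.
original = 32768
for target in (65536, 131072):
    print(f"{target} tokens -> factor {target / original:g}")  # prints 2 and 4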

And... is this factor somehow dynamic? I don't see any trigger in transformers that makes it recompute the scale.

Dynamic and static are kind of vague here. YaRN is static in the sense that the scaling is done in the same manner for all sequence lengths, so it can be precomputed and cached. It is not like DynamicNTK, where the scaling is different for different sequence lengths.
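To illustrate, here is a rough sketch of the YaRN inverse-frequency computation, following the YaRN paper's formulas rather than copying the transformers code (defaults like beta_fast=32 and beta_slow=1, and the dim/base values, are assumptions). The point is that it depends only on config values, so it can be computed once and cached, unlike a DynamicNTK-style scheme where the base changes with the running sequence length:

import math
import torch

def yarn_inv_freq(dim, base, factor, original_max_pos, beta_fast=32, beta_slow=1):
    # Standard RoPE per-dimension frequencies, and a version slowed down by `factor`.
    pos_freqs = base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    inv_freq_extrapolation = 1.0 / pos_freqs
    inv_freq_interpolation = 1.0 / (factor * pos_freqs)

    # Dimension index at which a given number of full rotations fits into the
    # original context window (the boundaries of YaRN's blending ramp).
    def correction_dim(num_rotations):
        return (dim * math.log(original_max_pos / (num_rotations * 2 * math.pi))) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim // 2 - 1)

    # Linear ramp: high-frequency dims keep the original frequencies,
    # low-frequency dims get interpolated, with a smooth blend in between.
    ramp = (torch.arange(dim // 2, dtype=torch.float32) - low) / max(high - low, 1e-3)
    extrapolation_factor = 1.0 - ramp.clamp(0.0, 1.0)

    return (inv_freq_interpolation * (1.0 - extrapolation_factor)
            + inv_freq_extrapolation * extrapolation_factor)

# Depends only on config values -> computed once and cached ("static").
inv_freq = yarn_inv_freq(dim=128, base=1000000.0, factor=4.0, original_max_pos=32768)
print(inv_freq.shape)  # one entry per rotary dimension pair

# A DynamicNTK-style scheme would instead recompute the base (and hence
# inv_freq) as a function of the current sequence length, so it changes as
# the sequence grows past the original window.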
