Clarification on the usage of `short_factor` and `long_factor`?

#49
by J22 - opened

My tests suggest that short_factor and long_factor should be used as follows:

Let max_length be the actual maximum context length, which must be <= 128k.

  1. If max_length is less than 4096, use only short_factor;

  2. If max_length is greater than 4096, use only long_factor.

    Can we use long_factor when max_length is less than 4096? Yes, but it performs worse than short_factor.
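A minimal sketch of the rule above, assuming the factor list is picked once from the planned max_length and then kept for the whole run. The name `choose_factors`, the reading of "128k" as 131072 tokens, and the handling of exactly 4096 (mapped to short_factor here) are assumptions for illustration, not part of the model code.

```python
from typing import Sequence

ORIGINAL_MAX_POSITIONS = 4096       # boundary discussed above
MAX_SUPPORTED_LENGTH = 128 * 1024   # "128k", assumed to mean 131072 tokens


def choose_factors(
    max_length: int,
    short_factor: Sequence[float],
    long_factor: Sequence[float],
) -> Sequence[float]:
    """Pick one factor list for the whole run, based on the planned max_length."""
    if not 0 < max_length <= MAX_SUPPORTED_LENGTH:
        raise ValueError(f"max_length must be in (0, {MAX_SUPPORTED_LENGTH}]")
    # Assumption: a max_length of exactly 4096 is treated as the short case.
    if max_length <= ORIGINAL_MAX_POSITIONS:
        return short_factor
    return long_factor
```

With this approach the choice never changes mid-run, so short_factor and long_factor are never mixed.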

Mixing the two factors does not work, even if they are switched at batch boundaries as in Phi3SuScaledRotaryEmbedding, which means that Phi3SuScaledRotaryEmbedding needs to be fixed.

Please correct me if anything is wrong.

Microsoft org

I agree, though the issue is how to implement that, since we have no information about the true max_length that will be used.

The current implementation relies on the sequence length seen so far during generation and recalculates the inverse frequencies from that length. For every generation shorter than 4096, short_factor is used; otherwise long_factor is used.
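The behavior described above can be sketched roughly as follows. This is a simplified, pure-Python illustration, not the actual Phi3SuScaledRotaryEmbedding code: the function name `inverse_frequencies`, the default `base`, and the handling of exactly 4096 (short_factor here) are assumptions, and the extra scaling applied to the cos/sin values is left out.

```python
from typing import List, Sequence

ORIGINAL_MAX_POSITIONS = 4096  # boundary discussed in this thread


def inverse_frequencies(
    seq_len: int,
    short_factor: Sequence[float],
    long_factor: Sequence[float],
    dim: int,
    base: float = 10000.0,
) -> List[float]:
    """Recompute rotary inverse frequencies from the current sequence length."""
    # The factor list is re-selected on every call, based on seq_len only.
    factors = long_factor if seq_len > ORIGINAL_MAX_POSITIONS else short_factor
    if len(factors) != dim // 2:
        raise ValueError("each factor list needs dim // 2 entries")
    # Standard rotary frequencies, with each one divided by its factor.
    return [1.0 / (f * base ** (2 * i / dim)) for i, f in enumerate(factors)]
```

Calling this with seq_len=4095 and then seq_len=4097 returns different frequencies whenever the two factor lists differ, so a generation that crosses the boundary changes its rotary embeddings partway through.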

One pain point is the boundary around 4096: lengths of 4095 and 4097, for example, will use different values for their rotary embeddings. The switch is not ideal, but my feeling is that keeping short_factor for a generation that was expected to be short and turned out to be long is less reliable than switching to long_factor.

nguyenbh changed discussion status to closed
