0.7071067812
Just found your repo while looking for a copy of Meta-Llama-3-8B without the extra ./original folder to download, and noticed you had asked this:

> As I understand it, the concept of the idea is to make the model think twice but leap the same distance as the original. But why 0.7071067812?
The scale factor to use, eg: solve x^2 = 1/2 --> x = 1/sqrt(2) ≈ 0.7071067812
The strange number comes from the fact that the `k_proj` and `q_proj` matrices get multiplied together. If these just contained a single number (ie: a scalar) and the inputs were the scalars `k_input` and `q_input`:

```
output = (k_input x k_proj) x (q_input x q_proj)
output = 1.0 x k_input x k_proj x q_input x q_proj
```
and if you tried to halve the `k_proj` and `q_proj` values hoping to halve the output:

```
output = (k_input x 0.5 x k_proj) x (q_input x 0.5 x q_proj)
output = k_input x 0.5 x k_proj x q_input x 0.5 x q_proj
output = 0.5 x 0.5 x k_input x k_proj x q_input x q_proj
output = 0.25 x k_input x k_proj x q_input x q_proj
```
so as you can see the output will be quartered and not halved!
So the solution is to use `sqrt(0.5)` like this:

```
output = (k_input x sqrt(0.5) x k_proj) x (q_input x sqrt(0.5) x q_proj)
output = k_input x sqrt(0.5) x k_proj x q_input x sqrt(0.5) x q_proj
output = sqrt(0.5) x sqrt(0.5) x k_input x k_proj x q_input x q_proj
output = 0.5 x k_input x k_proj x q_input x q_proj
```

which now correctly halves the output!

The decimal representation of `sqrt(0.5)` is 0.7071067812, and this is where the number comes from.
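The scalar version of the argument is easy to check numerically. Here's a minimal sketch in plain Python (the specific weight and input values are made up for illustration) showing that scaling both projections by `0.5` quarters the output, while `sqrt(0.5)` halves it:

```python
import math

# Hypothetical scalar stand-ins for the projection weights and inputs
# (the same argument applies elementwise to the real matrices).
k_proj, q_proj = 0.8, -1.3
k_input, q_input = 0.5, 2.0

def output(scale):
    # Scale both projections by the same factor, then multiply the two
    # projected values together, mirroring the k/q product above.
    return (k_input * scale * k_proj) * (q_input * scale * q_proj)

base = output(1.0)
print(output(0.5) / base)             # 0.25 -> quartered, not halved
print(output(math.sqrt(0.5)) / base)  # ~0.5 -> halved, as intended
```

Because the scale factor appears once per projection, it always enters the output squared, which is exactly why the factor you want is the square root of the shrinkage you're after.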
Hope this helps! :)
Can’t believe you opened a discussion on my model! I’m a huge fan of your control vectors, merging, MIQU models, etc. Your passion for this work always thrills me! Thank you for the kind explanation, and keep up the great work as always. Regards!
Thanks! :)