Difference between this and 8k version?

#1
by flashvenom

Seems like the training data and params are the same. Other than the difference in config, what is different about this model?

This model was trained with a scaling factor of 0.125 using the same technique that was used on the 8K model, so it should have a max sequence length of 16384.
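
Concretely, the scaling factor is just a multiplier applied to the position index before the rotary frequencies are computed. A minimal sketch in plain PyTorch (the helper name is made up here; this is not the actual training patch):

```python
import torch

def rotary_cos_sin(seq_len: int, dim: int, scale: float = 0.125, base: float = 10000.0):
    """Plain LLaMA-style rotary tables; `scale` is the only non-stock part."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float() * scale  # the interpolation step
    freqs = torch.outer(positions, inv_freq)            # (seq_len, dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)             # (seq_len, dim)
    return emb.cos(), emb.sin()

# With scale=0.125, position 16383 lands at 16383 * 0.125 ≈ 2048,
# i.e. still inside the 0-2047 position range the base model was pre-trained on.
cos, sin = rotary_cos_sin(seq_len=16384, dim=128, scale=0.125)
```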

Ah, I wonder how training on the 0.125 scaling factor affects performance at lower context lengths, i.e. how the 8K model and the 16K model perform at, let's say, a 4K context length and a 1:2 ratio.

A similar perplexity test should show what performance difference exists. Since the only difference between this and the 8K version is the scaling factor and max length, it should be easy to compare the two. I will also try training one with a scaling of just 0.5 (4096 max length).
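
A rough sketch of what such a perplexity test could look like (placeholder model path and eval text, non-overlapping windows; a sliding-window eval would be more rigorous):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_path: str, text_file: str, ctx_len: int = 2048) -> float:
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"
    )
    model.eval()

    # Tokenize the whole eval text once, then score non-overlapping windows.
    ids = tok(open(text_file).read(), return_tensors="pt").input_ids[0]
    losses = []
    with torch.no_grad():
        for start in range(0, ids.numel() - ctx_len, ctx_len):
            window = ids[start:start + ctx_len].unsqueeze(0).to(model.device)
            out = model(window, labels=window)  # HF shifts the labels internally
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

print(perplexity("merged-16k", "eval.txt", ctx_len=2048))
```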

Sounds good, I'll merge the adapter and see if I can get some numbers for you. I'm curious to see how it changes.

I have some results for you:

At 2K context: the 16K merge has a ppl of 7.3050, the 8K merge has a ppl of 7.5387
At 8K context, scaling 4: the 16K is at 7.7976, the 8K is at 7.7789
At 16K context, scaling 8: the 16K is at 9.4433, the 8K is at 11.3963

Very interesting that the 16K model seems to be somewhat "better" at 2K context, and also proof that the 16K training works, given the lower ppl at the higher scaling.
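
For reference, the scaling values in these numbers are just the target context divided by LLaMA's original 2048-token window, and the training-side factor is the reciprocal. A quick sanity check:

```python
# LLaMA's original context window
ORIGINAL_CTX = 2048

for target_ctx in (2048, 8192, 16384):
    inference_scale = target_ctx / ORIGINAL_CTX   # the "scaling" used in the eval above
    training_factor = ORIGINAL_CTX / target_ctx   # the reciprocal, e.g. 0.125 for 16K
    print(target_ctx, inference_scale, training_factor)
# 2048  1.0  1.0
# 8192  4.0  0.25
# 16384 8.0  0.125
```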

This sounds silly, but can we train a 32K/64K model? I wonder if this trend will continue for some reason. We do need to test recall at larger context lengths too, but given this pattern, a 64K SuperHOT running at 8K context will probably be better than an 8K SuperHOT.

The lower ppl at 2K context might just be because the 16K model was trained with LoRA rank 4 and the 8K model with LoRA rank 2.

@flashvenom I would expect the ppl to be lower at 16K for the one trained with 16K, since it learns the proper dilated frequencies. Still, it's surprising that it has lower ppl at the short range as well.

can we train a 32K/64K model?

This model was just a test to see if my idea was correct: that even with only 4K data, it should work when you go to 16K. It seems to work, so I would encourage others to try going even higher to find the limit, now that you don't need 16K data to train to 16K context. I also want to investigate some other architectural changes we could make.

Training 32K/64K models might be worthwhile for experiment's sake, but not a lot of people have enough VRAM to run that.
This is a great breakthrough, but we may want to look at new ways to increase context without increasing VRAM requirements as much. Still, it's amazing that we could increase context that much.
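
To put a rough number on the VRAM point, here is a back-of-the-envelope sketch, assuming LLaMA-13B shapes (40 layers, hidden size 5120) and an fp16 KV cache, which grows linearly with context regardless of how positions are encoded:

```python
def kv_cache_gib(seq_len: int, n_layers: int = 40, hidden: int = 5120, bytes_per_val: int = 2) -> float:
    """fp16 K and V tensors of shape (seq_len, hidden) per layer."""
    return 2 * n_layers * seq_len * hidden * bytes_per_val / 1024**3

for ctx in (2048, 8192, 16384, 65536):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gib(ctx):.1f} GiB of KV cache")
# 2048 -> ~1.6 GiB, 8192 -> ~6.2 GiB, 16384 -> ~12.5 GiB, 65536 -> ~50.0 GiB
```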

@alkeryn This method is solely an augmentation to the positional encodings. It only allows the context of the pre-trained model to be increased without using much training data or compute. The issue of quadratic attention is orthogonal to this method (e.g. a fast attention method will likely have nothing to do with position encoding). Besides, mechanisms such as xformers and flash attention also exist, not to mention the recent vLLM, all of which can work alongside it, since the issues of attention cost and KV cache are entirely outside the domain of position information.

That said, I am currently working on a method to alleviate the memory issue.

EDIT: To clarify further, extending context here is tackled as a position encoding problem. Using that context is a separate issue.
