increased context

by Diavator:

Good afternoon. Have you considered creating a 20B model with a 16k or 32k context? The problem is that no such models exist; the models with extended context are mostly 7B, 16B, 33B and higher.

Unfortunately for all of us, there is a non-linear relationship between the trained context length and the VRAM and time required.

For each epoch (full training pass), this 20B model @ 4096 context needs 98% of the VRAM (78 GB) and 100% of the GPU on an 80 GB A100 at a batch size of 1, and it takes 5 hours (~$10 in rental cost).

Every additional ~512 tokens of context roughly doubles the VRAM and the processing requirements for training. There are ~24 such 512-token increases between 4k and 16k context (forced to round up by VRAM), and another 32 from 16k to 32k. So @ 16k context it would take hundreds of A100s for five hours (many thousands of dollars in cost), and I don't think there are enough GPUs for rent on the planet to train a 20B 32k model.
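If it helps to see the arithmetic laid out, here is a rough back-of-the-envelope sketch in Python using the figures above. The $2/GPU-hour rate is implied by the $10 / 5-hour baseline, and the 300-GPU fleet size is just an assumption standing in for "hundreds of A100s":

```python
# Back-of-the-envelope numbers from the post above. The per-GPU-hour rate is
# derived from the stated ~$10 for 5 hours on one 80 GB A100; the fleet size
# used for the 16k estimate is illustrative, not a measurement.

BASE_CTX = 4096          # context length the 20B model was actually trained at
BASE_VRAM_GB = 78        # observed VRAM use at 4k context, batch size 1
BASE_HOURS = 5           # one full epoch at 4k context
BASE_COST_USD = 10       # rental cost of one A100-80GB for those 5 hours
STEP = 512               # granularity at which context (and VRAM) grows

def steps_between(ctx_from: int, ctx_to: int, step: int = STEP) -> int:
    """Number of 512-token increments between two context lengths (rounded up)."""
    return -(-(ctx_to - ctx_from) // step)  # ceiling division

print(steps_between(4096, 16384))   # ~24 increments from 4k to 16k
print(steps_between(16384, 32768))  # another ~32 from 16k to 32k

# Implied rental rate and an illustrative fleet cost for a 16k run:
rate_per_gpu_hour = BASE_COST_USD / BASE_HOURS   # ~$2 per A100-hour
fleet = 300                                       # "hundreds of A100s" (assumed)
print(fleet * BASE_HOURS * rate_per_gpu_hour)     # ~$3,000 for five hours
```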

OpenAI, Google, and the like can do it because they have massive server farms and trade-secret software. Mistral can do it thanks to sliding-window attention (SWA), something Llama 2 does not support. At this size, I think 4k is the best we'll see until Llama 3.
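For context on why SWA matters: with sliding-window attention each token only attends to a fixed window of recent tokens, so the attention cost stops growing quadratically with context length. A minimal, purely illustrative mask sketch (not Mistral's actual implementation):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal attention mask where each token only sees the previous `window` tokens.

    With full causal attention the score matrix holds ~seq_len**2 / 2 entries,
    so memory grows quadratically with context; with a sliding window it is
    bounded by roughly seq_len * window, which is why a fixed window keeps
    long-context training affordable. (Illustrative sketch only.)
    """
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=4)
print(mask.astype(int))
# Each row has at most 4 ones, regardless of how long the sequence grows.
```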
