Context Length?

#1 by brucethemoose - opened

I'm very excited to try this model, especially once the DPO version comes out!

Just out of curiosity, what context length was it trained at?

I used 4096, mainly because nearly all of the instructions fell within that (and the vast majority were well under it). I may do one more pass on the full scripts of Cinematika to add some coherence out to tens of thousands of tokens, but it's costly.
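To put a rough number on that, here's a sketch of how you could measure what fraction of an instruction set fits within 4096 tokens (the dataset name is a placeholder and the Yi tokenizer is assumed):

```python
# Sketch: measure how many samples of an instruction dataset fit in 4096 tokens.
# The dataset name is a placeholder and a single "text" field is assumed.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B-200K")
dataset = load_dataset("your/instruction-dataset", split="train")

max_len = 4096
lengths = [len(tokenizer(sample["text"]).input_ids) for sample in dataset]
within = sum(1 for n in lengths if n <= max_len)
print(f"{within / len(lengths):.1%} of samples fit in {max_len} tokens")
```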

DPO version is ready BTW.

Yeah, I just saw that! It's hard to find stuff on HF. For what it's worth, other 200K finetunes seem to preserve some long-context performance even when trained at 4K, but I would be extremely interested in a bagel finetuned out to just 40K-75K.

I'm not sure what you use to train, but you might find this graph of VRAM usage and perplexity versus context length from a paper interesting: https://github.com/huggingface/peft/issues/958

As well as unsloth, which does reduce VRAM usage significantly: https://github.com/unslothai/unsloth
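For reference, a minimal unsloth sketch, assuming its FastLanguageModel API and that the base model is supported: load the model 4-bit quantized with a longer training sequence length, then attach LoRA adapters.

```python
# Minimal unsloth sketch (assumed API/support): 4-bit load with a longer
# training sequence length, plus standard LoRA adapters on top.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="01-ai/Yi-34B-200K",  # illustrative base; swap in your own
    max_seq_length=16384,            # illustrative long-context training length
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)
```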

Technically, unsloth and axolotl don't integrate LongLoRA into LoRA training, but it's probably fine?
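To be concrete about the LongLoRA part: its shifted sparse attention isn't wired into those trainers, but the other half of the recipe, making the embedding and norm layers trainable alongside the adapters, can be approximated with plain peft, roughly like this (module names assume a Llama-style model):

```python
# Rough peft approximation of LongLoRA's "also train embeddings and norms"
# trick; the shifted sparse attention part is not covered here.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "norm"],  # trained in full, per LongLoRA
    task_type="CAUSAL_LM",
)
```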

I used a mix of QLoRA and some full-weight tuning for this. Thanks for the link and the info, very interesting!
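For anyone curious, a generic QLoRA setup sketch with transformers + peft (not the exact configuration used for bagel; hyperparameters are illustrative):

```python
# Generic QLoRA sketch (illustrative hyperparameters, not bagel's actual config):
# 4-bit NF4 quantization via bitsandbytes, then LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B-200K", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()
```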

I'd probably do full-weight training if/when I try extending the trained context length, but I was hoping it would just inherit the longer-context capabilities from the base.

I hope you do! But of course I appreciate the DPO finetune as is!

> but I was hoping it would just inherit the longer-context capabilities from the base.

Other models do; I'm still quantizing bagel to test it myself. But I bet long-context data would really help. This would make Bagel 34B particularly unique, as no one else (AFAIK) is really finetuning Yi 200K at long context.
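One quick sanity check for whether the 200K settings carried over from the base is just to diff the rope-related config values (the repo ids below are my assumption):

```python
# Sketch: compare long-context-relevant config values between the base model
# and the finetune (the bagel repo id here is an assumption).
from transformers import AutoConfig

base = AutoConfig.from_pretrained("01-ai/Yi-34B-200K")
tune = AutoConfig.from_pretrained("jondurbin/bagel-34b-v0.2")

for key in ("max_position_embeddings", "rope_theta", "rope_scaling"):
    print(key, getattr(base, key, None), getattr(tune, key, None))
```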

And yeah, I would recommend unsloth in particular; it's just a huge drop-in VRAM saving and speed boost with no downside, at least in my own testing.

brucethemoose changed discussion status to closed
