Method used to extend the context?

#1
by Doctor-Shotgun

This doesn't seem to use a higher rope theta, and no rope scaling is specified in the config. What method is being used to extend the context length to 64k?

EDIT: Looking at the safetensor hashes, this appears to just be a re-upload of the original 8k ctx model, but with max_position_embeddings edited to 64k in config.json. Bruh...
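For reference, the config can be checked directly. A minimal sketch assuming the usual transformers LlamaConfig fields; the repo id and the printed values in the comments are illustrative, not verified:

```python
from transformers import AutoConfig

# Repo id assumed to match this upload; adjust if it differs.
cfg = AutoConfig.from_pretrained("NurtureAI/Meta-Llama-3-70B-Instruct-64k")

print(cfg.max_position_embeddings)  # bumped to ~64k here vs. 8192 in the base model
print(cfg.rope_theta)               # 500000.0, i.e. Llama 3's default, unchanged
print(cfg.rope_scaling)             # None -> no linear/dynamic/NTK scaling configured
```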

NurtureAI org
edited Apr 27

I haven't finetuned it yet, but here is a finetuned version if that's what you're looking for: https://huggingface.co/winglian/Llama-3-8b-64k-PoSE

EDIT: It's not 70B though. Isn't changing the rope scaling and theta also just a config change? Bruh....

NurtureAI org

I haven't posted this model anywhere, nor have I claimed it to be anything else. I just use NurtureAI for testing and downloading to cloud instances for further testing and fine-tuning.


What I meant is that typically you'd need to adjust the rope scale or theta and then finetune on top of that to effectively extend the context, so it was a bit curious to see those parameters left unchanged here.
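For concreteness, a typical recipe would change the rope parameters in the config and then train on long sequences on top of that. A rough sketch with illustrative values (not what any released model actually used):

```python
from transformers import AutoConfig, AutoModelForCausalLM

base = "meta-llama/Meta-Llama-3-70B-Instruct"  # gated base weights

config = AutoConfig.from_pretrained(base)
config.max_position_embeddings = 65536   # advertise 64k positions
config.rope_theta = 4_000_000.0          # raise the RoPE base (value is illustrative)
# ...or set config.rope_scaling instead, depending on the extension method.

# Load the 8k-trained weights under the new config, then finetune on long
# sequences so the model actually learns to use the extra positions.
model = AutoModelForCausalLM.from_pretrained(base, config=config, torch_dtype="auto")
```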

Regardless, I think it's fairly misleading to have this model listed as Meta-Llama-3-70B-Instruct-64k when it is not actually a context-extended model.

NurtureAI org

It has the same model card, but thanks for your input. Follow for updates.

NurtureAI org
edited Apr 28

Just because a model has not been fine-tuned with 64k context does not mean that it is not set for 64k context, which is what the title refers to. Most of my work is private, but even though it's not fine-tuned, it's performing really well for some tasks at the current settings, which is why it's here. Like I said before, it's for testing and further fine-tuning.

I love doing this as a hobby, but only as my time allows; my day job takes up most of my time and compute/resources. So if I ever decide to release anything to the public, you know where to find it.

Yeah, I was searching HF to see if anyone had done any context extensions of L3 70B/Instruct and this repo came up. I just wanted to make sure that nobody doing the same spends the resources to download and quant it thinking it's been finetuned for 64k. If you do end up finetuning these 70Bs for 64k, it would definitely be interesting to try out. 8k feels rather limiting in this age.

NurtureAI org

No problem, yeah, I hear you. Makes sense, and I definitely agree with you there. I will probably do 8B first to experiment with a few different datasets and see which rope theta gives the most gains, but that is definitely my end goal.
