Shorter context window to reduce inference memory allocation

#31
by JochenGrey - opened

Is it possible to shorten the context length to e.g. 50k to limit the amount of memory used during inference?
Would the rope scaling factors need to be adjusted for a shorter inference context?
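One possibility (a minimal, untested sketch, assuming the public microsoft/Phi-3-vision-128k-instruct checkpoint and a recent transformers install): cap the actual sequence length at generation time rather than touching the configured window. With the default dynamic cache, KV-cache memory grows with the tokens actually processed, so short prompts plus a small max_new_tokens keep memory down without changing the rope factors.

```python
# Sketch: keep the real sequence short instead of shrinking the 128k window.
# Assumptions: microsoft/Phi-3-vision-128k-instruct, recent transformers, one GPU.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Text-only example prompt in the Phi-3 chat format.
prompt = "<|user|>\nDescribe the scene.<|end|>\n<|assistant|>\n"
inputs = processor(prompt, return_tensors="pt").to(model.device)

# Total sequence = prompt + up to 256 new tokens, far below 128k,
# so the KV cache stays small regardless of max_position_embeddings.
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```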

Perhaps a larger context is needed to reduce inference time?

@mikestaub why would a larger context reduce inference time? If you fill, say, 3k of context with real tokens, doesn't the remaining ~125k get filled with padding?
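A quick way to check the padding assumption (a small sketch, assuming standard transformers tokenizer behavior): by default the tokenizer does not pad inputs out to the configured maximum, so the model only attends over the tokens you actually pass in and the KV cache is sized by that real length, not by 128k.

```python
# Check that a short prompt stays short: no padding to the 128k maximum
# unless you explicitly request padding="max_length".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct", trust_remote_code=True
)

text = "A short prompt of a few dozen tokens " * 10
enc = tokenizer(text, return_tensors="pt")
print(enc["input_ids"].shape)      # e.g. torch.Size([1, ~80]), not [1, 131072]
print(tokenizer.model_max_length)  # the configured maximum, e.g. 131072
```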
Is there any way to reduce the context length? For example, mimicking as if Phi-3-vision had been CLIP + Phi3-4k-instruct (rather than the 128k)?
@JochenGrey - Any idea how to reduce the context length?
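If you want a smaller configured window rather than just shorter inputs, an untested sketch would be to override the config before loading. Caveat: this checkpoint uses longrope/su rope scaling tied to the original short window of the Phi-3 family, so shrinking max_position_embeddings may interact with the rope factors and should be verified against the unmodified model.

```python
# Untested sketch: force a smaller context window at load time.
# The 4096 cap is a hypothetical value; verify outputs before relying on it.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "microsoft/Phi-3-vision-128k-instruct"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.max_position_embeddings = 4096  # hypothetical reduced window

model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, trust_remote_code=True
)
```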
