Hessian context length?
At what context length have you computed the hessians?
500
how well does it work if you do inference over longer contexts than that?
I ask because I'm currently computing Hessians over an 8k context, and I'm wondering what the impact is of using less context for calibration than you want to use at inference.
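For what it's worth, here's a minimal sketch (not the actual quip-sharp code; the sizes are made up for illustration) of how GPTQ/QuIP-style proxy Hessians are accumulated per linear layer. The context length only changes how many token positions feed into the sum of outer products; the Hessian itself stays (d_in x d_in):

```python
# Minimal sketch of per-layer proxy Hessian accumulation, H ~ sum of x x^T over
# calibration tokens. Not the quip-sharp implementation; d_in and n_seqs are
# illustrative (a real 70B-class layer would have d_in around 8192).
import numpy as np

d_in = 1024      # layer input dimension (kept small here for a quick run)
ctx_len = 8192   # calibration context length under discussion
n_seqs = 4       # number of calibration sequences (illustrative)

H = np.zeros((d_in, d_in), dtype=np.float64)
n_tokens = 0
for _ in range(n_seqs):
    # In practice x is the layer's input activation captured with a forward
    # hook; random data just stands in for it here.
    x = np.random.randn(ctx_len, d_in)
    H += x.T @ x          # accumulate outer products over every token position
    n_tokens += ctx_len
H /= n_tokens             # normalization conventions vary between implementations
```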
Nobody seems to know this at the moment.
I don't know, because I haven't tried it. After quantization with a sequence length of 500, the model's perplexity increased from 5.3 (original model) to 5.5.
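In case it helps, here's roughly how such a perplexity number can be measured with Hugging Face transformers by chunking a held-out text at a chosen evaluation context length. The model id and dataset below are placeholders, and pointing eval_ctx above the calibration length is exactly the experiment being asked about:

```python
# Rough sketch of measuring perplexity at a chosen evaluation context length.
# Model id and dataset are placeholders; raise eval_ctx above the calibration
# length to probe the question in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_id = "your/quantized-model"   # placeholder; any causal LM id works
eval_ctx = 2048                     # evaluation context length

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

nlls, n_tokens = [], 0
for start in range(0, ids.numel() - eval_ctx, eval_ctx):
    chunk = ids[start:start + eval_ctx].unsqueeze(0).to(model.device)
    with torch.no_grad():
        out = model(chunk, labels=chunk)          # labels are shifted internally
    nlls.append(out.loss * (chunk.numel() - 1))   # loss is a per-token mean
    n_tokens += chunk.numel() - 1

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"perplexity @ ctx={eval_ctx}: {ppl.item():.2f}")
```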
I'm currently quantizing a model. I don't think this is small enough to try on colab, is it?
An A40 with 80GB of CPU memory is enough,
but I'm not sure about a length of 8k; I'm trying it.
An A40 with 80GB of CPU memory might be sufficient for a length of 8k. I tried one layer, and there were no abnormalities.
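Some back-of-the-envelope numbers on why memory doesn't explode with context length (the figures below are assumptions, not measurements): the Hessian per linear layer stays (d x d) no matter how long the calibration sequences are, and only the activations fed through the layer scale with length:

```python
# Back-of-the-envelope memory arithmetic (assumed sizes, not measured):
# the proxy Hessian per linear layer is (d x d) regardless of context length,
# while the activations at 8k context scale linearly with length.
d = 7168           # hidden size of a 34B-class model (assumed, Yi-34B-style)
ctx = 8192         # calibration context length
batch = 1
fp32, fp16 = 4, 2  # bytes per element

hessian_bytes = d * d * fp32            # one (d x d) fp32 Hessian
act_bytes = batch * ctx * d * fp16      # one fp16 activation tensor at 8k

print(f"one Hessian:       {hessian_bytes / 2**20:.0f} MiB")
print(f"one 8k activation: {act_bytes / 2**20:.0f} MiB")
```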
Yep, I'm doing another 34B model on 48 GB with a context length of 8k, and there's still a little VRAM left.
I meant something different: have you tried your model with a longer context length than 500? How does it behave at inference beyond the context length used for the Hessian?
If you do queries over a context length of 2k tokens, does performance collapse?
I'm asking because I wonder if one can use more context than was used for the Hessian.
My basic idea is to use quantization so we can run long-context LLMs over longer contexts given a certain VRAM budget (locally).
Actually it's simpler: given my hardware, what is the longest query I can run on my system at a given performance tier?
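A rough way to put numbers on that is to estimate the KV cache on top of the quantized weights. The sketch below assumes a Yi-34B-style GQA config (60 layers, 8 KV heads, head dim 128) and an fp16 cache; the weight and overhead sizes are guesses:

```python
# Back-of-the-envelope: max context that fits a given VRAM budget, assuming
# the KV cache is the main per-token cost on top of the (quantized) weights.
# Model config values are assumptions for a Yi-34B-style GQA setup.
n_layers, n_kv_heads, head_dim = 60, 8, 128
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2   # K and V, fp16

vram_budget_gib = 24
weights_gib = 10          # rough size of a ~34B model at 2-3 bits (assumption)
overhead_gib = 2          # activations, fragmentation, etc. (assumption)

free_bytes = (vram_budget_gib - weights_gib - overhead_gib) * 2**30
max_ctx = int(free_bytes // kv_bytes_per_token)
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"rough max context:  ~{max_ctx} tokens")
```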
I suspect that Yi might be the sweet spot. It probably allows queries over a long context in theory, being a high-performance model.
Of course there are also numerous smaller models with long context, but my simple question is: how can we squeeze the best long-context performance out of our hardware?
@KnutJaegersberg Did you ever find this out? I'm thinking of renting an instance on Lambda Cloud to compute the Hessians for CodeLlama 70B so that I can run it on a 24GB card
I did 3 models with QuIP#; one failed. I made 2 at the maximum length over which the models were fine-tuned, so they should be OK.
Currently I'm making a 2-bit GGUF of CodeLlama. It will take the night and tomorrow until noon or so, and uploading will take several hours as well. You can get a model tomorrow or the day after, but I don't know about the effects of calibration data on code models. I feel it will likely still work.
Still, if you feel like it, make Hessians for CodeLlama. But here is my experience on an aging A6000: it took 5 days or more to make Hessians for a 70B model and 2 days for the other steps, so it takes a week in total. On a faster GPU, it might take 4 days or so. That's why I did not make a lot of attempts at QuIP#. It's way faster if you do it at shorter context lengths, but I'm conservative on this. I don't think even the makers have experimented a lot with different context lengths yet. Might be OK.
Cool! I heard GGUF got support for Vulkan, so I'll check out your 2-bit CodeLlama and hopefully I'll be able to run it on a 24 GB card. Otherwise QuIP# 2-bit seems like the only way to get it to fit. I don't want to spend 5 days on GPU cloud compute personally, lol.