Maximum context length (512)

#2
by AsierRG55 - opened

I thought Llama 2's maximum context length was 4,096 tokens. When I tried to run inference with this model, I saw that the maximum context length is 512. What is the reason for this modification?

Thank you

You have to set the n_ctx parameter; 512 is the default.
For example:

from llama_cpp import Llama

# Request a 4,096-token context window; n_gpu_layers=-1 offloads all layers to the GPU.
llm = Llama(model_path="wizardlm-1.0-uncensored-llama2-13b.Q5_K_S.gguf", n_ctx=4096, n_gpu_layers=-1)
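Once loaded with the larger window, generation works as usual. For example (a minimal sketch; the prompt and max_tokens value are just placeholders):

# Uses the `llm` object created above; returns an OpenAI-style completion dict.
output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])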

Thanks a lot. And what about the ctransformers implementation? I don't see the parameter in either AutoModelForCausalLM.from_pretrained or the generation method.

I had to move from ctransformers to llama-cpp-python. https://github.com/abetlen/llama-cpp-python

There's currently a context_length parameter available in ctransformers: https://github.com/marella/ctransformers#config. So you can set it like this:

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7b-Chat-GGUF",
    # ...
    context_length=4096,
)
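Generation itself is unchanged after that; for instance (a minimal sketch using the model loaded above; the prompt and max_new_tokens value are placeholders):

# The ctransformers model object is callable and returns the generated text as a string.
print(model("Q: Name the planets in the solar system. A:", max_new_tokens=64))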

Is there a way to set this when using the Inference Endpoints / API?

@karmiq I did set context_length=4096, but somehow it still says "Token indices sequence length is longer than the specified maximum sequence length for this model (2093 > 2048)."
I am using AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGUF", hf=True, context_length=4096). Can you tell me what the issue is? Thanks

Update: I guess it was a versioning issue. I reinstalled and it works now. Thanks
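For anyone who hits the same warning, it can help to confirm which ctransformers build is actually installed before digging further (a small sketch; nothing model-specific is assumed):

# Print the installed ctransformers version; upgrade with `pip install --upgrade ctransformers` if it is old.
from importlib.metadata import version
print(version("ctransformers"))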
