Inference API doesn't seem to support 100k context window

#16
by mlschmidt366 - opened

Hi,

I am trying to use HF's Inference API to interact with the model from a Gradio app. For larger inputs, I receive a validation error: "Input validation error: inputs tokens + max_new_tokens must be <= 8192". Is this a limitation of this HF deployment, or am I using the Inference API wrong? The blog post says that CodeLlama should support up to 100k tokens of input. How can I achieve that with this model?
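For reference, a minimal sketch of the kind of call that hits this limit, using `huggingface_hub`'s `InferenceClient` (the model ID and prompt here are placeholders, not taken from the original post):

```python
from huggingface_hub import InferenceClient

# Hosted Inference API client (placeholder model ID).
client = InferenceClient(model="codellama/CodeLlama-13b-hf")

# e.g. a large source file pasted into the Gradio app
long_prompt = "..."

# Fails once prompt tokens + max_new_tokens exceed 8192 with:
# "Input validation error: inputs tokens + max_new_tokens must be <= 8192"
output = client.text_generation(long_prompt, max_new_tokens=512)
print(output)
```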

You have to extend the context window using RoPE scaling when launching the model yourself with text-generation-inference:

text-generation-launcher \
    --model-id $MODEL_ID \
    --rope-scaling dynamic \
    --max-input-length 16384 \
    --max-total-tokens 32768 \
    --max-batch-prefill-tokens 16384 \
    --hostname 0.0.0.0 \
    --port 3000
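Note that these launcher flags can't be passed to the shared hosted Inference API; they require running TGI yourself (or on an Inference Endpoint). Once such a server is up, longer inputs can be sent to its `/generate` route. A minimal sketch, assuming the server above is reachable at localhost:3000 (URL, prompt, and parameter values are placeholders):

```python
import requests

# Query the self-hosted TGI server started with the launcher command above.
resp = requests.post(
    "http://localhost:3000/generate",
    json={
        # Placeholder prompt; with the flags above, inputs of up to
        # 16384 tokens are accepted instead of the hosted API's 8192 cap.
        "inputs": "def fibonacci(n):",
        "parameters": {"max_new_tokens": 256},
    },
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```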

I am also having this problem; I'm trying to use LangChain.
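If you self-host with the launcher flags above, LangChain can point at that server through its text-generation-inference wrapper. A minimal sketch, assuming a TGI server at a placeholder URL (the prompt is illustrative):

```python
from langchain.llms import HuggingFaceTextGenInference

# LangChain wrapper around a self-hosted TGI server with RoPE scaling
# enabled (the URL is a placeholder for wherever the launcher is running).
llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:3000",
    max_new_tokens=512,
)
print(llm("def quicksort(arr):"))
```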

I'm having the same issue. Anybody have any insight? Is this configurable, or is it a hard limit of the hosted Inference API for this model?
