Error: inference on long-context text

#6 · opened by ZhangYuanhan

This is my demo code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Initialize the tokenizer and model from the pretrained version on Hugging Face
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5-16k")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5-16k")

# Build a long dummy prompt (tokenizes to roughly 10k tokens)
text = "text" * 10000
inputs = tokenizer(text, return_tensors="pt", max_length=16384, truncation=True)

# Generate output using the model
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=16384, num_return_sequences=1)

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)

This is the error I get:

    padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
RuntimeError: The size of tensor a (8192) must match the size of tensor b (10001) at non-singleton dimension 3
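
The 10001 here is just the length of the tokenized prompt (the mask's last dimension equals the number of input tokens); a quick check, reusing the inputs from the snippet above:

# Sanity check: the prompt length in tokens should match the second size in the error (10001)
print(inputs["input_ids"].shape[-1])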

It seems the causal mask is pre-allocated with a maximum size of 8192, so it cannot cover the 10001-token prompt:
https://github.com/huggingface/transformers/blob/0290ec19c901adc0f1230ebdccad11c40af026f5/src/transformers/models/llama/modeling_llama.py#L1079
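
For now I can think of two possible workarounds (both just guesses on my side): upgrading transformers, in case the static causal-mask buffer has already been reworked upstream, or overriding max_position_embeddings at load time so the pre-allocated mask covers the full 16k context. A minimal sketch of the latter, assuming from_pretrained forwards this kwarg to the model config and that this config value is what sizes the mask buffer:

from transformers import AutoModelForCausalLM

# Untested workaround: enlarge the pre-allocated causal-mask buffer by overriding
# max_position_embeddings when loading the model (assumption: this config value is
# what determines the size of the registered causal_mask buffer)
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5-16k",
    max_position_embeddings=16384,
)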

env:
transformers: 4.38.2
