Sequential Prefilling
Below is my code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("falcon-mamba-7b")
model = AutoModelForCausalLM.from_pretrained("falcon-mamba-7b", device_map="auto", torch_dtype=torch.bfloat16)

input_text = [", ".join(["Iron Man"] * 7)]
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

# Reference: a single forward pass over the full sequence
a = model(input_ids, output_hidden_states=True).hidden_states

# Sequential prefill: run the first 6 tokens, then feed the remaining tokens with the cache
cache = model(input_ids[:, :6]).cache_params
b = model(input_ids[:, 6:], cache_params=cache, cache_position=torch.tensor([0, 1, 2, 3]), output_hidden_states=True).hidden_states

# The last-token hidden states from the two passes should match
print((a[-1][0][-1] - b[-1][0][-1]).abs().max())
For sequential prefilling like this, the hidden_states produced by the two approaches should be identical for Mamba. However, in my code the printed difference is large. I'm wondering how to set 'cache_position' properly; it seems to only accept a tensor of shape (4,), where 4 is the default conv_kernel size.
There is no example code for features like this. Can anyone help me?
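For reference, here is a per-token continuation loop that I would expect to give the same result as the full forward pass. This is only my guess at how the cache API is meant to be used for single-token steps (in particular, the per-step cache_position values and the explicit use_cache=True are assumptions on my part), so please correct it if it is wrong:

# Prefill the first 6 tokens, then continue one token at a time.
out = model(input_ids[:, :6], use_cache=True)
cache = out.cache_params
for i in range(6, input_ids.shape[1]):
    out = model(
        input_ids[:, i : i + 1],
        cache_params=cache,
        cache_position=torch.tensor([i], device=input_ids.device),  # assumed: absolute position of the current token
        use_cache=True,
        output_hidden_states=True,
    )
    cache = out.cache_params

# Hidden states of the last single-token step, compared against the full pass
b = out.hidden_states
print((a[-1][0][-1] - b[-1][0][-1]).abs().max())

Is this per-token loop the intended usage, or is there a supported way to prefill the remaining tokens in one chunked call?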