Sequential Prefilling

#13 by CyberDancer

Below is my code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("falcon-mamba-7b")
model = AutoModelForCausalLM.from_pretrained("falcon-mamba-7b", device_map="auto", torch_dtype=torch.bfloat16)

input_text = [", ".join(["Iron Man"] * 7)]
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

# reference: hidden states from a single full forward pass
a = model(input_ids, output_hidden_states=True).hidden_states

# prefill the first 6 tokens and keep the returned Mamba cache
cache = model(input_ids[:, :6]).cache_params

# continue with the remaining tokens, reusing the cache
b = model(input_ids[:, 6:], cache_params=cache, cache_position=torch.tensor([0, 1, 2, 3]), output_hidden_states=True).hidden_states

# compare the last hidden state of the final token from both paths
print((a[-1][0][-1] - b[-1][0][-1]).abs().max())

For sequential prefilling like this, the hidden states produced by the two paths should be identical for Mamba. However, my code doesn't reproduce that. I'm wondering how to set 'cache_position' properly; it seems to only accept a tensor of shape (4), where 4 is the default conv_kernel size.

There is no example code for features like this. Can anyone help me?
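For reference, below is a token-by-token variant of the second pass that I would expect to match the full forward, based on how generate() drives the Mamba cache in recent transformers versions (a normal prefill call with use_cache=True, then one token per step with a single-element cache_position). This is only a sketch under that assumption, not verified code, and small bfloat16-level differences between the prefill and decode kernels may remain:

import torch

with torch.no_grad():
    # reference: one full forward pass over the whole prompt
    ref = model(input_ids, output_hidden_states=True).hidden_states

    # prefill the first 6 tokens; with no cache passed in, the model creates
    # the cache and the initial cache_position itself
    out = model(input_ids[:, :6], use_cache=True)
    cache = out.cache_params

    # feed the remaining tokens one at a time, the way generate() does,
    # passing a single-element cache_position that keeps increasing
    last_hidden = None
    for i in range(6, input_ids.shape[1]):
        out = model(
            input_ids[:, i : i + 1],
            cache_params=cache,
            use_cache=True,
            cache_position=torch.tensor([i], device=input_ids.device),
            output_hidden_states=True,
        )
        cache = out.cache_params
        last_hidden = out.hidden_states

    # compare the final token's last hidden state from both paths
    print((ref[-1][0, -1] - last_hidden[-1][0, -1]).abs().max())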
