Attention mask doesn't seem to take effect in any layer except the first
I pass padded input_ids to the model together with an attention mask so that the padding tokens are ignored. However, the model still seems to compute outputs for every token, including the padding tokens I masked out.
The following is an example.
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import AutoModel, AutoConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
model_1 = AutoModel.from_pretrained("meta-llama/Llama-3.2-1B")

# Count all parameters
print(f"Number of parameters: {model.num_parameters()}")

prompt = "Once upon a time in a faraway land, "
tokenizer.pad_token = tokenizer.eos_token
# Pad the prompt to 512 tokens; attention_mask is 0 at the padded positions
inputs = tokenizer(prompt, truncation=True, padding='max_length', max_length=512, return_tensors='pt')

with torch.no_grad():
    output = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], output_hidden_states=True)
    output_1 = model_1(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']).last_hidden_state
    output_2 = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])

logits = output.logits  # Raw, unnormalized scores (not probabilities)
print(logits.shape)
next_token = logits[:, -1, :].argmax(dim=-1)  # Highest-scoring token at the last position
next_token_text = tokenizer.decode(next_token)
print(next_token_text)

# Last-layer hidden states from the causal LM and from the base model
print(output['hidden_states'][-1])
print(output_1)
```
Neither of these outputs seems to be affected by the attention mask.
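
For what it's worth, here is a minimal sketch of how I understand a padding mask to behave in scaled dot-product attention, written in plain Python with no dependencies. It is a toy single-head attention (no causal mask, no learned projections), not the Llama or transformers implementation, and `masked_attention` is a hypothetical helper name I made up for illustration. In this sketch the mask only stops positions from attending to padded keys; every position, padded or not, still gets an output row.

```python
import math

def masked_attention(q, k, v, key_mask):
    """Toy single-head scaled dot-product attention with a key padding mask.

    q, k, v: lists of vectors (lists of floats), one per position.
    key_mask: list of 0/1 flags; 0 marks a padding position.
    Assumes at least one key is unmasked.
    Returns one output vector per query position, padded or not.
    """
    d = len(q[0])
    outputs = []
    for qi in q:
        # Masked keys get a score of -inf, so softmax gives them weight 0.
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) if m else float("-inf")
            for kj, m in zip(k, key_mask)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return outputs

mask = [1, 1, 0]  # the last position is padding

# Two inputs that differ only in the content of the padded position.
x = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
x2 = [[1.0, 0.0], [0.0, 1.0], [-3.0, 9.0]]

out = masked_attention(x, x, x, mask)
out2 = masked_attention(x2, x2, x2, mask)

print(len(out))                # still one row per position, padding included
print(out[:2] == out2[:2])     # rows for real tokens ignore the padded content
print(out[2] == out2[2])       # the padded row itself depends on its own query
```

If the real model behaves like this toy, values at the padded positions of the hidden states would be expected to be non-zero, and only the rows where `attention_mask` is 1 would be meaningful to compare.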