I pass padded input_ids to the model together with an attention mask so that the padding tokens are ignored. However, the model still seems to compute outputs for every token, including the padding tokens I masked out.

The following is an example.

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
model_1 = AutoModel.from_pretrained("meta-llama/Llama-3.2-1B")

# Count all parameters
print(f"Number of parameters: {model.num_parameters()}")

prompt = "Once upon a time in a faraway land, "
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the pad token
inputs = tokenizer(prompt, truncation=True, padding='max_length', max_length=512, return_tensors='pt')

with torch.no_grad():
    # Causal LM head: logits plus the hidden states of every layer
    output = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], output_hidden_states=True)
    # Bare model: final-layer hidden states only
    output_1 = model_1(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']).last_hidden_state
    output_2 = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])

logits = output.logits  # Unnormalized scores, shape (batch, seq_len, vocab_size)
print(logits.shape)
# Read the prediction at the last real token. Position -1 is a pad token
# when the batch is right-padded (assumed here).
last_real = inputs['attention_mask'].sum(dim=1) - 1
next_token = logits[torch.arange(logits.size(0)), last_real].argmax(dim=-1)  # Most probable next token
next_token_text = tokenizer.decode(next_token)
print(next_token_text)

print(output['hidden_states'][-1])
print(output_1)
```

Neither of these outputs seems to respect the attention mask.
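As a sanity check on what a padding mask is expected to do, here is a minimal sketch in plain Python. It is a toy single-head attention, not the Llama implementation, and `attention`, `real`, and `pad` are made-up names for illustration. It assumes the mask only hides padded keys from the other tokens; under that assumption the padded positions still get an output row, and what should hold is that the rows of the real tokens are unchanged by the padding.

```python
import math

def attention(queries, keys, values, key_mask):
    """Toy single-head dot-product attention.

    key_mask[j] == 0 hides key j from every query; the query at a masked
    position is still processed and still produces an output row.
    """
    outputs = []
    for q in queries:
        scale = math.sqrt(len(q))
        scores = [
            sum(a * b for a, b in zip(q, k)) / scale if keep else float("-inf")
            for k, keep in zip(keys, key_mask)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # masked keys get weight 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

real = [[1.0, 0.0], [0.0, 1.0]]   # two "real" token vectors (hypothetical values)
pad = [[5.0, 5.0]]                # one padding vector (hypothetical values)

unpadded = attention(real, real, real, key_mask=[1, 1])
padded = attention(real + pad, real + pad, real + pad, key_mask=[1, 1, 0])

print(len(padded))       # one output row per input position, pad included
print(padded[:2])        # rows for the real tokens
print(unpadded)          # should match the two rows above
```

If the real model behaves like this sketch, non-zero hidden states at padded positions would not by themselves show that the mask is ignored. A more telling test would be to compare the hidden states of the real tokens with and without padding.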


The attention mask appears to take effect only in the first layer; it seems to have no effect in any of the later layers.
