Impact of Padding on DNABERT Model Performance

#20
by poilkjhytg - opened

Hi,

I'm working with the DNABERT-2 model and have a question about how padding affects its output. I've tokenized a DNA sequence and then compared the model's output for the original tokenized input against the same input with padding tokens appended.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, BertConfig

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M")
# Build the model from its config (this initializes weights rather than loading the pretrained checkpoint)
dnabert_model = AutoModel.from_config(config)

dna = "CGTGGTTTCCTGTGGTTGGAATT"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Append two pad tokens (id 3) and mask them out in the attention mask
    padded_input = F.pad(inputs, (0, 2), value=3)
    attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0]])
    # Compare the hidden state of the first token with and without padding
    nonpad_hidden_states = dnabert_model(inputs)[0][:, 0]
    pad_hidden_states = dnabert_model(padded_input, attention_mask=attention_mask)[0][:, 0]
    print(nonpad_hidden_states.squeeze()[0:10])
    print(pad_hidden_states.squeeze()[0:10])
The outputs for the non-padded and padded inputs are significantly different. Is padding expected to affect the model's output in this way, or am I misunderstanding how DNABERT-2 handles padded tokens?
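For reference, here is a minimal sketch of how I would instead let the tokenizer produce the padding and attention mask itself, assuming the standard transformers padding API (the padding="max_length" and max_length=10 values are just illustrative, not a DNABERT-2 requirement):

# Let the tokenizer pad and build the attention mask itself,
# so the mask length always matches the padded input_ids.
encoded = tokenizer(
    dna,
    return_tensors="pt",
    padding="max_length",  # pad up to max_length with the tokenizer's pad token
    max_length=10,         # illustrative length only
)

with torch.no_grad():
    out = dnabert_model(
        encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
    )[0][:, 0]
print(out.squeeze()[0:10])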

Can anyone provide insights into whether this behavior is expected and any recommended practices for handling padding with DNABERT?

Thank you!
