About the output of tokenizer and the model

#4 by RandyWang504

# tokenizer and model are assumed to be loaded beforehand
# (e.g. with transformers AutoTokenizer / AutoModel for the checkpoint in use)
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]
hidden_states.shape

The output is [1, 17, 768], but 17 is not the length of the DNA sequence (59 characters).
The tokenizer's output, inputs, also has shape [1, 17].

I think something must have gone wrong in my code.

It uses BPE, so the number of tokens will not equal the number of characters.
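For example, you can decode the token ids back into their surface forms to see this directly: most tokens cover several bases, and the count also includes any special tokens the tokenizer adds (e.g. [CLS]/[SEP]). A minimal sketch, assuming the checkpoint is zhihan1996/DNABERT-2-117M; substitute whichever checkpoint you are actually using:

from transformers import AutoTokenizer

# Assumed checkpoint name; replace with the one from your own setup.
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
ids = tokenizer(dna, return_tensors="pt")["input_ids"]

print(len(dna))    # 59 characters
print(ids.shape)   # far fewer tokens than characters, e.g. torch.Size([1, 17])
# Each id maps back to a multi-base BPE piece, plus special tokens at the ends.
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))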

Thanks so much for your reply. I read the paper again, and you are most probably right! By the way, I suggest that someone edit the model card and correct the note "# [1, sequence_length, 768]", since it may lead to misunderstanding.
