About the output of tokenizer and the model

#4 by RandyWang504

# tokenizer and model are assumed to be loaded beforehand
# (e.g. with transformers AutoTokenizer / AutoModel for the checkpoint in use)
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]
hidden_states.shape

The output is [1, 17, 768], but 17 is not the length of the DNA sequence (59 characters).
The tokenizer's output, inputs, also has shape [1, 17].

I think something must have gone wrong in my code.

It uses BPE, so the number of tokens will not equal the number of characters.
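For example, you can decode the token ids back into their surface forms to see this directly: most tokens cover several bases, and the count also includes any special tokens the tokenizer adds (e.g. [CLS]/[SEP]). A minimal sketch, assuming the checkpoint is zhihan1996/DNABERT-2-117M; substitute whichever checkpoint you are actually using:

from transformers import AutoTokenizer

# Assumed checkpoint name; replace with the one from your own setup.
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
ids = tokenizer(dna, return_tensors="pt")["input_ids"]

print(len(dna))    # 59 characters
print(ids.shape)   # far fewer tokens than characters, e.g. torch.Size([1, 17])
# Each id maps back to a multi-base BPE piece, plus special tokens at the ends.
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))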

Thanks so much for your reply. I read the paper again, and you are most probably right! By the way, I suggest that someone edit the model card and correct the note "# [1, sequence_length, 768]", since it may lead to misunderstanding.
