Why does the model output the embedding for the <s> token?

#3
by macleginn - opened

When asked for hidden states, causal models usually provide embeddings for all tokens in the input sentence. E.g., given the input "one two three", GPT-2 returns a tensor of size [1, 3, 768] for each layer. This model, surprisingly, returns tensors of size [1, 4, 4096], and the extra embedding corresponds to the initial <s> (BOS) token that the tokenizer prepends. Its embedding is therefore always the same:

In [4]: tokenisation = tok("one two three", return_tensors='pt')

In [5]: outputs = model(**tokenisation, output_hidden_states=True).hidden_states

In [6]: len(outputs)
Out[6]: 33

In [7]: outputs[-1].size()
Out[7]: torch.Size([1, 4, 4096])

In [8]: tok.tokenize("one two three")
Out[8]: ['▁one', '▁two', '▁three']

In [9]: tokenisation.input_ids[0]
Out[9]: tensor([   1,  551,  753, 1166])

In [10]: tok.decode(tokenisation.input_ids[0])
Out[10]: '<s>one two three'

In [11]: outputs[-1][0, 0]
Out[11]:
tensor([ 0.0468,  0.2356,  0.5536,  ...,  0.3180, -0.2200,  0.5274],
       grad_fn=<SelectBackward0>)

In [12]: tokenisation = tok("five six seven", return_tensors='pt')

In [13]: outputs = model(**tokenisation, output_hidden_states=True).hidden_states

In [14]: outputs[-1][0, 0]
Out[14]:
tensor([ 0.0468,  0.2356,  0.5536,  ...,  0.3180, -0.2200,  0.5274],
       grad_fn=<SelectBackward0>)
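
The two leading vectors can also be compared directly. A minimal sketch of such a check (it reruns both inputs and keeps the hidden states in separate variables, since `outputs` is overwritten above; `tok` and `model` are the tokenizer and model loaded earlier, and the variable names here are only illustrative):

import torch

hs_a = model(**tok("one two three", return_tensors='pt'),
             output_hidden_states=True).hidden_states
hs_b = model(**tok("five six seven", return_tensors='pt'),
             output_hidden_states=True).hidden_states

# In a causal model the first position attends only to itself, so the <s>
# embedding should not depend on the rest of the input.
print(torch.allclose(hs_a[-1][0, 0], hs_b[-1][0, 0]))  # expected: True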

Was this done by design or is it an API bug?

OpenLM Research org

This is done by design. You can also turn off the BOS token during tokenization if you want.
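
For example, a minimal sketch of both routes, assuming the `tok` and `model` objects from the question (`add_special_tokens=False` is the standard transformers tokenizer argument):

# Option 1: skip the BOS token at tokenization time.
enc = tok("one two three", return_tensors='pt', add_special_tokens=False)
print(tok.decode(enc.input_ids[0]))   # 'one two three' -- no leading <s>
hidden = model(**enc, output_hidden_states=True).hidden_states
print(hidden[-1].size())              # torch.Size([1, 3, 4096]), one vector per real token

# Option 2: keep the BOS token but drop its position from the hidden states.
enc = tok("one two three", return_tensors='pt')
hidden = model(**enc, output_hidden_states=True).hidden_states
no_bos = hidden[-1][:, 1:, :]         # strip position 0, the <s> embedding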

young-geng changed discussion status to closed
