All token embeddings end up identical.

#2 · opened by drw-graphbook

from transformers import AutoModel, AutoTokenizer
model_id = "AIRI-Institute/gena-lm-bert-base-t2t"

tokenizer = AutoTokenizer.from_pretrained(model_id)

seq = 'CACCCAGAGAGAGTAACCAGAATGGATACATTTTGGCCAACATGATTCTAACCCAGTGAGACCCATTTTGGGCTTATG'
tokens = tokenizer.tokenize(seq, add_special_tokens=True)
print('tokens:', tokens)
print('n_tokens:', len(tokens))

model = AutoModel.from_pretrained(model_id)
print(model)
with torch.no_grad():
    output = model(**tokenizer(seq, return_tensors='pt'), output_hidden_states=True, )

print(output.keys())

[Screenshot: output.last_hidden_state printed, showing identical rows for every token]

There's nothing strictly wrong with all the rows ending up with the same values, but it is very odd. Looking across the layers, I see the hidden states gradually converge to the same embedding for every token. Any explanation for that?
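For reference, one rough way to quantify this layer-wise convergence is the mean pairwise cosine similarity between token embeddings at each layer; a value near 1.0 means the rows have effectively collapsed onto a single vector. A minimal sketch, reusing the output from the call above:

import torch.nn.functional as F

# Mean off-diagonal cosine similarity between token embeddings, per layer.
# 1.0 would mean every token embedding is identical.
for i, h in enumerate(output.hidden_states):
    tok = F.normalize(h[0], dim=-1)      # (n_tokens, hidden_size), unit-normalized
    sim = tok @ tok.T                    # pairwise cosine similarities
    n = sim.shape[0]
    mean_sim = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    print(f'layer {i}: mean pairwise cosine similarity = {mean_sim.item():.4f}')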

AIRI - Artificial Intelligence Research Institute org

Hi!

Just tried to run your code (slightly modified to make it executable):

import torch
from transformers import AutoModel, AutoTokenizer
model_id = "AIRI-Institute/gena-lm-bert-base-t2t"

tokenizer = AutoTokenizer.from_pretrained(model_id)

seq = 'CACCCAGAGAGAGTAACCAGAATGGATACATTTTGGCCAACATGATTCTAACCCAGTGAGACCCATTTTGGGCTTATG'
tokens = tokenizer.tokenize(seq, add_special_tokens=True)
print('tokens:', tokens)
print('n_tokens:', len(tokens))

model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
# print(model)
with torch.no_grad():
    output = model(**tokenizer(seq, return_tensors='pt'), output_hidden_states=True, )

print(output.keys())
print(output.hidden_states[-1])

and it gives output:

tokens: ['[CLS]', 'CACCC', 'AGAGAGAG', 'TAACC', 'AGAATGG', 'ATACATT', 'TTGGCC', 'AACATG', 'ATTC', 'TAACCC', 'AGTGAGACCC', 'ATTTTGGGC', 'TTATG', '[SEP]']
n_tokens: 14
odict_keys(['logits', 'hidden_states'])
tensor([[[  1.7935,   9.7426,  -0.2816,  ..., -12.1350,  -4.8205,   2.5903],
         [ -3.1385,  -9.1969,   8.9219,  ...,   6.8791,   3.4876,  -4.8887],
         [ -1.0978,  -5.8843,  13.7623,  ...,  -3.3654,  -4.0092,   1.4867],
         ...,
         [ -0.9357,  -6.7364,  -3.1276,  ...,  -6.8377, -10.0209, -23.6296],
         [  2.4616,   2.8872,   1.8704,  ...,  -2.8146,  -2.5070, -17.5171],
         [  2.5083,  -1.9531,  -4.1511,  ...,   7.3335,   0.5013,  -2.4499]]])

that looks good.

It is strange that you use output.last_hidden_state in your screenshot, since the model outputs "logits" only. "hidden_states" are present in the output only if output_hidden_states is set to True.
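
For illustration, a minimal sketch of pulling the final-layer token embeddings from this model (assuming it is loaded with trust_remote_code=True as above); since the output exposes "logits" rather than last_hidden_state, the embeddings come from hidden_states[-1]:

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "AIRI-Institute/gena-lm-bert-base-t2t"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer('CACCCAGAGAGAGTAACC', return_tensors='pt')

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

print(out.keys())                   # odict_keys(['logits', 'hidden_states'])
embeddings = out.hidden_states[-1]  # final-layer token embeddings, shape (1, n_tokens, hidden_size)
print(embeddings.shape)

with torch.no_grad():
    out = model(**inputs)
print(out.keys())                   # only 'logits' without output_hidden_states=True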

Setting trust_remote_code=True gave me the correct BERT modeling implementation, thank you for the clarification!
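
In case it helps anyone else landing here, a quick way to check which implementation was actually loaded is to inspect the model's class (a sketch; the exact class name depends on the modeling code shipped in the model repo):

from transformers import AutoModel

model_id = "AIRI-Institute/gena-lm-bert-base-t2t"

# With trust_remote_code=True, AutoModel uses the custom modeling code from the
# model repository rather than the stock transformers BERT classes.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
print(type(model).__name__)    # class defined in the repo's modeling file
print(type(model).__module__)  # resolves to the downloaded remote code, not transformers.models.bert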

AIRI - Artificial Intelligence Research Institute org

Glad it helped!

yurakuratov changed discussion status to closed
