All token embeddings end up identical.

#2 · opened by drw-graphbook

from transformers import AutoModel, AutoTokenizer
model_id = "AIRI-Institute/gena-lm-bert-base-t2t"

tokenizer = AutoTokenizer.from_pretrained(model_id)

seq = 'CACCCAGAGAGAGTAACCAGAATGGATACATTTTGGCCAACATGATTCTAACCCAGTGAGACCCATTTTGGGCTTATG'
tokens = tokenizer.tokenize(seq, add_special_tokens=True)
print('tokens:', tokens)
print('n_tokens:', len(tokens))

model = AutoModel.from_pretrained(model_id)
print(model)
with torch.no_grad():
    output = model(**tokenizer(seq, return_tensors='pt'), output_hidden_states=True, )

print(output.keys())

[Screenshot: output.last_hidden_state printed, showing identical rows for every token]

There's nothing strictly wrong with all the rows ending up with the same values, but it is very odd. Looking across the layers, I see the hidden states gradually converge to the same embedding for every token. Any explanation for that?
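For reference, one rough way to quantify this layer-wise convergence is the mean pairwise cosine similarity between token embeddings at each layer; a value near 1.0 means the rows have effectively collapsed onto a single vector. A minimal sketch, reusing the output from the call above:

import torch.nn.functional as F

# Mean off-diagonal cosine similarity between token embeddings, per layer.
# 1.0 would mean every token embedding is identical.
for i, h in enumerate(output.hidden_states):
    tok = F.normalize(h[0], dim=-1)      # (n_tokens, hidden_size), unit-normalized
    sim = tok @ tok.T                    # pairwise cosine similarities
    n = sim.shape[0]
    mean_sim = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    print(f'layer {i}: mean pairwise cosine similarity = {mean_sim.item():.4f}')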

AIRI - Artificial Intelligence Research Institute org

Hi!

Just tried to run your code (slightly modified to make it executable):

import torch
from transformers import AutoModel, AutoTokenizer
model_id = "AIRI-Institute/gena-lm-bert-base-t2t"

tokenizer = AutoTokenizer.from_pretrained(model_id)

seq = 'CACCCAGAGAGAGTAACCAGAATGGATACATTTTGGCCAACATGATTCTAACCCAGTGAGACCCATTTTGGGCTTATG'
tokens = tokenizer.tokenize(seq, add_special_tokens=True)
print('tokens:', tokens)
print('n_tokens:', len(tokens))

model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
# print(model)
with torch.no_grad():
    output = model(**tokenizer(seq, return_tensors='pt'), output_hidden_states=True, )

print(output.keys())
print(output.hidden_states[-1])

and it gives output:

tokens: ['[CLS]', 'CACCC', 'AGAGAGAG', 'TAACC', 'AGAATGG', 'ATACATT', 'TTGGCC', 'AACATG', 'ATTC', 'TAACCC', 'AGTGAGACCC', 'ATTTTGGGC', 'TTATG', '[SEP]']
n_tokens: 14
odict_keys(['logits', 'hidden_states'])
tensor([[[  1.7935,   9.7426,  -0.2816,  ..., -12.1350,  -4.8205,   2.5903],
         [ -3.1385,  -9.1969,   8.9219,  ...,   6.8791,   3.4876,  -4.8887],
         [ -1.0978,  -5.8843,  13.7623,  ...,  -3.3654,  -4.0092,   1.4867],
         ...,
         [ -0.9357,  -6.7364,  -3.1276,  ...,  -6.8377, -10.0209, -23.6296],
         [  2.4616,   2.8872,   1.8704,  ...,  -2.8146,  -2.5070, -17.5171],
         [  2.5083,  -1.9531,  -4.1511,  ...,   7.3335,   0.5013,  -2.4499]]])

that looks good.

It is strange that you use output.last_hidden_state in your screenshot, since the model outputs "logits" only. "hidden_states" are present in the output only if output_hidden_states is set to True.
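
For illustration, a minimal sketch of pulling the final-layer token embeddings from this model (assuming it is loaded with trust_remote_code=True as above); since the output exposes "logits" rather than last_hidden_state, the embeddings come from hidden_states[-1]:

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "AIRI-Institute/gena-lm-bert-base-t2t"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer('CACCCAGAGAGAGTAACC', return_tensors='pt')

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

print(out.keys())                   # odict_keys(['logits', 'hidden_states'])
embeddings = out.hidden_states[-1]  # final-layer token embeddings, shape (1, n_tokens, hidden_size)
print(embeddings.shape)

with torch.no_grad():
    out = model(**inputs)
print(out.keys())                   # only 'logits' without output_hidden_states=True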

Setting trust_remote_code=True gave me the correct BERT modeling implementation, thank you for the clarification!
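
In case it helps anyone else landing here, a quick way to check which implementation was actually loaded is to inspect the model's class (a sketch; the exact class name depends on the modeling code shipped in the model repo):

from transformers import AutoModel

model_id = "AIRI-Institute/gena-lm-bert-base-t2t"

# With trust_remote_code=True, AutoModel uses the custom modeling code from the
# model repository rather than the stock transformers BERT classes.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
print(type(model).__name__)    # class defined in the repo's modeling file
print(type(model).__module__)  # resolves to the downloaded remote code, not transformers.models.bert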

AIRI - Artificial Intelligence Research Institute org

Glad it helped!

yurakuratov changed discussion status to closed
