Using the ProtBERT model to encode protein sequences, but getting identical results for every sequence

#4
by tangpingdadaguai - opened

I use the ProtBERT model to encode proteins, but no matter which protein sequence I encode, the results are identical. When I replace the model with BioBERT, shorter sequences are encoded normally, but the encodings of longer sequences are still all identical. I have been stuck on this for many days, so I am asking for your help.

The code is as follows:

import pandas as pd
from transformers import AutoTokenizer, AutoModel

sequences = ["MKTVRQERLKSIVRILERSKEPVSGAQGQPRGVRGF",
"MSDTKGDPGRH",
"MSRLDKSKVINSALELLNEVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALAIEMLDRHHTHFCPLEGESWQDFLRNNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEDQEHQVAKEERETPTTDSMPPLLRQAIELFDHQGAEPAFLFGLELIICGLEKQLKCESGS",
"LLNGSLAEEIVIRTENIADNTKDIIVQFNKTVSIACTRPHNNTRRGIHIGPGQAFYATGDIIGDIRQAHCNVSGENWTETMEWVKAKLEKTFNVTNITFEPPITGGDLEITTHSFNCRGEFFYCNTSKLFNSSELNSIKGKENYTIILPCRIKQFVRMWQRVGQAMYAPPIEGNITCISNITGLILTRDGGINRTNDTFRPGGGDMRDNWRRKL",
"QELLCAASLISDRWVLTAAHCLLYPPWDKNFTVNDILVRIGKYARSRYERNMEKISTLEKIIIHPGYNWRENLDRDIALMKLKKPVAFSDYIHPVCLPDKQIVTSLLQAGHKGRVTGWGNLKEMWTVNMNEVQPSVLQMVNLPLVERPICKASTGIRVTDNMFCAGYKPEEGKRGDACEGDSGGPFVMKNPYNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIRKMVDRFG",
"MSAPASTTQATGSTTSTTTKTAGATPATASGLFTIPDGDFFSTARAVVASDAVATNEDLSEIEAVWKDMKVPTDTMAQAAWDLVRHCADVGSSAQTEMIDTGPYSNGISRARLAAAIKEVCTLRQFCMKYAPVVWNWMLTNNSPPANWQAQGFKPEHKFAAFDFFNGVTNPAAIMPKEGLIRPPSEAEMNAAQTAAFVKITKARAQSNDFASLDAAVTRGRITGTTTAEAVVTLPPP"
]

protein_sequences = pd.Series(sequences)
window_size = 256
step_size = 250

model_name = r'E:\NLP_model\prot_bert'
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoded_features = []
for protein_sequence in protein_sequences:
    windows = [protein_sequence[i:i + window_size] for i in range(0, len(protein_sequence), step_size)]
    sequence_features = []
    for window in windows:
        inputs = tokenizer(window, return_tensors="pt", padding=True, truncation=True)
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
        sequence_features.extend(embeddings)
    encoded_features.append(sequence_features)

df_features = pd.DataFrame(encoded_features)

print(df_features.shape)
df_features

This figure shows the result of running the ProtBERT model.
1.PNG

This figure shows the result of running the BioBERT model.

2.PNG

Rostlab org

Hi; I would kindly redirect you to our GitHub, where we have multiple examples and tutorials, including a Colab notebook, that show how to get started: https://github.com/agemagician/ProtTrans
In general, I would recommend using ProtT5 rather than ProtBERT.
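For anyone hitting the same symptom: a likely cause (my reading of the code above, not confirmed in this thread) is that the sequences are passed to the tokenizer without spaces between residues. The ProtBERT model card states that input sequences must be space-separated single amino acids, with the rare residues U, Z, O, and B mapped to X; an unspaced string is not split into per-residue tokens, so very different sequences can collapse to near-identical inputs and hence identical embeddings. A minimal preprocessing sketch following the model card (the function name is my own):

```python
import re

def preprocess_for_protbert(sequence):
    """Prepare a raw amino-acid string for the ProtBERT tokenizer.

    ProtBERT's vocabulary is per-residue, so the tokenizer expects
    amino acids separated by single spaces; the rare residues
    U, Z, O, and B are mapped to X, as described in the model card.
    """
    sequence = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(sequence)

print(preprocess_for_protbert("MKTVUZ"))  # -> M K T V X X
```

With this, each window in the loop above would be tokenized as tokenizer(preprocess_for_protbert(window), return_tensors="pt", padding=True, truncation=True) instead of being passed as a raw unspaced string.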

mheinz changed discussion status to closed
