tags:
- antibody language model
- antibody
- protein language model
base_model: Rostlab/prot_t5_xl_uniref50
license: mit
IgT5 unpaired model
Model pretrained on protein and antibody sequences using a masked language modeling (MLM) objective. It was introduced in the paper Large scale paired antibody language models.
The model is finetuned from ProtT5 using unpaired antibody sequences from the Observed Antibody Space.
Use
The encoder part of the model and tokeniser can be loaded using the transformers
library
from transformers import T5EncoderModel, T5Tokenizer
tokeniser = T5Tokenizer.from_pretrained("Exscientia/IgT5_unpaired", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Exscientia/IgT5_unpaired")
The tokeniser is used to prepare batch inputs
# single chain sequences
sequences = [
"EVVMTQSPASLSVSPGERATLSCRARASLGISTDLAWYQQRPGQAPRLLIYGASTRATGIPARFSGSGSGTEFTLTISSLQSEDSAVYYCQQYSNWPLTFGGGTKVEIK",
"ALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL"
]
# The tokeniser expects input of the form ["E V V M...", "A L T Q..."]
sequences = [' '.join(sequence) for sequence in sequences]
tokens = tokeniser.batch_encode_plus(
sequences,
add_special_tokens=True,
pad_to_max_length=True,
return_tensors="pt",
return_special_tokens_mask=True
)
Note that the tokeniser adds a </s>
token at the end of each sequence and pads using the <pad>
token. For example a batch containing sequences E V V M
, A L
will be tokenised to E V V M </s>
and A L </s> <pad> <pad>
.
Sequence embeddings are generated by feeding tokens through the model
output = model(
input_ids=tokens['input_ids'],
attention_mask=tokens['attention_mask']
)
residue_embeddings = output.last_hidden_state
To obtain a sequence representation, the residue tokens can be averaged over like so
import torch
# mask special tokens before summing over embeddings
residue_embeddings[tokens["special_tokens_mask"] == 1] = 0
sequence_embeddings_sum = residue_embeddings.sum(1)
# average embedding by dividing sum by sequence lengths
sequence_lengths = torch.sum(tokens["special_tokens_mask"] == 0, dim=1)
sequence_embeddings = sequence_embeddings_sum / sequence_lengths.unsqueeze(1)