Edit model card

IgT5 unpaired model

Model pretrained on protein and antibody sequences using a masked language modeling (MLM) objective. It was introduced in the paper Large scale paired antibody language models.

The model is finetuned from ProtT5 using unpaired antibody sequences from the Observed Antibody Space.

Use

The encoder part of the model and tokeniser can be loaded using the transformers library

from transformers import T5EncoderModel, T5Tokenizer

tokeniser = T5Tokenizer.from_pretrained("Exscientia/IgT5_unpaired", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Exscientia/IgT5_unpaired")

The tokeniser is used to prepare batch inputs

# single chain sequences
sequences = [
    "EVVMTQSPASLSVSPGERATLSCRARASLGISTDLAWYQQRPGQAPRLLIYGASTRATGIPARFSGSGSGTEFTLTISSLQSEDSAVYYCQQYSNWPLTFGGGTKVEIK",
    "ALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL"
]

# The tokeniser expects input of the form ["E V V M...", "A L T Q..."]
sequences = [' '.join(sequence) for sequence in sequences] 

tokens = tokeniser.batch_encode_plus(
    sequences, 
    add_special_tokens=True, 
    pad_to_max_length=True, 
    return_tensors="pt",
    return_special_tokens_mask=True
) 

Note that the tokeniser adds a </s> token at the end of each sequence and pads using the <pad> token. For example a batch containing sequences E V V M, A L will be tokenised to E V V M </s> and A L </s> <pad> <pad>.

Sequence embeddings are generated by feeding tokens through the model

output = model(
    input_ids=tokens['input_ids'], 
    attention_mask=tokens['attention_mask']
)

residue_embeddings = output.last_hidden_state

To obtain a sequence representation, the residue tokens can be averaged over like so

import torch

# mask special tokens before summing over embeddings
residue_embeddings[tokens["special_tokens_mask"] == 1] = 0
sequence_embeddings_sum = residue_embeddings.sum(1)

# average embedding by dividing sum by sequence lengths
sequence_lengths = torch.sum(tokens["special_tokens_mask"] == 0, dim=1)
sequence_embeddings = sequence_embeddings_sum / sequence_lengths.unsqueeze(1)
Downloads last month
42
Safetensors
Model size
1.21B params
Tensor type
F32
·
Unable to determine this model’s pipeline type. Check the docs .

Finetuned from