Henry Kenlay
committed on
Commit
•
55a6a74
1
Parent(s):
3c187e7
Upload README.md
README.md
ADDED
@@ -0,0 +1,73 @@
---
tags:
- antibody language model
- antibody
base_model: Rostlab/prot_bert_bfd
license: mit
---

# IgBert unpaired model

A model pretrained on protein and antibody sequences using a masked language modeling (MLM) objective. It was introduced in the paper [Large scale paired antibody language models](https://arxiv.org/abs/2403.17889).

The model is fine-tuned from ProtBert-BFD using single-chain antibody sequences from the unpaired Observed Antibody Space (OAS) dataset.

# Use

The model and tokeniser can be loaded using the `transformers` library:

```python
from transformers import BertModel, BertTokenizer

tokeniser = BertTokenizer.from_pretrained("Exscientia/IgBert_unpaired", do_lower_case=False)
model = BertModel.from_pretrained("Exscientia/IgBert_unpaired", add_pooling_layer=False)
```

The tokeniser is used to prepare batch inputs:

```python
# single chain sequences
sequences = [
    "EVVMTQSPASLSVSPGERATLSCRARASLGISTDLAWYQQRPGQAPRLLIYGASTRATGIPARFSGSGSGTEFTLTISSLQSEDSAVYYCQQYSNWPLTFGGGTKVEIK",
    "ALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL"
]

# The tokeniser expects input of the form ["E V V M...", "A L T Q..."]
sequences = [' '.join(sequence) for sequence in sequences]

tokens = tokeniser.batch_encode_plus(
    sequences,
    add_special_tokens=True,
    padding=True,  # pad each sequence to the longest one in the batch
    return_tensors="pt",
    return_special_tokens_mask=True
)
```

Note that the tokeniser adds a `[CLS]` token at the beginning of each sequence and a `[SEP]` token at the end, and pads shorter sequences with the `[PAD]` token. For example, a batch containing the sequences `E V V M` and `A L` is tokenised to `[CLS] E V V M [SEP]` and `[CLS] A L [SEP] [PAD] [PAD]`.
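
The layout described in the note can be mimicked without the library. The sketch below is an illustration only: `layout_batch` is a hypothetical helper, not part of `transformers`, and it assumes padding to the longest sequence in the batch and that padding positions are flagged in the special-tokens mask.

```python
# Illustration only: `layout_batch` is a hypothetical helper, not a transformers API.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def layout_batch(sequences):
    """Return token rows plus attention and special-token masks for a batch."""
    # Wrap each sequence's residues in [CLS] ... [SEP].
    rows = [["[CLS]", *sequence, "[SEP]"] for sequence in sequences]
    # Pad every row to the length of the longest row (assumed padding strategy).
    width = max(len(row) for row in rows)
    token_rows = [row + ["[PAD]"] * (width - len(row)) for row in rows]
    # Attention mask: 1 for real tokens (including [CLS]/[SEP]), 0 for padding.
    attention_mask = [[int(tok != "[PAD]") for tok in row] for row in token_rows]
    # Special-tokens mask: 1 for [CLS], [SEP] and [PAD], 0 for residues.
    special_tokens_mask = [[int(tok in SPECIAL_TOKENS) for tok in row] for row in token_rows]
    return token_rows, attention_mask, special_tokens_mask


token_rows, attention_mask, special_tokens_mask = layout_batch(["EVVM", "AL"])
for row in token_rows:
    print(" ".join(row))
```

Only positions where the special-tokens mask is 0 correspond to residues, which is why that mask, rather than the attention mask, is the natural choice for selecting residue positions.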

Per-residue embeddings are generated by feeding the tokens through the model:

```python
output = model(
    input_ids=tokens['input_ids'],
    attention_mask=tokens['attention_mask']
)

# one embedding vector per token, including special tokens
residue_embeddings = output.last_hidden_state
```

To obtain a sequence representation, the residue embeddings can be averaged, excluding the special tokens:

```python
import torch

# mask special tokens before summing over embeddings
residue_embeddings[tokens["special_tokens_mask"] == 1] = 0
sequence_embeddings_sum = residue_embeddings.sum(1)

# average embedding by dividing sum by sequence lengths
sequence_lengths = torch.sum(tokens["special_tokens_mask"] == 0, dim=1)
sequence_embeddings = sequence_embeddings_sum / sequence_lengths.unsqueeze(1)
```
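
As a sanity check on the averaging step, here is a small worked example on made-up numbers, written in plain Python rather than `torch`. The helper `masked_mean` and the toy values are assumptions for illustration; the special-token rows are filled with 9.0 so that any leakage into the average would be obvious.

```python
# Toy batch: 2 sequences, 4 token positions, embedding size 2 (made-up values).
residue_embeddings = [
    [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # [CLS] r1 r2 [SEP]
    [[9.0, 9.0], [5.0, 6.0], [9.0, 9.0], [9.0, 9.0]],  # [CLS] r1 [SEP] [PAD]
]
special_tokens_mask = [
    [1, 0, 0, 1],
    [1, 0, 1, 1],
]


def masked_mean(embeddings, special_mask):
    """Average the vectors at positions where special_mask is 0 (hypothetical helper)."""
    pooled = []
    for vectors, mask in zip(embeddings, special_mask):
        # Keep only residue positions, dropping [CLS], [SEP] and [PAD].
        kept = [vec for vec, flag in zip(vectors, mask) if flag == 0]
        dim = len(vectors[0])
        # Divide the per-dimension sum by the number of residues.
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled


sequence_embeddings = masked_mean(residue_embeddings, special_tokens_mask)
print(sequence_embeddings)  # [[2.0, 3.0], [5.0, 6.0]]
```

The first sequence averages two residue vectors and the second averages one, so each sum is divided by its own residue count rather than by the padded length.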

For sequence-level fine-tuning, the model can be loaded with a pooling head by setting `add_pooling_layer=True`, and `output.pooler_output` can then be used in the downstream task.
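
One way such a downstream task might consume the pooled output is a small classification head. The following is a minimal sketch under stated assumptions, not the authors' fine-tuning setup: `SequenceClassifier` is a hypothetical module, the hidden size of 1024 and the two labels are assumed values (in practice the size could be read from `model.config.hidden_size`), and a random tensor stands in for `output.pooler_output` so the example runs without downloading the model.

```python
import torch
from torch import nn


class SequenceClassifier(nn.Module):
    """Hypothetical linear head applied to a pooled sequence vector."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled has shape (batch_size, hidden_size); returns (batch_size, num_labels)
        return self.head(pooled)


hidden_size = 1024  # assumed value; check model.config.hidden_size
pooled = torch.randn(2, hidden_size)  # stand-in for output.pooler_output
classifier = SequenceClassifier(hidden_size, num_labels=2)
logits = classifier(pooled)
print(logits.shape)
```

During fine-tuning, the logits would typically be passed to a loss such as `nn.CrossEntropyLoss` together with the task labels.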