---
license: bsd
tags:
- chemistry
- biology
- protein
- antibodies
- antibody
- light chain
- AbLang
- CDR
- OAS
---

### AbLang model for light chains

This is a 🤗 Transformers version of AbLang, a language model for antibodies. It was introduced in
[this paper](https://doi.org/10.1101/2022.01.20.477061) and first released in
[this repository](https://github.com/oxpig/AbLang). The model was trained on uppercase amino acids, so input sequences must use capital single-letter residue codes.
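
Mixed-case input can be normalized before tokenization (a one-line sketch; the fragment below is illustrative):

```python
sequence = "gselTQDPAVsv".upper()  # the model only recognizes uppercase residue codes
```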

### Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks (TBA).

### How to use

Here is how to use this model to get the features of a given antibody sequence in PyTorch:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('qilowoq/AbLang_light')
model = AutoModel.from_pretrained('qilowoq/AbLang_light', trust_remote_code=True)

# the tokenizer expects residues separated by spaces
sequence_Example = ' '.join("GSELTQDPAVSVALGQTVRITCQGDSLRNYYASWYQQKPRQAPVLVFYGKNNRPSGIPDRFSGSSSGNTASLTISGAQAEDEADYYCNSRDSSSNHLVFGGGTKLTVLSQ")
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
model_output = model(**encoded_input)
```
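
The returned `last_hidden_state` holds one vector per token, including the special tokens the tokenizer adds at the start and end of the sequence. For pure feature extraction, the forward pass can be wrapped in `torch.no_grad()` (a small sketch; the printed shape is illustrative):

```python
import torch

with torch.no_grad():
    model_output = model(**encoded_input)

# shape: (batch, tokens, hidden), where tokens = residues + 2 special tokens
print(model_output.last_hidden_state.shape)
```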

Sequence embeddings can be produced as follows:

```python
import torch

def get_sequence_embeddings(encoded_input, model_output):
    mask = encoded_input['attention_mask'].float()
    # map each sequence index to the position of its last attended token (the sep token)
    d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}
    # make sep token invisible to the pooling
    for i in d:
        mask[i, d[i]] = 0
    mask[:, 0] = 0.0  # make cls token invisible as well
    mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
    sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
    sum_mask = torch.clamp(mask.sum(1), min=1e-9)
    return sum_embeddings / sum_mask  # mean-pool over residue positions

seq_embeds = get_sequence_embeddings(encoded_input, model_output)
```
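
As a usage sketch, the pooled embeddings can be compared with cosine similarity. This assumes the tokenizer pads on the right, as the pooling function above expects; the second sequence here is a made-up single-residue variant, not part of the original card:

```python
import torch
import torch.nn.functional as F

seq_a = "GSELTQDPAVSVALGQTVRITCQGDSLRNYYASWYQQKPRQAPVLVFYGKNNRPSGIPDRFSGSSSGNTASLTISGAQAEDEADYYCNSRDSSSNHLVFGGGTKLTVLSQ"
seq_b = seq_a.replace("NYYAS", "NYYAT")  # toy variant for illustration

encoded = tokenizer([' '.join(s) for s in (seq_a, seq_b)],
                    padding=True, return_tensors='pt')
with torch.no_grad():
    out = model(**encoded)

embeds = get_sequence_embeddings(encoded, out)
similarity = F.cosine_similarity(embeds[0:1], embeds[1:2]).item()
```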

### Fine-tune

To save memory we recommend using [LoRA](https://doi.org/10.48550/arXiv.2106.09685):

```bash
pip install git+https://github.com/huggingface/peft.git
pip install loralib
```

LoRA greatly reduces the number of trainable parameters and performs on par with, or better than, fine-tuning the full model.

```python
import torch
from peft import LoraConfig, get_peft_model

def apply_lora_bert(model):
    config = LoraConfig(
        r=8, lora_alpha=32,
        lora_dropout=0.3,
        target_modules=['query', 'value']
    )
    for param in model.parameters():
        param.requires_grad = False  # freeze the base model; only adapters will train
        if param.ndim == 1:
            # cast small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)
    model.gradient_checkpointing_enable()  # reduce number of stored activations
    model.enable_input_require_grads()
    model = get_peft_model(model, config)
    return model

model = apply_lora_bert(model)

model.print_trainable_parameters()
# trainable params: 294912 || all params: 85493760 || trainable%: 0.3449514911965505
```
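
Once the adapters are attached, the model can be trained with a regular PyTorch loop. Below is a minimal sketch for a downstream classification task; the linear head, label, and learning rate are hypothetical placeholders, and `encoded_input` and `get_sequence_embeddings` are reused from above:

```python
import torch
import torch.nn as nn

head = nn.Linear(model.config.hidden_size, 2)  # hypothetical 2-class head
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable + list(head.parameters()), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

labels = torch.tensor([0])  # made-up label for the single example sequence
model.train()
out = model(**encoded_input)
embeds = get_sequence_embeddings(encoded_input, out)
loss = loss_fn(head(embeds), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```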

### Citation
```
@article{Olsen2022,
  title={AbLang: An antibody language model for completing antibody sequences},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2022.01.20.477061},
  year={2022}
}
```