Update README.md
README.md

tags:
- OAS
---

### AbLang model for light chains

This is a huggingface version of AbLang: A language model for antibodies. It was introduced in
[this paper](https://doi.org/10.1101/2022.01.20.477061) and first released in
[this repository](https://github.com/oxpig/AbLang). The model is trained on uppercase amino acids, so it only works with capital-letter amino acid codes.

### Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks (TBA).

### How to use

Here is how to use this model to get the features of a given antibody sequence in PyTorch:

```bash
pip install ablang
```
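
A minimal sketch of loading the model and running a sequence through it, assuming the checkpoint loads with the standard `transformers` Auto classes; the repo id and the example sequence below are placeholders, while the `model_output = model(**encoded_input)` call and the CLS-token embedding mirror the original README:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "ORG/AbLang_light"  # placeholder: use the actual Hub repo id of this model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)  # trust_remote_code may be unnecessary

# illustrative light-chain fragment; any uppercase one-letter amino-acid sequence works
seq = "DIQMTQSPSSLSASVGDRVTITC"

encoded_input = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)

seq_embs = model_output.last_hidden_state[:, 0, :]  # CLS-token embedding, as in the original README
```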

Sequence embeddings can be produced as follows:

```python
def get_sequence_embeddings(encoded_input, model_output):
    mask = encoded_input['attention_mask'].float()
    d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}  # dict of sep tokens
    # make sep token invisible
    for i in d:
        mask[i, d[i]] = 0
    mask[:, 0] = 0.0  # make cls token invisible
    mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
    sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
    sum_mask = torch.clamp(mask.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

seq_embeds = get_sequence_embeddings(encoded_input, model_output)
```
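
In effect `get_sequence_embeddings` is mean pooling over residue positions only: the loop zeroes the sep-token position in each row, the first column removes the cls token, and the remaining positions are averaged. As a small illustrative follow-up (not part of the original README), two sequences can be embedded in one batch and compared, assuming the `tokenizer` and `model` objects from the loading sketch above:

```python
import torch
import torch.nn.functional as F

# illustrative light-chain fragments in uppercase one-letter codes
seqs = ["DIQMTQSPSSLSASVGDRVTITC", "EIVLTQSPGTLSLSPGERATLSC"]
batch = tokenizer(seqs, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**batch)

embs = get_sequence_embeddings(batch, out)
print(F.cosine_similarity(embs[0:1], embs[1:2]).item())
```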

### Fine-tune

To save memory we recommend using [LoRA](https://doi.org/10.48550/arXiv.2106.09685):

```bash
pip install git+https://github.com/huggingface/peft.git
pip install loralib
```

LoRA greatly reduces the number of trainable parameters and performs on par with, or better than, fine-tuning the full model.

```python
from peft import LoraConfig, get_peft_model

def apply_lora_bert(model):
    config = LoraConfig(
        r=8, lora_alpha=32,
        lora_dropout=0.3,
        target_modules=['query', 'value']
    )
    for param in model.parameters():
        param.requires_grad = False  # freeze the model - train adapters later
        if param.ndim == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)
    model.gradient_checkpointing_enable()  # reduce number of stored activations
    model.enable_input_require_grads()
    model = get_peft_model(model, config)
    return model

model = apply_lora_bert(model)

model.print_trainable_parameters()
# trainable params: 294912 || all params: 85493760 || trainable%: 0.3449514911965505
```
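
To go from the adapted model to an actual fine-tune, here is a minimal sketch of a downstream training loop on top of the LoRA-wrapped model, assuming a hypothetical classification task; `train_loader`, the linear head, the hidden size of 768, and the hyperparameters are illustrative, not from the original README:

```python
import torch
from torch.optim import AdamW

head = torch.nn.Linear(768, 2)  # 768 = assumed hidden size of the base model
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = AdamW(
    list(head.parameters()) + [p for p in model.parameters() if p.requires_grad],
    lr=1e-4,
)

model.train()
for encoded_batch, labels in train_loader:  # assumed to yield (tokenized batch, label tensor) pairs
    optimizer.zero_grad()
    out = model(**encoded_batch)
    pooled = get_sequence_embeddings(encoded_batch, out)  # mean-pooled embeddings from above
    loss = loss_fn(head(pooled), labels)
    loss.backward()
    optimizer.step()
```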

### Citation

```
doi={https://doi.org/10.1101/2022.01.20.477061},
year={2022}
}
```