---
license: mit
datasets:
- COGNANO/VHHCorpus-2M
library_name: transformers
tags:
- biology
- protein
- antibody
- VHH
---

## VHHBERT

VHHBERT is a RoBERTa-based model pre-trained on two million VHH sequences in [VHHCorpus-2M](https://huggingface.co/datasets/COGNANO/VHHCorpus-2M).
VHHBERT has the same model configuration as RoBERTa<sub>BASE</sub>, except that it uses positional embeddings with a length of 185 to cover the maximum sequence length of 179 in VHHCorpus-2M.
Further details on VHHBERT are described in our paper "[A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models](https://arxiv.org/abs/2405.18749)".

## Usage

The model and tokenizer can be loaded using the `transformers` library.

```python
from transformers import BertTokenizer, RobertaModel
tokenizer = BertTokenizer.from_pretrained("COGNANO/VHHBERT")
model = RobertaModel.from_pretrained("COGNANO/VHHBERT")
```
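As a rough illustration of how the loaded model might be applied to a single sequence, the sketch below extracts per-residue hidden states. The helper names `prepare_sequence` and `embed` are hypothetical and not part of this repository. The space-separated residue format is an assumption based on how `BertTokenizer` is commonly used with protein sequences; check the linked code repository for the exact preprocessing used in pre-training.

```python
MAX_VHH_LENGTH = 179  # maximum sequence length in VHHCorpus-2M, as stated above


def prepare_sequence(seq: str) -> str:
    """Normalize a VHH amino acid sequence for the tokenizer.

    Assumption: the tokenizer expects one residue per whitespace-separated token.
    """
    residues = "".join(seq.split()).upper()
    if len(residues) > MAX_VHH_LENGTH:
        raise ValueError(
            f"sequence has {len(residues)} residues; "
            f"the pre-training corpus maximum is {MAX_VHH_LENGTH}"
        )
    return " ".join(residues)


def embed(seq: str, model_name: str = "COGNANO/VHHBERT"):
    """Return the last hidden states for one sequence (hypothetical helper)."""
    # Imported lazily so that prepare_sequence can be used without these packages.
    import torch
    from transformers import BertTokenizer, RobertaModel

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = RobertaModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(prepare_sequence(seq), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, number of tokens including special tokens, hidden size)
    return outputs.last_hidden_state
```

Calling `embed` with a VHH sequence string downloads the weights on first use and returns one vector per token. Averaging over the token dimension is one simple way to obtain a fixed-size sequence representation, although other pooling strategies may suit a given downstream task better.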

## Links

- Pre-training Corpus: https://huggingface.co/datasets/COGNANO/VHHCorpus-2M
- Code: https://github.com/cognano/AVIDa-SARS-CoV-2
- Paper: https://arxiv.org/abs/2405.18749

## Citation

If you use VHHBERT in your research, please cite the following paper.

```bibtex
@inproceedings{tsuruta2024sars,
  title={A {SARS}-{C}o{V}-2 Interaction Dataset and {VHH} Sequence Corpus for Antibody Language Models},
  author={Hirofumi Tsuruta and Hiroyuki Yamazaki and Ryota Maeda and Ryotaro Tamura and Akihiro Imura},
  booktitle={Advances in Neural Information Processing Systems 37},
  year={2024}
}
```