---
license: mit
datasets:
- COGNANO/VHHCorpus-2M
library_name: transformers
tags:
- biology
- protein
- antibody
- VHH
---

## VHHBERT

VHHBERT is a RoBERTa-based model pre-trained on two million VHH sequences in [VHHCorpus-2M](https://huggingface.co/datasets/COGNANO/VHHCorpus-2M).
VHHBERT has the same model parameters as RoBERTa<sub>BASE</sub>, except that it uses positional embeddings with a length of 185 to cover the maximum sequence length of 179 in VHHCorpus-2M.
Further details on VHHBERT are described in our paper "[A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models](https://arxiv.org/abs/2405.18749)."
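
The extended positional embeddings can be checked directly from the released configuration. The snippet below is a minimal sketch using the generic `AutoConfig` API; the stored value may not be exactly 185 if position indices are offset for special tokens, as is common in RoBERTa-style models.

```python
from transformers import AutoConfig

# Read the VHHBERT configuration from the Hugging Face Hub
config = AutoConfig.from_pretrained("COGNANO/VHHBERT")

# Size of the positional embedding matrix, sized to cover the maximum
# sequence length of 179 in VHHCorpus-2M (may include a special-token offset)
print(config.max_position_embeddings)
```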
## Usage

The model and tokenizer can be loaded using the `transformers` library.

```python
from transformers import BertTokenizer, RobertaModel

# Load the pre-trained tokenizer and encoder from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained("COGNANO/VHHBERT")
model = RobertaModel.from_pretrained("COGNANO/VHHBERT")
```
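
The loaded model can then be used as a feature extractor for VHH sequences. The snippet below is a minimal sketch: the amino acid sequence is only an illustrative placeholder, and whether residues should be passed as a single string or separated by spaces depends on the released tokenizer vocabulary.

```python
import torch

# Illustrative VHH-like amino acid sequence, used only as a placeholder
sequence = "QVQLVESGGGLVQAGGSLRLSCAASGRTFSSYAMG"

# Tokenize and run a forward pass without gradient tracking
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token representations; mean-pool them into a single sequence-level embedding
token_embeddings = outputs.last_hidden_state
sequence_embedding = token_embeddings.mean(dim=1)
print(sequence_embedding.shape)  # (1, hidden_size)
```

Such pooled embeddings can serve as input features for downstream tasks, for example antigen-binding prediction.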
## Links

- Pre-training Corpus: https://huggingface.co/datasets/COGNANO/VHHCorpus-2M
- Code: https://github.com/cognano/AVIDa-SARS-CoV-2
- Paper: https://arxiv.org/abs/2405.18749

## Citation

If you use VHHBERT in your research, please cite the following paper.

```bibtex
@inproceedings{tsuruta2024sars,
  title={A {SARS}-{C}o{V}-2 Interaction Dataset and {VHH} Sequence Corpus for Antibody Language Models},
  author={Hirofumi Tsuruta and Hiroyuki Yamazaki and Ryota Maeda and Ryotaro Tamura and Akihiro Imura},
  booktitle={Advances in Neural Information Processing Systems 37},
  year={2024}
}
```