---
library_name: transformers
tags:
- AT
- masked-language-modeling
- protein-annotations
license: mit
---

# AT (Annotation Transformer, MLM-pretrained, preset `at_base`)
A BERT-style masked-annotation model over the 88k-term Annotation Vocabulary
(Hallee et al., 2024). Pretrained on the lhallee/AV_large dataset.
## Training
| Setting | Value |
|---|---|
| Preset | at_base |
| Dataset | lhallee/AV_large |
| Sequence length | 192 (fixed; static shapes for torch.compile) |
| Mask probability | 0.15 |
| Batch size | 1024 |
| Steps | 100000 |
| Optimizer | AdamW(lr=0.0003, betas=(0.9, 0.98), wd=0.01) |
| Schedule | linear warmup over 2000 steps -> cosine decay to 0.1 * lr |
| Precision | bf16 |
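The masking objective above (15% of annotation tokens hidden, fixed length 192) can be sketched as follows. `MASK_ID`, `IGNORE_INDEX`, and the token-id range are illustrative placeholders, not values from this repo; the real ids come from the AT tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical special-token ids; the real values come from the AT tokenizer.
MASK_ID, IGNORE_INDEX, MASK_PROB = 3, -100, 0.15

# A batch of annotation-token ids at the fixed training length of 192.
input_ids = rng.integers(10, 1000, size=(2, 192))
labels = input_ids.copy()

# Select ~15% of positions as prediction targets and mask them out;
# the loss is then computed only where labels != IGNORE_INDEX.
masked = rng.random(input_ids.shape) < MASK_PROB
labels[~masked] = IGNORE_INDEX
input_ids[masked] = MASK_ID
```

Because the sequence length is fixed at 192, every batch has the same static shape, which is what makes `torch.compile` effective here.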
## Final validation metrics
- loss: 0.6910
- perplexity: 1.9958
- top1_acc: 0.9106
- top5_acc: 0.9536
- top10_acc: 0.9576
- top25_acc: 0.9612
- macro_precision: 0.7541
- macro_recall: 0.7669
- macro_f1: 0.7557
- mcc: 0.9103
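As a quick sanity check on the numbers above, the reported perplexity is simply the exponential of the mean validation cross-entropy loss:

```python
import math

# perplexity = exp(mean cross-entropy loss over masked positions)
loss = 0.6910
perplexity = math.exp(loss)
print(round(perplexity, 4))  # 1.9957, matching the reported 1.9958 up to rounding
```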
## How to use
```python
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")

# input_ids / attention_mask: (batch, seq_len) tensors of annotation tokens
pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)
```
The downstream consumer in this repo is the vec2vec translator (sets 7/8): annotations are pre-embedded with this AT (kept frozen) and then mapped to and from a PLM embedding space using the same paired-batch contrastive recipe as sets 1-3.
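The frozen pre-embedding step can be sketched as below. The embedding table, linear translator, and dimensions are illustrative stand-ins only: the real AT is loaded via `from_pretrained` as shown above, and the real translator is the vec2vec model from sets 7/8, not a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: a frozen "AT" embedding table and a linear "translator".
vocab, hidden, plm_dim = 1000, 16, 32
at_emb = rng.normal(size=(vocab, hidden))   # frozen: never updated downstream
W = rng.normal(size=(hidden, plm_dim))      # trained to map AT space -> PLM space

# Pre-embed a batch of annotation sets (fixed length 192, as in training).
input_ids = rng.integers(0, vocab, size=(4, 192))
attention_mask = np.ones((4, 192))

x = at_emb[input_ids]                                # (4, 192, 16)
mask = attention_mask[..., None]
pooled = (x * mask).sum(axis=1) / mask.sum(axis=1)   # (4, 16) pooled AT embedding
plm_space = pooled @ W                               # (4, 32) mapped to PLM space
```

Freezing the annotation encoder means the translator is the only trainable component, so the AT embeddings can be computed once and cached.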
## References
- Hallee et al., 2024. Annotation Vocabulary. bioRxiv 2024.07.30.605924.
- Jha et al., 2025. vec2vec. arXiv:2505.12540.