contacts-v1 tokenizer
Tokenizer for the contacts-v1 protein-document format produced by MarinFold, used by the contacts-v1-5x subset of timodonnell/protein-docs.
This is a mirror, kept for ease of loading with AutoTokenizer.from_pretrained("timodonnell/contacts-v1-tokenizer"). The canonical location is co-located with the data at open-athena/MarinFold:data/document_structures/contacts_v1/tokenizer/ (new MarinFold convention: the tokenizer lives in the bucket next to its data).
load_dataset("timodonnell/protein-docs", "contacts-v1-5x") documents tokenize directly with this tokenizer — no extra mapping needed.
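As a rough sketch of that workflow, the helper below encodes the first few records of a dataset with a tokenizer. The column name `"document"` and the `split="train"` shown in the usage comment are assumptions that this card does not confirm, so check the dataset's actual schema first. The tokenizer and records are passed in as arguments, so the helper itself does no network access.

```python
from typing import Any, Iterable, Mapping


def encode_first_documents(
    tok: Any,
    records: Iterable[Mapping[str, str]],
    text_column: str = "document",  # hypothetical column name
    limit: int = 3,
) -> list[list[int]]:
    """Encode up to `limit` records, reading text from `text_column`."""
    encoded = []
    for index, record in enumerate(records):
        if index >= limit:
            break
        # No extra mapping step: the raw document string goes straight in.
        encoded.append(tok.encode(record[text_column], add_special_tokens=False))
    return encoded


# Possible usage (requires network access; split name is an assumption):
#   from datasets import load_dataset
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("timodonnell/contacts-v1-tokenizer")
#   ds = load_dataset("timodonnell/protein-docs", "contacts-v1-5x", split="train")
#   ids = encode_first_documents(tok, ds)
```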
Vocab
2,845 tokens total (2,843 domain tokens + <pad> + <eos>). The vocab is the union of contacts-v1's own 5 native tokens and the entire contacts-and-distances-v1 vocab — so a contacts-v1 model can be fine-tuned on contacts-and-distances-v1 documents (and vice versa) without swapping tokenizers.
- 5 contacts-v1 native tokens: <contacts-v1>, <n-term>, <c-term>, <contact>, <think>.
- contacts-and-distances-v1 vocab (reused by emission): position tokens <p0>…<p2700>, section markers <begin_sequence> / <begin_statements>, amino-acid tokens <ALA>…<VAL>, <UNK>, <end>, plus the rest of contacts-and-distances-v1's tokens (modes, distance bins, atoms, pLDDT bins), which contacts-v1 documents do not emit but the tokenizer can encode.
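The counts and the "union" claim above can be sanity-checked with a small sketch. It only uses the tokens this section names explicitly, and it assumes that <p0>…<p2700> is an inclusive, gap-free range. `missing_tokens` is a hypothetical helper written for illustration, not part of MarinFold.

```python
from typing import Iterable, Mapping

# Tokens named explicitly in the list above.
NATIVE_TOKENS = ["<contacts-v1>", "<n-term>", "<c-term>", "<contact>", "<think>"]
# Assumption: <p0>…<p2700> is an inclusive range with no gaps.
POSITION_TOKENS = [f"<p{i}>" for i in range(2701)]
SPECIAL_TOKENS = ["<pad>", "<eos>"]

DOMAIN_TOKEN_COUNT = 2843  # stated in the Vocab section
TOTAL_TOKEN_COUNT = DOMAIN_TOKEN_COUNT + len(SPECIAL_TOKENS)


def missing_tokens(vocab: Mapping[str, int], required: Iterable[str]) -> list[str]:
    """Return the required tokens that are absent from vocab, in input order."""
    return [token for token in required if token not in vocab]


print(len(POSITION_TOKENS))  # 2701
print(TOTAL_TOKEN_COUNT)     # 2845
```

With the real tokenizer loaded, `missing_tokens(tok.get_vocab(), NATIVE_TOKENS + POSITION_TOKENS + SPECIAL_TOKENS)` would be expected to return an empty list if these assumptions hold.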
Usage
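from transformers import AutoTokenizer

# Load the mirrored tokenizer from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("timodonnell/contacts-v1-tokenizer")
# Encode a document fragment as-is, without adding special tokens.
ids = tok.encode("<contacts-v1> <begin_sequence> <p20> <MET>", add_special_tokens=False)
print(ids)
# Decode the ids back to text.
print(tok.decode(ids))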
Provenance
Built by marinfold v0.1.0 at git rev dd025aab (uv run contacts-v1 tokenizer --push …), with the full vocab in marinfold/.../document_structures/contacts_v1/vocab.py. Created as part of MarinFold issue #53.