contacts-v1 tokenizer

Tokenizer for the contacts-v1 protein-document format produced by MarinFold, used by the contacts-v1-5x subset of timodonnell/protein-docs.

This repo is a mirror, so that AutoTokenizer.from_pretrained("timodonnell/contacts-v1-tokenizer") works directly. The canonical copy is co-located with the data at open-athena/MarinFold:data/document_structures/contacts_v1/tokenizer/ (the new MarinFold convention is that a tokenizer lives in the bucket next to its data).

Documents loaded with load_dataset("timodonnell/protein-docs", "contacts-v1-5x") tokenize directly with this tokenizer; no extra mapping is needed.

Vocab

2,845 tokens total (2,843 domain tokens + <pad> + <eos>). The vocab is the union of contacts-v1's own 5 native tokens and the entire contacts-and-distances-v1 vocab — so a contacts-v1 model can be fine-tuned on contacts-and-distances-v1 documents (and vice versa) without swapping tokenizers.

  • 5 contacts-v1 native tokens: <contacts-v1>, <n-term>, <c-term>, <contact>, <think>.
  • contacts-and-distances-v1 vocab: contacts-v1 documents reuse its position tokens <p0>…<p2700>, section markers <begin_sequence> / <begin_statements>, amino-acid tokens <ALA>…<VAL>, <UNK>, and <end>. The remaining contacts-and-distances-v1 tokens (modes, distance bins, atoms, pLDDT bins) never appear in contacts-v1 documents, but the tokenizer can still encode them.
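
The union idea can be illustrated with a toy sketch. The token lists below are small hand-picked subsets of the names listed above, and the ID assignment order, the helper names build_union_vocab and encode, and the whitespace splitting are all assumptions for illustration only. The real vocab and IDs are defined in vocab.py.

```python
# Toy illustration of a union vocab. The token subsets and the ID order are
# assumptions for illustration, not the real contacts-v1 vocab.
CONTACTS_V1_NATIVE = ["<contacts-v1>", "<n-term>", "<c-term>", "<contact>", "<think>"]

# A tiny, hypothetical slice of the contacts-and-distances-v1 vocab.
CONTACTS_AND_DISTANCES_SUBSET = (
    ["<begin_sequence>", "<begin_statements>", "<end>", "<UNK>", "<ALA>", "<MET>", "<VAL>"]
    + [f"<p{i}>" for i in range(3)]  # stands in for <p0>…<p2700>
)

SPECIAL = ["<pad>", "<eos>"]


def build_union_vocab(*token_lists):
    """Assign one ID per distinct token, in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            # setdefault leaves an existing ID untouched, so shared tokens
            # keep a single ID no matter how many lists contain them.
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(vocab, document):
    """Map a whitespace-separated document to IDs; KeyError on unknown tokens."""
    return [vocab[piece] for piece in document.split()]


vocab = build_union_vocab(SPECIAL, CONTACTS_AND_DISTANCES_SUBSET, CONTACTS_V1_NATIVE)

# One vocab covers a document that mixes tokens from both groups.
ids = encode(vocab, "<contacts-v1> <begin_sequence> <p2> <MET>")
print(len(vocab), ids)
```

Because every token has exactly one ID in the shared vocab, a model trained on one document format sees the same IDs when it is later fine-tuned on the other.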

Usage

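from transformers import AutoTokenizer

# Load the tokenizer from the mirror repo.
tok = AutoTokenizer.from_pretrained("timodonnell/contacts-v1-tokenizer")

# Encode only the tokens given, without adding special tokens.
ids = tok.encode("<contacts-v1> <begin_sequence> <p20> <MET>", add_special_tokens=False)
print(ids)

# Decode the IDs back to text to check the round trip.
print(tok.decode(ids))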
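
Before encoding a large batch, it can help to confirm that a document contains only tokens present in the vocab. The following is a minimal sketch, assuming documents are whitespace-separated token strings as in the example above. The helper out_of_vocab is hypothetical, and the toy dict stands in for the token-to-ID mapping returned by tok.get_vocab().

```python
def out_of_vocab(vocab, document):
    """Return whitespace-separated pieces of `document` missing from `vocab`.

    `vocab` is a token -> id mapping, e.g. the dict returned by
    `tok.get_vocab()` on a Hugging Face tokenizer.
    """
    return [piece for piece in document.split() if piece not in vocab]


# Toy stand-in for tok.get_vocab(); the IDs here are made up.
toy_vocab = {"<contacts-v1>": 0, "<begin_sequence>": 1, "<p20>": 2, "<MET>": 3}

missing = out_of_vocab(toy_vocab, "<contacts-v1> <begin_sequence> <p20> <MET> <oops>")
print(missing)
```

With the real tokenizer, passing tok.get_vocab() as the first argument would run the same check against the full vocab.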

Provenance

Built by marinfold v0.1.0 at git rev dd025aab (uv run contacts-v1 tokenizer --push …), with the full vocab in marinfold/.../document_structures/contacts_v1/vocab.py. Created as part of MarinFold issue #53.
