willdampier committed on
Commit 769b5d9
Parent: 4aa1946

adding trainer, readme, and tokenizer

Files changed (6)
  1. .gitignore +2 -0
  2. README.md +98 -0
  3. special_tokens_map.json +1 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +1 -0
  6. trainer.py +132 -0
.gitignore ADDED
@@ -0,0 +1,2 @@
+ trainer
+ .ipynb_checkpoints
README.md CHANGED
@@ -1,3 +1,101 @@
  ---
  license: mit
+
+ datasets:
+ - damlab/uniprot
+ metrics:
+ - accuracy
+
+ widget:
+ - text: 'involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'
+   example_title: 'Function'
+
  ---
+
+ # GO-Language model
+
+ ## Table of Contents
+ - [Summary](#summary)
+ - [Model Description](#model-description)
+ - [Intended Uses & Limitations](#intended-uses--limitations)
+ - [How to Use](#how-to-use)
+ - [Training Data](#training-data)
+ - [Training Procedure](#training-procedure)
+   - [Preprocessing](#preprocessing)
+   - [Training](#training)
+ - [Evaluation Results](#evaluation-results)
+ - [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)
+
+ ## Summary
+
+ This model was built as a way to encode the Gene Ontology definition of a protein as a vector representation.
+ It was trained on a collection of gene-ontology terms from model organisms.
+ Each function was sorted by its ID number and combined with its annotation descriptor (e.g. `is_a`, `enables`, `located_in`).
+ The model is tokenized such that each descriptor and GO term is its own token.
+ This is intended to be used as a translation model between PROT-BERT and GO-Language.
+ That type of translation model will be useful for predicting the function of novel genes.
+
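+ As a rough illustration (not part of the training code; the `annotations` list below is a made-up example), a GO-language description can be assembled by sorting a protein's annotations by GO ID and joining each descriptor/term pair:
+
+ ```python
+ # Hypothetical (descriptor, GO ID) pairs for a single protein.
+ annotations = [
+     ("located_in", "GO:0042470"),
+     ("involved_in", "GO:0007165"),
+     ("involved_in", "GO:0070372"),
+     ("involved_in", "GO:0006468"),
+ ]
+
+ # Sort by the numeric portion of the GO ID, then flatten to a whitespace-separated string.
+ annotations.sort(key=lambda pair: int(pair[1].split(":")[1]))
+ go_sentence = " ".join(f"{descriptor} {term}" for descriptor, term in annotations)
+
+ print(go_sentence)
+ # involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372
+ ```
+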
+ ## Model Description
+
+ This model was trained using the damlab/uniprot dataset on the `go` field with 256-token chunks and a 15% mask rate.
+
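+ The `go` field can be inspected straight from the dataset; a minimal sketch (the filter mirrors the one in `trainer.py` below):
+
+ ```python
+ from datasets import load_dataset
+
+ # Keep only records that carry a GO annotation string.
+ go_uni = load_dataset("damlab/uniprot")["train"].filter(lambda x: x["go"] is not None)
+ print(go_uni[0]["go"])  # a whitespace-separated string of descriptor / GO-term pairs
+ ```
+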
+ ## Intended Uses & Limitations
+
+ This model is a useful encapsulation of gene ontology functions.
+ It allows both an exploration of gene-level similarities and comparisons between functional terms.
+
+ ## How to use
+
+ As this is a BERT-style masked-language model, it can be used to determine the most likely token at a masked position.
+
+ ```python
+ from transformers import pipeline
+
+ unmasker = pipeline("fill-mask", model="damlab/GO-language")
+
+ unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
+
+ [{'score': 0.1040298342704773,
+   'token': 103,
+   'token_str': 'GO:0002250',
+   'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
+  {'score': 0.018045395612716675,
+   'token': 21,
+   'token_str': 'GO:0005576',
+   'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
+  {'score': 0.015035462565720081,
+   'token': 50,
+   'token_str': 'GO:0000139',
+   'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
+  {'score': 0.01181247178465128,
+   'token': 37,
+   'token_str': 'GO:0007165',
+   'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
+  {'score': 0.01000668853521347,
+   'token': 14,
+   'token_str': 'GO:0005737',
+   'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}
+ ]
+ ```
+
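+ The same checkpoint can also be used to embed an entire GO-language description as a vector, which supports the similarity comparisons mentioned above. A minimal sketch (mean-pooling the token embeddings is one reasonable choice, not a prescribed part of this model):
+
+ ```python
+ import numpy as np
+ from transformers import pipeline
+
+ extractor = pipeline("feature-extraction", model="damlab/GO-language")
+
+ features = extractor(
+     "involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372"
+ )
+
+ # features is [batch][tokens][hidden]; mean-pool over tokens for a single vector.
+ vector = np.asarray(features)[0].mean(axis=0)
+ print(vector.shape)  # hidden size is 1024 in the training script below
+ ```
+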
+ ## Training Data
+
+ The model was trained on the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset, starting from a randomly initialized model.
+ The Gene Ontology functions were sorted (by ID number) along with their annotating terms.
+
+ ## Training Procedure
+
+ ### Preprocessing
+
+ All strings were concatenated and chunked into 256-token chunks for training. A random 20% of the chunks were held out for validation.
+
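+ A minimal sketch of that preprocessing (the 256-token limit and the 20% hold-out follow the description above; the `go` column name comes from `trainer.py` below):
+
+ ```python
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
+ go_uni = load_dataset("damlab/uniprot")["train"].filter(lambda x: x["go"] is not None)
+
+ def tokenize(examples):
+     # Each GO-language string becomes a chunk of at most 256 tokens.
+     return tokenizer(examples["go"], max_length=256, truncation=True)
+
+ tokenized = go_uni.map(tokenize, batched=True, remove_columns=go_uni.column_names)
+
+ # Hold out a random 20% of the chunks for validation.
+ split = tokenized.train_test_split(test_size=0.2, seed=1234)
+ ```
+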
+ ### Training
+
+ Training was performed with the HuggingFace training module using the masked-language-modeling data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a cosine_with_restarts learning-rate schedule, and training continued until 3 consecutive epochs did not improve the loss on the held-out dataset.
+
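+ A hedged sketch of how that schedule maps onto the `Trainer` API (`tokenizer` and `split` are assumed to come from the preprocessing sketch above and `model_init` from `trainer.py` below; the 1e-5 rate and the 3-epoch patience are read from the description above rather than from the committed script):
+
+ ```python
+ from transformers import (
+     DataCollatorForLanguageModeling,
+     EarlyStoppingCallback,
+     Trainer,
+     TrainingArguments,
+ )
+
+ collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
+
+ args = TrainingArguments(
+     output_dir="trainer",
+     learning_rate=1e-5,                        # "E-5" in the description above
+     warmup_steps=50_000,
+     lr_scheduler_type="cosine_with_restarts",
+     evaluation_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,               # required for early stopping
+     metric_for_best_model="eval_loss",
+     num_train_epochs=100,
+ )
+
+ trainer = Trainer(
+     model_init=model_init,                     # fresh BertForMaskedLM, as in trainer.py
+     args=args,
+     train_dataset=split["train"],
+     eval_dataset=split["test"],
+     data_collator=collator,
+     callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
+ )
+ trainer.train()
+ ```
+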
+ ## BibTeX Entry and Citation Info
+
+ [More Information Needed]
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "pad_token": "[PAD]", "cls_token": "[CLS]", "sep_token": "[SEP]", "mask_token": "[MASK]", "tokenizer_class": "PreTrainedTokenizerFast"}
trainer.py ADDED
@@ -0,0 +1,132 @@
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import datasets
+ import os
+
+ from tokenizers import Tokenizer
+ from tokenizers.models import WordLevel
+ from tokenizers.pre_tokenizers import WhitespaceSplit
+ from tokenizers.processors import TemplateProcessing
+ from tokenizers.trainers import WordLevelTrainer
+ from tokenizers.decoders import WordPiece
+
+ from transformers import PreTrainedTokenizerFast
+ from transformers import BertConfig, BertForMaskedLM, BertModel, BertForPreTraining
+ from transformers import (
+     AutoModelForMaskedLM,
+     AutoTokenizer,
+     DataCollatorForLanguageModeling,
+     EarlyStoppingCallback,
+     Trainer,
+     TrainingArguments,
+ )
+
+ os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+ os.environ["WANDB_DISABLED"] = "true"
+
+ NUM_TRAIN_EPOCHS = 100
+
+ # Keep only UniProt records that carry a GO annotation string.
+ go_uni = datasets.load_dataset("damlab/uniprot")["train"].filter(
+     lambda x: x["go"] is not None
+ )
+
+
+ # Word-level tokenizer: each GO term and relation descriptor is a single token.
+ tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
+ tokenizer.pre_tokenizer = WhitespaceSplit()
+
+ trainer = WordLevelTrainer(
+     special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "[BOS]", "[EOS]"]
+ )
+ tokenizer.train_from_iterator(go_uni["go"], trainer=trainer)
+
+ cls_token_id = tokenizer.token_to_id("[CLS]")
+ sep_token_id = tokenizer.token_to_id("[SEP]")
+ print(cls_token_id, sep_token_id)
+
+ # Wrap single sequences as [CLS] A [SEP] and pairs as [CLS] A [SEP] B [SEP].
+ tokenizer.post_processor = TemplateProcessing(
+     single="[CLS]:0 $A:0 [SEP]:0",
+     pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
+     special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
+ )
+
+ tokenizer.decoder = WordPiece(prefix="##")
+
+ wrapped_tokenizer = PreTrainedTokenizerFast(
+     tokenizer_object=tokenizer,
+     # tokenizer_file="tokenizer.json",  # You can load from the tokenizer file, alternatively
+     unk_token="[UNK]",
+     pad_token="[PAD]",
+     cls_token="[CLS]",
+     sep_token="[SEP]",
+     mask_token="[MASK]",
+ )
+
+ wrapped_tokenizer.save_pretrained("./")
+
+
+ def tkn_func(examples):
+     # Truncate each GO-language string to at most 256 tokens.
+     return wrapped_tokenizer(examples["go"], max_length=256, truncation=True)
+
+
+ tokenized_dataset = go_uni.map(
+     tkn_func, batched=True, remove_columns=go_uni.column_names
+ )
+ split_dataset = tokenized_dataset.train_test_split(seed=1234)
+
+
+ # Dynamic masking at a 15% rate for the masked-language-modeling objective.
+ data_collator = DataCollatorForLanguageModeling(
+     tokenizer=wrapped_tokenizer, mlm_probability=0.15, pad_to_multiple_of=8,
+ )
+
+ training_args = TrainingArguments(
+     "trainer",
+     evaluation_strategy="steps",
+     load_best_model_at_end=False,
+     save_strategy="no",
+     logging_first_step=True,
+     logging_steps=10,
+     eval_steps=10,
+     num_train_epochs=NUM_TRAIN_EPOCHS,
+     warmup_steps=10,
+     weight_decay=0.01,
+     per_device_train_batch_size=24,
+     per_device_eval_batch_size=24,
+     gradient_accumulation_steps=96,
+     lr_scheduler_type="cosine_with_restarts",
+ )
+
+
+ # BERT encoder configuration sized to the GO-language vocabulary.
+ encoder_bert = BertConfig(
+     vocab_size=tokenizer.get_vocab_size(),
+     hidden_size=1024,
+     num_hidden_layers=12,
+     num_attention_heads=32,
+     intermediate_size=3072,
+     hidden_act="gelu",
+     hidden_dropout_prob=0.1,
+     attention_probs_dropout_prob=0.1,
+     max_position_embeddings=256,
+     type_vocab_size=2,
+     initializer_range=0.02,
+     layer_norm_eps=1e-12,
+     pad_token_id=0,
+     position_embedding_type="absolute",
+ )
+
+
+ def model_init():
+     # A freshly (randomly) initialized masked-LM model.
+     return BertForMaskedLM(encoder_bert)
+
+
+ trainer = Trainer(
+     model_init=model_init,
+     args=training_args,
+     train_dataset=split_dataset["train"],
+     eval_dataset=split_dataset["test"],
+     data_collator=data_collator,
+ )
+
+ results = trainer.train()
+ trainer.save_model("./")