willdampier committed on
Commit 769b5d9
Parent: 4aa1946

adding trainer, readme, and tokenizer

Files changed (6)
  1. .gitignore +2 -0
  2. README.md +98 -0
  3. special_tokens_map.json +1 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +1 -0
  6. trainer.py +132 -0
.gitignore ADDED
@@ -0,0 +1,2 @@
+ trainer
+ .ipynb_checkpoints
README.md CHANGED
@@ -1,3 +1,101 @@
  ---
  license: mit
+
+ datasets:
+ - damlab/uniprot
+ metrics:
+ - accuracy
+
+ widget:
+ - text: 'involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'
+   example_title: 'Function'
+
  ---
+
+ # GO-Language model
+
+ ## Table of Contents
+ - [Summary](#summary)
+ - [Model Description](#model-description)
+ - [Intended Uses & Limitations](#intended-uses--limitations)
+ - [How to Use](#how-to-use)
+ - [Training Data](#training-data)
+ - [Training Procedure](#training-procedure)
+   - [Preprocessing](#preprocessing)
+   - [Training](#training)
+ - [Evaluation Results](#evaluation-results)
+ - [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)
+
+ ## Summary
+
+ This model was built as a way to encode the Gene Ontology definition of a protein as a vector representation.
+ It was trained on a collection of gene-ontology terms from model organisms.
+ Each function was sorted by its ID number and combined with its annotation descriptor (e.g. `is_a`, `enables`, `located_in`).
+ The model is tokenized such that each descriptor and GO term is its own token.
+ This is intended to be used as a translation model between PROT-BERT and GO-Language.
+ That type of translation model will be useful for predicting the function of novel genes.
+
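+ As a rough illustration (not part of the training code; the `annotations` list below is a made-up example), a GO-language description can be assembled by sorting a protein's annotations by GO ID and joining each descriptor/term pair:
+
+ ```python
+ # Hypothetical (descriptor, GO ID) pairs for a single protein.
+ annotations = [
+     ("located_in", "GO:0042470"),
+     ("involved_in", "GO:0007165"),
+     ("involved_in", "GO:0070372"),
+     ("involved_in", "GO:0006468"),
+ ]
+
+ # Sort by the numeric portion of the GO ID, then flatten to a whitespace-separated string.
+ annotations.sort(key=lambda pair: int(pair[1].split(":")[1]))
+ go_sentence = " ".join(f"{descriptor} {term}" for descriptor, term in annotations)
+
+ print(go_sentence)
+ # involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372
+ ```
+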
+ ## Model Description
+
+ This model was trained using the damlab/uniprot dataset on the `go` field with 256-token chunks and a 15% mask rate.
+
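+ The `go` field can be inspected straight from the dataset; a minimal sketch (the filter mirrors the one in `trainer.py` below):
+
+ ```python
+ from datasets import load_dataset
+
+ # Keep only records that carry a GO annotation string.
+ go_uni = load_dataset("damlab/uniprot")["train"].filter(lambda x: x["go"] is not None)
+ print(go_uni[0]["go"])  # a whitespace-separated string of descriptor / GO-term pairs
+ ```
+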
+ ## Intended Uses & Limitations
+
+ This model is a useful encapsulation of gene ontology functions.
+ It allows both an exploration of gene-level similarities and comparisons between functional terms.
+
+ ## How to use
+
+ As this is a BERT-style masked-language model, it can be used to determine the most likely token at a masked position.
+
+ ```python
+ from transformers import pipeline
+
+ unmasker = pipeline("fill-mask", model="damlab/GO-language")
+
+ unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
+
+ [{'score': 0.1040298342704773,
+   'token': 103,
+   'token_str': 'GO:0002250',
+   'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
+  {'score': 0.018045395612716675,
+   'token': 21,
+   'token_str': 'GO:0005576',
+   'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
+  {'score': 0.015035462565720081,
+   'token': 50,
+   'token_str': 'GO:0000139',
+   'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
+  {'score': 0.01181247178465128,
+   'token': 37,
+   'token_str': 'GO:0007165',
+   'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
+  {'score': 0.01000668853521347,
+   'token': 14,
+   'token_str': 'GO:0005737',
+   'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}
+ ]
+ ```
+
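+ The same checkpoint can also be used to embed an entire GO-language description as a vector, which supports the similarity comparisons mentioned above. A minimal sketch (mean-pooling the token embeddings is one reasonable choice, not a prescribed part of this model):
+
+ ```python
+ import numpy as np
+ from transformers import pipeline
+
+ extractor = pipeline("feature-extraction", model="damlab/GO-language")
+
+ features = extractor(
+     "involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372"
+ )
+
+ # features is [batch][tokens][hidden]; mean-pool over tokens for a single vector.
+ vector = np.asarray(features)[0].mean(axis=0)
+ print(vector.shape)  # hidden size is 1024 in the training script below
+ ```
+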
+ ## Training Data
+
+ The model was trained on the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset, starting from a randomly initialized model.
+ The Gene Ontology functions were sorted (by ID number) along with their annotating terms.
+
+ ## Training Procedure
+
+ ### Preprocessing
+
+ All strings were concatenated and chunked into 256-token chunks for training. A random 20% of the chunks were held out for validation.
+
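+ A minimal sketch of that preprocessing (the 256-token limit and the 20% hold-out follow the description above; the `go` column name comes from `trainer.py` below):
+
+ ```python
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
+ go_uni = load_dataset("damlab/uniprot")["train"].filter(lambda x: x["go"] is not None)
+
+ def tokenize(examples):
+     # Each GO-language string becomes a chunk of at most 256 tokens.
+     return tokenizer(examples["go"], max_length=256, truncation=True)
+
+ tokenized = go_uni.map(tokenize, batched=True, remove_columns=go_uni.column_names)
+
+ # Hold out a random 20% of the chunks for validation.
+ split = tokenized.train_test_split(test_size=0.2, seed=1234)
+ ```
+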
+ ### Training
+
+ Training was performed with the HuggingFace training module using the masked-language-modeling data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a cosine_with_restarts learning-rate schedule, and training continued until 3 consecutive epochs did not improve the loss on the held-out dataset.
+
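+ A hedged sketch of how that schedule maps onto the `Trainer` API (`tokenizer` and `split` are assumed to come from the preprocessing sketch above and `model_init` from `trainer.py` below; the 1e-5 rate and the 3-epoch patience are read from the description above rather than from the committed script):
+
+ ```python
+ from transformers import (
+     DataCollatorForLanguageModeling,
+     EarlyStoppingCallback,
+     Trainer,
+     TrainingArguments,
+ )
+
+ collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
+
+ args = TrainingArguments(
+     output_dir="trainer",
+     learning_rate=1e-5,                        # "E-5" in the description above
+     warmup_steps=50_000,
+     lr_scheduler_type="cosine_with_restarts",
+     evaluation_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,               # required for early stopping
+     metric_for_best_model="eval_loss",
+     num_train_epochs=100,
+ )
+
+ trainer = Trainer(
+     model_init=model_init,                     # fresh BertForMaskedLM, as in trainer.py
+     args=args,
+     train_dataset=split["train"],
+     eval_dataset=split["test"],
+     data_collator=collator,
+     callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
+ )
+ trainer.train()
+ ```
+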
+ ## BibTeX Entry and Citation Info
+
+ [More Information Needed]
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "pad_token": "[PAD]", "cls_token": "[CLS]", "sep_token": "[SEP]", "mask_token": "[MASK]", "tokenizer_class": "PreTrainedTokenizerFast"}
trainer.py ADDED
@@ -0,0 +1,132 @@
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import datasets
+ import os
+
+ from tokenizers import Tokenizer
+ from tokenizers.models import WordLevel
+ from tokenizers.pre_tokenizers import WhitespaceSplit
+ from tokenizers.processors import TemplateProcessing
+ from tokenizers.trainers import WordLevelTrainer
+ from tokenizers.decoders import WordPiece
+
+ from transformers import PreTrainedTokenizerFast
+ from transformers import BertConfig, BertForMaskedLM, BertModel, BertForPreTraining
+ from transformers import (
+     AutoModelForMaskedLM,
+     AutoTokenizer,
+     DataCollatorForLanguageModeling,
+     EarlyStoppingCallback,
+     Trainer,
+     TrainingArguments,
+ )
+
+ os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+ os.environ["WANDB_DISABLED"] = "true"
+
+ NUM_TRAIN_EPOCHS = 100
+
+ # Keep only UniProt records that carry a GO annotation string.
+ go_uni = datasets.load_dataset("damlab/uniprot")["train"].filter(
+     lambda x: x["go"] is not None
+ )
+
+
+ # Word-level tokenizer: each GO term and relation descriptor is a single token.
+ tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
+ tokenizer.pre_tokenizer = WhitespaceSplit()
+
+ trainer = WordLevelTrainer(
+     special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "[BOS]", "[EOS]"]
+ )
+ tokenizer.train_from_iterator(go_uni["go"], trainer=trainer)
+
+ cls_token_id = tokenizer.token_to_id("[CLS]")
+ sep_token_id = tokenizer.token_to_id("[SEP]")
+ print(cls_token_id, sep_token_id)
+
+ # Wrap single sequences as [CLS] A [SEP] and pairs as [CLS] A [SEP] B [SEP].
+ tokenizer.post_processor = TemplateProcessing(
+     single="[CLS]:0 $A:0 [SEP]:0",
+     pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
+     special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
+ )
+
+ tokenizer.decoder = WordPiece(prefix="##")
+
+ wrapped_tokenizer = PreTrainedTokenizerFast(
+     tokenizer_object=tokenizer,
+     # tokenizer_file="tokenizer.json",  # You can load from the tokenizer file, alternatively
+     unk_token="[UNK]",
+     pad_token="[PAD]",
+     cls_token="[CLS]",
+     sep_token="[SEP]",
+     mask_token="[MASK]",
+ )
+
+ wrapped_tokenizer.save_pretrained("./")
+
+
+ def tkn_func(examples):
+     # Truncate each GO-language string to at most 256 tokens.
+     return wrapped_tokenizer(examples["go"], max_length=256, truncation=True)
+
+
+ tokenized_dataset = go_uni.map(
+     tkn_func, batched=True, remove_columns=go_uni.column_names
+ )
+ split_dataset = tokenized_dataset.train_test_split(seed=1234)
+
+
+ # Dynamic masking at a 15% rate for the masked-language-modeling objective.
+ data_collator = DataCollatorForLanguageModeling(
+     tokenizer=wrapped_tokenizer, mlm_probability=0.15, pad_to_multiple_of=8,
+ )
+
+ training_args = TrainingArguments(
+     "trainer",
+     evaluation_strategy="steps",
+     load_best_model_at_end=False,
+     save_strategy="no",
+     logging_first_step=True,
+     logging_steps=10,
+     eval_steps=10,
+     num_train_epochs=NUM_TRAIN_EPOCHS,
+     warmup_steps=10,
+     weight_decay=0.01,
+     per_device_train_batch_size=24,
+     per_device_eval_batch_size=24,
+     gradient_accumulation_steps=96,
+     lr_scheduler_type="cosine_with_restarts",
+ )
+
+
+ # BERT encoder configuration sized to the GO-language vocabulary.
+ encoder_bert = BertConfig(
+     vocab_size=tokenizer.get_vocab_size(),
+     hidden_size=1024,
+     num_hidden_layers=12,
+     num_attention_heads=32,
+     intermediate_size=3072,
+     hidden_act="gelu",
+     hidden_dropout_prob=0.1,
+     attention_probs_dropout_prob=0.1,
+     max_position_embeddings=256,
+     type_vocab_size=2,
+     initializer_range=0.02,
+     layer_norm_eps=1e-12,
+     pad_token_id=0,
+     position_embedding_type="absolute",
+ )
+
+
+ def model_init():
+     # A freshly (randomly) initialized masked-LM model.
+     return BertForMaskedLM(encoder_bert)
+
+
+ trainer = Trainer(
+     model_init=model_init,
+     args=training_args,
+     train_dataset=split_dataset["train"],
+     eval_dataset=split_dataset["test"],
+     data_collator=data_collator,
+ )
+
+ results = trainer.train()
+ trainer.save_model("./")