willdampier
committed on
Commit
·
769b5d9
1
Parent(s):
4aa1946
adding trainer, readme, and tokenizer
- .gitignore +2 -0
- README.md +98 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- trainer.py +132 -0
.gitignore
ADDED
@@ -0,0 +1,2 @@
1 |
+
trainer
|
2 |
+
.ipynb_checkpoints
|
README.md
CHANGED
@@ -1,3 +1,101 @@
|
|
1 |
---
|
2 |
license: mit
3 |
---
1 |
---
|
2 |
license: mit
|
3 |
+
|
4 |
+
datasets:
|
5 |
+
- damlab/uniprot
|
6 |
+
metrics:
|
7 |
+
- accuracy
|
8 |
+
|
9 |
+
widget:
|
10 |
+
- text: 'involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'
|
11 |
+
example_title: 'Function'
|
12 |
+
|
13 |
---
|
14 |
+
|
15 |
+
# GO-Language model
|
16 |
+
|
17 |
+
## Table of Contents
|
18 |
+
- [Summary](#summary)
|
19 |
+
- [Model Description](#model-description)
|
20 |
+
- [Intended Uses & Limitations](#intended-uses-&-limitations)
|
21 |
+
- [How to Use](#how-to-use)
|
22 |
+
- [Training Data](#training-data)
|
23 |
+
- [Training Procedure](#training-procedure)
|
24 |
+
- [Preprocessing](#preprocessing)
|
25 |
+
- [Training](#training)
|
26 |
+
- [Evaluation Results](#evaluation-results)
|
27 |
+
- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)
|
28 |
+
|
29 |
+
## Summary
|
30 |
+
|
31 |
+
This model was built to encode the Gene Ontology definition of a protein as a vector representation.
|
32 |
+
It was trained on a collection of gene-ontology terms from model organisms.
|
33 |
+
The functions were sorted by GO ID number, and each was combined with its annotation description (e.g., `is_a`, `enables`, `located_in`).
|
34 |
+
The text is tokenized so that each description and each GO term is its own token.
|
35 |
+
This is intended to be used as a translation model between PROT-BERT and GO-Language.
|
36 |
+
That type of translation model will be useful for predicting the function of novel genes.
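
The sorting and pairing described above can be sketched in plain Python. This is only an illustration, not the preprocessing that produced damlab/uniprot: the function name `build_go_sentence` and the `(description, GO ID)` input shape are assumptions made for the example.

```python
def build_go_sentence(annotations):
    """Join (description, GO ID) pairs into one whitespace-delimited string.

    Pairs are ordered by the numeric part of the GO ID so that the same set
    of annotations always yields the same string.
    """
    ordered = sorted(annotations, key=lambda pair: int(pair[1].split(":")[1]))
    return " ".join(f"{description} {go_id}" for description, go_id in ordered)


# Hypothetical input, deliberately out of order.
annotations = [
    ("located_in", "GO:0042470"),
    ("involved_in", "GO:0007165"),
    ("involved_in", "GO:0070372"),
    ("involved_in", "GO:0006468"),
]
sentence = build_go_sentence(annotations)
tokens = sentence.split()  # whitespace split: one token per description or GO term
print(sentence)
print(len(tokens))
```

Splitting on whitespace is enough here because neither the descriptions nor the GO IDs contain spaces.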
|
37 |
+
|
38 |
+
## Model Description
|
39 |
+
|
40 |
+
This model was trained on the `go` field of the damlab/uniprot dataset, using 256-token chunks and a 15% mask rate.
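
As a rough picture of what a 15% mask rate means, the sketch below applies BERT-style masking to a list of tokens. The 80/10/10 replacement rule is the standard BERT recipe; the function name `mask_tokens`, the toy vocabulary, and the seed are assumptions for illustration, and a real data collator would also skip special tokens such as `[CLS]` and `[SEP]`.

```python
import random


def mask_tokens(tokens, vocab, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) using BERT-style masking.

    Each position is selected with probability mask_rate. A selected token is
    replaced by mask_token 80% of the time, by a random vocabulary entry 10%
    of the time, and left unchanged 10% of the time. Labels hold the original
    token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            labels.append(token)  # the model is asked to recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(mask_token)
            elif roll < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(token)
        else:
            labels.append(None)  # position does not contribute to the loss
            masked.append(token)
    return masked, labels


tokens = "involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470".split()
vocab = sorted(set(tokens))  # toy vocabulary built from the example itself
masked, labels = mask_tokens(tokens, vocab)
print(masked)
```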
|
41 |
+
|
42 |
+
|
43 |
+
## Intended Uses & Limitations
|
44 |
+
|
45 |
+
This model is a useful encapsulation of gene ontology functions.
|
46 |
+
It allows both an exploration of gene-level similarities as well as comparisons between functional terms.
|
47 |
+
|
48 |
+
## How to Use
|
49 |
+
|
50 |
+
As this is a BERT-style masked language model, it can be used to determine the most likely token for a masked position.
|
51 |
+
|
52 |
+
```python
|
53 |
+
from transformers import pipeline
|
54 |
+
|
55 |
+
unmasker = pipeline("fill-mask", model="damlab/GO-language")
|
56 |
+
|
57 |
+
unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
|
58 |
+
|
59 |
+
[{'score': 0.1040298342704773,
|
60 |
+
'token': 103,
|
61 |
+
'token_str': 'GO:0002250',
|
62 |
+
'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
|
63 |
+
{'score': 0.018045395612716675,
|
64 |
+
'token': 21,
|
65 |
+
'token_str': 'GO:0005576',
|
66 |
+
'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
|
67 |
+
{'score': 0.015035462565720081,
|
68 |
+
'token': 50,
|
69 |
+
'token_str': 'GO:0000139',
|
70 |
+
'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
|
71 |
+
{'score': 0.01181247178465128,
|
72 |
+
'token': 37,
|
73 |
+
'token_str': 'GO:0007165',
|
74 |
+
'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
|
75 |
+
{'score': 0.01000668853521347,
|
76 |
+
'token': 14,
|
77 |
+
'token_str': 'GO:0005737',
|
78 |
+
'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}
|
79 |
+
]
|
80 |
+
|
81 |
+
```
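
In the example output above, one of the suggestions (`GO:0007165`) already appears in the prompt. A small post-processing step can drop such repeats. The helper below is a hypothetical convenience, not part of the `transformers` pipeline; it only assumes each prediction is a dict with a `token_str` key, as in the output shown.

```python
def novel_predictions(prompt, predictions):
    """Keep only predictions whose token does not already occur in the prompt."""
    seen = set(prompt.split())
    return [p for p in predictions if p["token_str"] not in seen]


prompt = "involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372"

# Scores and terms copied from the example output above (other keys omitted).
predictions = [
    {"score": 0.1040298342704773, "token_str": "GO:0002250"},
    {"score": 0.018045395612716675, "token_str": "GO:0005576"},
    {"score": 0.015035462565720081, "token_str": "GO:0000139"},
    {"score": 0.01181247178465128, "token_str": "GO:0007165"},
    {"score": 0.01000668853521347, "token_str": "GO:0005737"},
]

kept = novel_predictions(prompt, predictions)
print([p["token_str"] for p in kept])
```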
|
82 |
+
|
83 |
+
## Training Data
|
84 |
+
|
85 |
+
The model was trained from random initialization on the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset.
|
86 |
+
The Gene Ontology functions were sorted (by ID number) along with their annotating terms.
|
87 |
+
|
88 |
+
## Training Procedure
|
89 |
+
|
90 |
+
### Preprocessing
|
91 |
+
|
92 |
+
All strings were concatenated and split into 256-token chunks for training. A random 20% of the chunks were held out for validation.
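
The chunk-and-hold-out step can be sketched as follows. This is a stdlib-only illustration under stated assumptions: the helper names `chunk_tokens` and `split_chunks`, the stand-in token IDs, and the seed are made up for the example and are not taken from the training pipeline.

```python
import random


def chunk_tokens(token_ids, chunk_size=256):
    """Split one long token sequence into consecutive chunks of at most chunk_size."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def split_chunks(chunks, validation_fraction=0.2, seed=1234):
    """Shuffle the chunks and hold out a fraction of them for validation."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


token_ids = list(range(2560))  # stand-in for the concatenated token IDs
chunks = chunk_tokens(token_ids)
train_chunks, validation_chunks = split_chunks(chunks)
print(len(chunks), len(train_chunks), len(validation_chunks))
```

Splitting at the chunk level, rather than the token level, keeps every validation chunk entirely unseen during training.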
|
93 |
+
|
94 |
+
### Training
|
95 |
+
|
96 |
+
Training was performed with the HuggingFace training module using the MaskedLM data loader with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a `cosine_with_restarts` learning rate schedule. Training continued until 3 consecutive epochs did not improve the loss on the held-out dataset.
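
The warm-up plus `cosine_with_restarts` schedule can be pictured as a multiplier applied to the base learning rate. The sketch below follows the usual hard-restart cosine formula (linear warm-up, then a cosine decay that restarts `num_cycles` times); the function name `lr_multiplier`, the total step count, and the `num_cycles` values are illustrative assumptions, not figures from the training run.

```python
import math


def lr_multiplier(step, warmup_steps, total_steps, num_cycles=1):
    """Factor in [0, 1] by which the base learning rate is scaled at `step`."""
    if step < warmup_steps:
        # Linear warm-up from 0 to 1.
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Cosine decay that jumps back to 1 at the start of each cycle.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))


WARMUP_STEPS = 50_000   # "50K warm-up steps" from the text
TOTAL_STEPS = 200_000   # hypothetical total, chosen only for the example

for step in (0, 25_000, 50_000, 125_000, 200_000):
    print(step, round(lr_multiplier(step, WARMUP_STEPS, TOTAL_STEPS), 3))
```

With `num_cycles=2`, the multiplier would return to 1 halfway through the post-warm-up phase instead of continuing to decay.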
|
97 |
+
|
98 |
+
|
99 |
+
## BibTeX Entry and Citation Info
|
100 |
+
|
101 |
+
[More Information Needed]
|
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
1 |
+
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
|
tokenizer.json
ADDED
The diff for this file is too large to render.
See raw diff
|
|
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
1 |
+
{"unk_token": "[UNK]", "pad_token": "[PAD]", "cls_token": "[CLS]", "sep_token": "[SEP]", "mask_token": "[MASK]", "tokenizer_class": "PreTrainedTokenizerFast"}
|
trainer.py
ADDED
@@ -0,0 +1,132 @@
1 |
+
import pandas as pd
|
2 |
+
import numpy as np
|
3 |
+
import matplotlib.pyplot as plt
|
4 |
+
import seaborn as sns
|
5 |
+
import datasets
|
6 |
+
import os
|
7 |
+
|
8 |
+
from tokenizers import Tokenizer
|
9 |
+
from tokenizers.models import WordLevel
|
10 |
+
from tokenizers.pre_tokenizers import WhitespaceSplit
|
11 |
+
from tokenizers.processors import TemplateProcessing
|
12 |
+
from tokenizers.trainers import WordLevelTrainer
|
13 |
+
from tokenizers.decoders import WordPiece
|
14 |
+
|
15 |
+
from transformers import PreTrainedTokenizerFast
|
16 |
+
from transformers import BertConfig, BertForMaskedLM, BertModel, BertForPreTraining
|
17 |
+
from transformers import (
|
18 |
+
AutoModelForMaskedLM,
|
19 |
+
AutoTokenizer,
|
20 |
+
DataCollatorForLanguageModeling,
|
21 |
+
EarlyStoppingCallback,
|
22 |
+
Trainer,
|
23 |
+
TrainingArguments,
|
24 |
+
)
|
25 |
+
|
26 |
+
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
|
27 |
+
os.environ["WANDB_DISABLED"] = "true"
|
28 |
+
|
29 |
+
NUM_TRAIN_EPOCHS = 100
|
30 |
+
|
31 |
+
go_uni = datasets.load_dataset("damlab/uniprot")["train"].filter(
|
32 |
+
lambda x: x["go"] is not None
|
33 |
+
)
|
34 |
+
|
35 |
+
|
36 |
+
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"),)
|
37 |
+
tokenizer.pre_tokenizer = WhitespaceSplit()
|
38 |
+
|
39 |
+
trainer = WordLevelTrainer(
|
40 |
+
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "[BOS]", "[EOS]"]
|
41 |
+
)
|
42 |
+
tokenizer.train_from_iterator(go_uni["go"], trainer=trainer)
|
43 |
+
|
44 |
+
cls_token_id = tokenizer.token_to_id("[CLS]")
|
45 |
+
sep_token_id = tokenizer.token_to_id("[SEP]")
|
46 |
+
print(cls_token_id, sep_token_id)
|
47 |
+
|
48 |
+
tokenizer.post_processor = TemplateProcessing(
|
49 |
+
single=f"[CLS]:0 $A:0 [SEP]:0",
|
50 |
+
pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
|
51 |
+
special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
|
52 |
+
)
|
53 |
+
|
54 |
+
tokenizer.decoder = WordPiece(prefix="##")
|
55 |
+
|
56 |
+
wrapped_tokenizer = PreTrainedTokenizerFast(
|
57 |
+
tokenizer_object=tokenizer,
|
58 |
+
# tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
|
59 |
+
unk_token="[UNK]",
|
60 |
+
pad_token="[PAD]",
|
61 |
+
cls_token="[CLS]",
|
62 |
+
sep_token="[SEP]",
|
63 |
+
mask_token="[MASK]",
|
64 |
+
)
|
65 |
+
|
66 |
+
wrapped_tokenizer.save_pretrained("./")
|
67 |
+
|
68 |
+
|
69 |
+
def tkn_func(examples):
|
70 |
+
    return wrapped_tokenizer(examples["go"], max_length=256, truncation=True)
|
71 |
+
|
72 |
+
|
73 |
+
tokenized_dataset = go_uni.map(
|
74 |
+
tkn_func, batched=True, remove_columns=go_uni.column_names
|
75 |
+
)
|
76 |
+
split_dataset = tokenized_dataset.train_test_split(seed=1234)
|
77 |
+
|
78 |
+
|
79 |
+
data_collator = DataCollatorForLanguageModeling(
|
80 |
+
tokenizer=wrapped_tokenizer, mlm_probability=0.15, pad_to_multiple_of=8,
|
81 |
+
)
|
82 |
+
|
83 |
+
training_args = TrainingArguments(
|
84 |
+
"trainer",
|
85 |
+
evaluation_strategy="steps",
|
86 |
+
load_best_model_at_end=False,
|
87 |
+
save_strategy="no",
|
88 |
+
logging_first_step=True,
|
89 |
+
logging_steps=10,
|
90 |
+
eval_steps=10,
|
91 |
+
num_train_epochs=NUM_TRAIN_EPOCHS,
|
92 |
+
warmup_steps=10,
|
93 |
+
weight_decay=0.01,
|
94 |
+
per_device_train_batch_size=24,
|
95 |
+
per_device_eval_batch_size=24,
|
96 |
+
gradient_accumulation_steps=96,
|
97 |
+
lr_scheduler_type="cosine_with_restarts",
|
98 |
+
)
|
99 |
+
|
100 |
+
|
101 |
+
encoder_bert = BertConfig(
|
102 |
+
vocab_size=tokenizer.get_vocab_size(),
|
103 |
+
hidden_size=1024,
|
104 |
+
num_hidden_layers=12,
|
105 |
+
num_attention_heads=32,
|
106 |
+
intermediate_size=3072,
|
107 |
+
hidden_act="gelu",
|
108 |
+
hidden_dropout_prob=0.1,
|
109 |
+
attention_probs_dropout_prob=0.1,
|
110 |
+
max_position_embeddings=256,
|
111 |
+
type_vocab_size=2,
|
112 |
+
initializer_range=0.02,
|
113 |
+
layer_norm_eps=1e-12,
|
114 |
+
pad_token_id=wrapped_tokenizer.pad_token_id,  # ID 0 is [UNK] with this tokenizer, not [PAD]
|
115 |
+
position_embedding_type="absolute",
|
116 |
+
)
|
117 |
+
|
118 |
+
|
119 |
+
def model_init():
|
120 |
+
    return BertForMaskedLM(encoder_bert)
|
121 |
+
|
122 |
+
|
123 |
+
trainer = Trainer(
|
124 |
+
model_init=model_init,
|
125 |
+
args=training_args,
|
126 |
+
train_dataset=split_dataset["train"],
|
127 |
+
eval_dataset=split_dataset["test"],
|
128 |
+
data_collator=data_collator,
|
129 |
+
)
|
130 |
+
|
131 |
+
results = trainer.train()
|
132 |
+
trainer.save_model("./")
|