---
license: mit
datasets:
- damlab/uniprot
metrics:
- accuracy
widget:
- text: >-
    involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470
    involved_in GO:0070372
  example_title: Function
---
# GO-Language model

## Table of Contents
- Summary
- Model Description
- Intended Uses & Limitations
- How to Use
- Training Data
- Training Procedure
- Evaluation Results
- BibTeX Entry and Citation Info
## Summary

This model was built as a way to encode the Gene Ontology definition of a protein as a vector representation.
It was trained on a collection of Gene Ontology terms from model organisms.
Each function was sorted by ID number and combined with its annotation description (i.e., is_a, enables, located_in, etc.).
The model is tokenized such that each description and GO term is its own token.
This model is intended to be used as a translation model between PROT-BERT and GO-Language; such a translation model would be useful for predicting the function of novel genes.
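To make the input format concrete, the sketch below shows one way a GO-Language string could be assembled from (relationship, GO term) pairs. The `annotations` list and the sorting key are illustrative assumptions, not the preprocessing code used to build the training data.

```python
# Hypothetical example of assembling a GO-Language input string.
# The annotation list is made up for illustration only.
annotations = [
    ("involved_in", "GO:0007165"),
    ("located_in", "GO:0042470"),
    ("involved_in", "GO:0006468"),
    ("involved_in", "GO:0070372"),
]

# Sort by the numeric part of the GO ID, then join each relationship with
# its term so every relationship and GO term becomes a separate token.
annotations.sort(key=lambda pair: int(pair[1].split(":")[1]))
go_sentence = " ".join(f"{rel} {term}" for rel, term in annotations)
print(go_sentence)
# involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372
```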
## Model Description

This model was trained using the damlab/uniprot dataset on the `go` field with 256-token chunks and a 15% mask rate.
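The snippet below is a rough sketch of how to inspect that training text; the `train` split name and the presence of a `go` column are assumptions about the dataset layout.

```python
# Sketch: inspect the `go` field of damlab/uniprot.
from datasets import load_dataset

dataset = load_dataset("damlab/uniprot", split="train")
print(dataset[0]["go"])  # e.g. "involved_in GO:... located_in GO:..."
```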
Intended Uses & Limitations
This model is a useful encapsulation of gene ontology functions. It allows both an exploration of gene-level similarities as well as comparisons between functional terms.
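For example, functional similarity between two annotation strings could be explored by embedding them with this model and comparing the embeddings. The mean-pooling and cosine-similarity choices below are illustrative assumptions, not part of the released model.

```python
# Sketch: compare two GO-Language annotation strings by cosine similarity
# of their mean-pooled token embeddings. The pooling choice is an assumption.
import numpy as np
from transformers import pipeline

embedder = pipeline("feature-extraction", model="damlab/GO-language")

def embed(text):
    # The pipeline returns [1 x tokens x hidden]; mean-pool over tokens.
    return np.mean(embedder(text)[0], axis=0)

a = embed("involved_in GO:0006468 involved_in GO:0007165")
b = embed("involved_in GO:0007165 located_in GO:0042470")
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)
```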
## How to Use

As this is a BERT-style masked language model, it can be used to determine the most likely token at a masked position.
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/GO-language")
unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
```

```
[{'score': 0.1040298342704773,
  'token': 103,
  'token_str': 'GO:0002250',
  'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.018045395612716675,
  'token': 21,
  'token_str': 'GO:0005576',
  'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.015035462565720081,
  'token': 50,
  'token_str': 'GO:0000139',
  'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01181247178465128,
  'token': 37,
  'token_str': 'GO:0007165',
  'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01000668853521347,
  'token': 14,
  'token_str': 'GO:0005737',
  'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}]
```
## Training Data

The model was trained on the damlab/uniprot dataset from a randomly initialized model. The Gene Ontology functions were sorted (by ID number) along with their annotating terms.
## Training Procedure

### Preprocessing

All strings were concatenated and chunked into 256-token chunks for training. A random 20% of chunks were held out for validation.
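A minimal sketch of that preprocessing, under the assumption that the published tokenizer and the `go` field of damlab/uniprot are used, could look like the following; the exact chunking code is not published here.

```python
# Sketch: concatenate GO strings, cut into 256-token chunks, and hold out
# a random 20% for validation. Details are assumptions, not the exact code.
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
dataset = load_dataset("damlab/uniprot", split="train")

# Concatenate all `go` strings and tokenize the result.
ids = tokenizer(" ".join(dataset["go"]))["input_ids"]

# Cut the token stream into fixed-length 256-token chunks.
chunks = [ids[i:i + 256] for i in range(0, len(ids), 256)]

# Hold out a random 20% of the chunks for validation.
chunked = Dataset.from_dict({"input_ids": chunks})
split = chunked.train_test_split(test_size=0.2, seed=42)
train_chunks, eval_chunks = split["train"], split["test"]
```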
### Training

Training was performed with the Hugging Face training module using the masked-LM data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a cosine_with_restarts learning-rate schedule; training continued until 3 consecutive epochs did not improve the loss on the held-out dataset.
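A hedged sketch of that setup with the Hugging Face `Trainer` is shown below. The model configuration, the reading of "E-5" as `1e-5`, and the use of `EarlyStoppingCallback` to implement the 3-epoch stopping rule are all assumptions.

```python
# Sketch of the training setup described above. Values marked as assumptions
# are illustrative, not the exact settings used for this model.
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
# Random initial model (architecture details assumed).
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# 15% masking rate, as stated above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="go-language",
    learning_rate=1e-5,            # "E-5" in the card; exact value assumed
    warmup_steps=50_000,           # 50K warm-up steps
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_chunks,    # from the preprocessing sketch above
    eval_dataset=eval_chunks,
    data_collator=collator,
    # Stop once 3 consecutive epochs fail to improve the held-out loss.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```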
## BibTeX Entry and Citation Info
[More Information Needed]