---
license: mit
datasets:
- damlab/uniprot
metrics:
- accuracy
widget:
- text: 'involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'
example_title: 'Function'
---
# GO-Language model
## Table of Contents
- [Summary](#summary)
- [Model Description](#model-description)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [How to Use](#how-to-use)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Preprocessing](#preprocessing)
- [Training](#training)
- [Evaluation Results](#evaluation-results)
- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)
## Summary
This model was built to encode the Gene Ontology (GO) annotation of a protein as a vector representation.
It was trained on a collection of Gene Ontology terms from model organisms.
For each protein, the GO terms were sorted by ID number and combined with their annotation qualifiers (e.g. `is_a`, `enables`, `located_in`).
The tokenizer treats each qualifier and each GO term as a single token.
This model is intended to serve as one half of a translation model between PROT-BERT and GO-Language.
That type of translation model will be useful for predicting the function of novel genes.
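As a quick check of the tokenization scheme, the snippet below (a minimal sketch, assuming the tokenizer files are hosted in the `damlab/GO-language` repository) shows how an annotation string breaks into tokens:
```python
from transformers import AutoTokenizer

# Load the tokenizer hosted alongside this model.
tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")

text = "involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470"

# Each annotation qualifier and each GO term is expected to surface
# as a single token in the vocabulary.
print(tokenizer.tokenize(text))
```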
## Model Description
This model was trained on the `go` field of the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset using 256-token chunks and a 15% masking rate.
## Intended Uses & Limitations
This model provides a compact encapsulation of Gene Ontology functions.
It allows both exploration of gene-level similarities and comparisons between functional terms, as sketched below.
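One option for such comparisons is to embed a GO-annotation string and compare vectors with cosine similarity. The sketch below uses the `feature-extraction` pipeline with mean pooling over tokens; the pooling strategy is an assumption, not something this model card prescribes.
```python
import numpy as np
from transformers import pipeline

extractor = pipeline("feature-extraction", model="damlab/GO-language")

def embed(go_string: str) -> np.ndarray:
    # The pipeline returns a [1 x tokens x hidden] nested list; mean pooling
    # over tokens is one simple (assumed) way to get a fixed-length vector.
    return np.array(extractor(go_string)[0]).mean(axis=0)

a = embed("involved_in GO:0006468 involved_in GO:0007165")
b = embed("involved_in GO:0007165 located_in GO:0042470")

# Cosine similarity between the two annotation vectors.
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```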
## How to Use
As this is a BERT-style masked-language model, it can be used to determine the most likely token at a masked position.
```python
from transformers import pipeline
unmasker = pipeline("fill-mask", model="damlab/GO-language")
unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
[{'score': 0.1040298342704773,
'token': 103,
'token_str': 'GO:0002250',
'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
{'score': 0.018045395612716675,
'token': 21,
'token_str': 'GO:0005576',
'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
{'score': 0.015035462565720081,
'token': 50,
'token_str': 'GO:0000139',
'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
{'score': 0.01181247178465128,
'token': 37,
'token_str': 'GO:0007165',
'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
{'score': 0.01000668853521347,
'token': 14,
'token_str': 'GO:0005737',
'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}
]
```
## Training Data
The model was trained on the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset starting from a randomly initialized model.
For each protein, the Gene Ontology terms were sorted by ID number and paired with their annotation qualifiers, as illustrated below.
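A minimal illustration of that sorting and joining step (a sketch with made-up annotations, not the actual preprocessing script):
```python
# Hypothetical per-protein annotation list: (qualifier, GO term) pairs.
annotations = [
    ("located_in", "GO:0042470"),
    ("involved_in", "GO:0006468"),
    ("involved_in", "GO:0007165"),
]

# Sort by the numeric portion of the GO ID, then interleave each
# qualifier with its term to form the training string.
annotations.sort(key=lambda pair: int(pair[1].split(":")[1]))
go_string = " ".join(f"{qualifier} {term}" for qualifier, term in annotations)

print(go_string)
# involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470
```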
## Training Procedure
### Preprocessing
All strings were concatenated and split into 256-token chunks for training. A random 20% of the chunks were held out for validation, along the lines of the sketch below.
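The chunking could be reproduced roughly as follows (a sketch that assumes the `go` field holds the annotation strings; helper names are illustrative):
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
dataset = load_dataset("damlab/uniprot", split="train")

def tokenize(batch):
    return tokenizer(batch["go"])

def group_texts(batch, block_size=256):
    # Concatenate every tokenized string, then cut into 256-token chunks.
    concatenated = sum(batch["input_ids"], [])
    total = (len(concatenated) // block_size) * block_size
    return {"input_ids": [concatenated[i:i + block_size]
                          for i in range(0, total, block_size)]}

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
chunked = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)

# Hold out a random 20% of the chunks for validation.
splits = chunked.train_test_split(test_size=0.2, seed=42)
```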
### Training
Training was performed with the Hugging Face Trainer using the masked-language-modeling data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a `cosine_with_restarts` learning-rate schedule; training continued until 3 consecutive epochs failed to improve the loss on the held-out dataset.
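A hedged sketch of that setup with the `Trainer` API is shown below; the learning-rate value is an interpretation of "E-5", and loading the published checkpoint stands in for the randomly initialized model used in the original run.
```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
model = AutoModelForMaskedLM.from_pretrained("damlab/GO-language")

# 15% masking rate, as described above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="go-language-mlm",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,                  # the card states "E-5"; 1e-5 is an assumption
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=100,                # upper bound; early stopping ends training
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],       # from the preprocessing sketch above
    eval_dataset=splits["test"],
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```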
## Evaluation Results
[More Information Needed]
## BibTeX Entry and Citation Info
[More Information Needed]