---
license: mit
datasets:
- damlab/uniprot
metrics:
- accuracy
widget:
- text: 'involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'
  example_title: 'Function'
---

# GO-Language model

## Table of Contents

- [Summary](#summary)
- [Model Description](#model-description)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [How to Use](#how-to-use)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
  - [Preprocessing](#preprocessing)
  - [Training](#training)
- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)

## Summary

This model was built to encode the Gene Ontology (GO) description of a protein as a vector representation. It was trained on a collection of Gene Ontology terms from model organisms. For each protein, the GO functions were sorted by ID number and combined with their annotation descriptors (e.g. `is_a`, `enables`, `located_in`). The tokenizer treats each descriptor and each GO term as its own token. This model is intended to be used as one half of a translation model between PROT-BERT and GO-Language; such a translation model would be useful for predicting the function of novel genes.

## Model Description

This model was trained on the `go` field of the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset, using 256-token chunks and a 15% masking rate.

## Intended Uses & Limitations

This model is a useful encapsulation of Gene Ontology functions. It allows both exploration of gene-level similarities and comparisons between functional terms.

## How to Use

As this is a BERT-style masked language model, it can be used to determine the most likely token at a masked position.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/GO-language")

unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")

[{'score': 0.1040298342704773,
  'token': 103,
  'token_str': 'GO:0002250',
  'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.018045395612716675,
  'token': 21,
  'token_str': 'GO:0005576',
  'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.015035462565720081,
  'token': 50,
  'token_str': 'GO:0000139',
  'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01181247178465128,
  'token': 37,
  'token_str': 'GO:0007165',
  'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01000668853521347,
  'token': 14,
  'token_str': 'GO:0005737',
  'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}]
```

## Training Data

The model was trained on the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset, starting from a randomly initialized model. The Gene Ontology functions were sorted by ID number along with their annotation descriptors.

## Training Procedure

### Preprocessing

All strings were concatenated and chunked into 256-token chunks for training. A random 20% of the chunks were held out for validation.

### Training

Training was performed with the Hugging Face training module (`Trainer`) using the masked-language-modeling data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a `cosine_with_restarts` learning-rate schedule, and training continued until three consecutive epochs did not improve the loss on the held-out dataset.
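For concreteness, the sketch below shows how a comparable setup could be assembled with `Trainer`. It is not the original training script: the model configuration, batch size, epoch cap, and the reading of "E-5" as a learning rate of 1e-5 are assumptions, and reusing the published `damlab/GO-language` tokenizer stands in for whatever tokenizer was built during training.

```python
# Minimal sketch of a comparable MLM training setup; not the exact script used for this model.
# Assumptions: the published tokenizer is reused, model config / batch size / epoch cap are
# placeholders, and "E-5" is interpreted as a learning rate of 1e-5.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")  # one token per GO term / descriptor
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))  # randomly initialized

# Tokenize the `go` field, then concatenate everything and cut it into 256-token chunks
raw = load_dataset("damlab/uniprot")

def tokenize(batch):
    return tokenizer(batch["go"])

tokenized = raw.map(tokenize, batched=True, remove_columns=raw["train"].column_names)

block_size = 256

def group_texts(examples):
    # Concatenate all token lists and split them into fixed-size blocks, dropping the remainder
    concatenated = {k: sum(examples[k], []) for k in examples}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [v[i : i + block_size] for i in range(0, total, block_size)]
        for k, v in concatenated.items()
    }

chunked = tokenized.map(group_texts, batched=True)
splits = chunked["train"].train_test_split(test_size=0.2, seed=42)  # hold out 20% of chunks

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="GO-language",
    learning_rate=1e-5,                        # "E-5" in the card, read here as 1e-5
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    num_train_epochs=100,                      # upper bound; early stopping ends training earlier
    per_device_train_batch_size=64,            # placeholder, not reported in the card
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # 3 epochs without improvement
)
trainer.train()
```

Note that `EarlyStoppingCallback` only takes effect when per-epoch evaluation, checkpointing, and `load_best_model_at_end` are enabled, which is why those arguments appear above.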
## BibTeX Entry and Citation Info

[More Information Needed]