---
license: mit
datasets:
- damlab/uniprot
metrics:
- accuracy
widget:
- text: 'involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'
  example_title: 'Function'
---

# GO-Language model

## Table of Contents

- [Summary](#summary)
- [Model Description](#model-description)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [How to Use](#how-to-use)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
  - [Preprocessing](#preprocessing)
  - [Training](#training)
- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)

## Summary

This model was built to encode the Gene Ontology (GO) description of a protein as a vector representation. It was trained on a collection of Gene Ontology terms from model organisms. For each protein, the GO functions were sorted by ID number and combined with their annotation descriptors (e.g. `is_a`, `enables`, `located_in`). The tokenizer treats each descriptor and each GO term as its own token. This model is intended to be used as one half of a translation model between PROT-BERT and GO-Language; such a translation model would be useful for predicting the function of novel genes.

## Model Description

This model was trained on the `go` field of the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset, using 256-token chunks and a 15% masking rate.

## Intended Uses & Limitations

This model is a useful encapsulation of Gene Ontology functions. It allows both exploration of gene-level similarities and comparisons between functional terms.

## How to Use

As this is a BERT-style masked language model, it can be used to determine the most likely token at a masked position.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/GO-language")

unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")

[{'score': 0.1040298342704773,
  'token': 103,
  'token_str': 'GO:0002250',
  'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.018045395612716675,
  'token': 21,
  'token_str': 'GO:0005576',
  'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.015035462565720081,
  'token': 50,
  'token_str': 'GO:0000139',
  'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01181247178465128,
  'token': 37,
  'token_str': 'GO:0007165',
  'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01000668853521347,
  'token': 14,
  'token_str': 'GO:0005737',
  'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}]
```

## Training Data

The model was trained on the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset, starting from a randomly initialized model. The Gene Ontology functions were sorted by ID number along with their annotation descriptors.

## Training Procedure

### Preprocessing

All strings were concatenated and chunked into 256-token chunks for training. A random 20% of the chunks were held out for validation.

### Training

Training was performed with the Hugging Face training module (`Trainer`) using the masked-language-modeling data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a `cosine_with_restarts` learning-rate schedule, and training continued until three consecutive epochs did not improve the loss on the held-out dataset.
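For concreteness, the sketch below shows how a comparable setup could be assembled with `Trainer`. It is not the original training script: the model configuration, batch size, epoch cap, and the reading of "E-5" as a learning rate of 1e-5 are assumptions, and reusing the published `damlab/GO-language` tokenizer stands in for whatever tokenizer was built during training.

```python
# Minimal sketch of a comparable MLM training setup; not the exact script used for this model.
# Assumptions: the published tokenizer is reused, model config / batch size / epoch cap are
# placeholders, and "E-5" is interpreted as a learning rate of 1e-5.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")  # one token per GO term / descriptor
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))  # randomly initialized

# Tokenize the `go` field, then concatenate everything and cut it into 256-token chunks
raw = load_dataset("damlab/uniprot")

def tokenize(batch):
    return tokenizer(batch["go"])

tokenized = raw.map(tokenize, batched=True, remove_columns=raw["train"].column_names)

block_size = 256

def group_texts(examples):
    # Concatenate all token lists and split them into fixed-size blocks, dropping the remainder
    concatenated = {k: sum(examples[k], []) for k in examples}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [v[i : i + block_size] for i in range(0, total, block_size)]
        for k, v in concatenated.items()
    }

chunked = tokenized.map(group_texts, batched=True)
splits = chunked["train"].train_test_split(test_size=0.2, seed=42)  # hold out 20% of chunks

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="GO-language",
    learning_rate=1e-5,                        # "E-5" in the card, read here as 1e-5
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    num_train_epochs=100,                      # upper bound; early stopping ends training earlier
    per_device_train_batch_size=64,            # placeholder, not reported in the card
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # 3 epochs without improvement
)
trainer.train()
```

Note that `EarlyStoppingCallback` only takes effect when per-epoch evaluation, checkpointing, and `load_best_model_at_end` are enabled, which is why those arguments appear above.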
## BibTeX Entry and Citation Info

[More Information Needed]