---
license: mit
datasets:
  - damlab/uniprot
metrics:
  - accuracy
widget:
  - text: >-
      involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470
      involved_in GO:0070372
    example_title: Function
---

# GO-Language model

## Table of Contents

- Summary
- Model Description
- Intended Uses & Limitations
- How to use
- Training Data
- Training Procedure
- BibTeX Entry and Citation Info

## Summary

This model was built to encode the Gene Ontology annotation of a protein as a vector representation. It was trained on a collection of gene-ontology terms from model organisms. Each function was sorted by its GO ID number and combined with its annotation description (is_a, enables, located_in, etc.). The text is tokenized such that each annotation description and each GO term is its own token. The model is intended to serve as one side of a translation model between PROT-BERT and GO-Language; such a translation model would be useful for predicting the function of novel genes.
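A quick way to see this tokenization scheme is to load the tokenizer shipped with the model and tokenize a short GO string. This is a minimal sketch; the exact token splits shown by your install depend on the hosted tokenizer files.

```python
from transformers import AutoTokenizer

# Load the tokenizer hosted alongside the model.
tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")

# Each annotation description and each GO term should come back as a single token.
print(tokenizer.tokenize("involved_in GO:0006468 located_in GO:0042470"))
```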

## Model Description

This model was trained on the `go` field of the damlab/uniprot dataset, using 256-token chunks and a 15% mask rate.
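For reference, the training text can be inspected directly from the dataset. This sketch assumes the dataset exposes a `go` column and a `train` split; adjust if the published schema differs.

```python
from datasets import load_dataset

# Load the UniProt-derived dataset and look at the GO annotation string for one protein.
uniprot = load_dataset("damlab/uniprot", split="train")
print(uniprot[0]["go"])
```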

## Intended Uses & Limitations

This model is a useful encapsulation of gene-ontology functions. It allows both exploration of gene-level similarities and comparison between functional terms.
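As a rough sketch of the gene-level comparison use case, the model's hidden states can be mean-pooled into a per-gene vector and compared with cosine similarity. The example GO strings and the mean-pooling choice below are illustrative assumptions, not part of the original training setup.

```python
import numpy as np
from transformers import pipeline

extractor = pipeline("feature-extraction", model="damlab/GO-language")

def embed(go_sentence):
    # Mean-pool the token-level hidden states into a single vector.
    token_vectors = np.array(extractor(go_sentence)[0])
    return token_vectors.mean(axis=0)

gene_a = embed("involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470")
gene_b = embed("involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")

cosine = np.dot(gene_a, gene_b) / (np.linalg.norm(gene_a) * np.linalg.norm(gene_b))
print(f"cosine similarity: {cosine:.3f}")
```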

## How to use

As this is a BERT-style masked-language model, it can be used to determine the most likely token at a masked position.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/GO-language")

unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
```

```
[{'score': 0.1040298342704773,
  'token': 103,
  'token_str': 'GO:0002250',
  'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.018045395612716675,
  'token': 21,
  'token_str': 'GO:0005576',
  'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.015035462565720081,
  'token': 50,
  'token_str': 'GO:0000139',
  'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01181247178465128,
  'token': 37,
  'token_str': 'GO:0007165',
  'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01000668853521347,
  'token': 14,
  'token_str': 'GO:0005737',
  'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}
]
```

## Training Data

The model was trained on the damlab/uniprot dataset starting from a randomly initialized model. The Gene Ontology functions were sorted by GO ID number along with their annotation terms.
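The sorting step can be illustrated with a small, hypothetical helper: annotation pairs are ordered by GO ID and joined into the space-separated strings the model consumes. The input list below is made up for illustration and is not taken from damlab/uniprot.

```python
# Hypothetical annotation pairs for a single protein.
annotations = [
    ("involved_in", "GO:0007165"),
    ("located_in", "GO:0042470"),
    ("involved_in", "GO:0006468"),
]

def to_go_sentence(pairs):
    # Sort by GO ID, then join each (description, term) pair into one string.
    ordered = sorted(pairs, key=lambda pair: pair[1])
    return " ".join(f"{description} {go_id}" for description, go_id in ordered)

print(to_go_sentence(annotations))
# involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470
```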

## Training Procedure

### Preprocessing

All strings were concatenated and split into 256-token chunks for training. A random 20% of the chunks was held out for validation.
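A sketch of this concatenate-and-chunk preprocessing is shown below. The column name (`go`), chunk handling, and split seed are assumptions and not the exact training script.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

CHUNK_SIZE = 256

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
dataset = load_dataset("damlab/uniprot", split="train")

def tokenize(batch):
    return tokenizer(batch["go"])

def group_into_chunks(batch):
    # Concatenate every tokenized example, then cut into fixed 256-token chunks.
    concatenated = sum(batch["input_ids"], [])
    usable = (len(concatenated) // CHUNK_SIZE) * CHUNK_SIZE
    return {"input_ids": [concatenated[i:i + CHUNK_SIZE] for i in range(0, usable, CHUNK_SIZE)]}

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
chunked = tokenized.map(group_into_chunks, batched=True, remove_columns=tokenized.column_names)

# Hold out a random 20% of chunks for validation.
split = chunked.train_test_split(test_size=0.2, seed=42)
```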

### Training

Training was performed with the Hugging Face training module using the masked-language-modeling data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a cosine_with_restarts learning-rate schedule; training continued until the loss on the held-out dataset failed to improve for 3 consecutive epochs.
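The procedure above roughly corresponds to the Trainer setup sketched below, continuing from the `tokenizer` and `split` produced in the preprocessing sketch. The 1e-5 learning rate, BERT configuration defaults, and epoch cap are assumptions where the card does not pin them down.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Randomly initialized BERT-style model sized to the GO-language vocabulary.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Mask 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="go-language-mlm",
    learning_rate=1e-5,                      # the card says "E-5"; 1e-5 is assumed
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    num_train_epochs=100,                    # upper bound; early stopping ends training earlier
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```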

## BibTeX Entry and Citation Info

[More Information Needed]