---
license: mit
datasets:
- damlab/uniprot
metrics:
- accuracy
widget:
- text: 'involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'
example_title: 'Function'
---
# GO-Language model
## Table of Contents
- [Summary](#summary)
- [Model Description](#model-description)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [How to Use](#how-to-use)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Preprocessing](#preprocessing)
- [Training](#training)
- [Evaluation Results](#evaluation-results)
- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)
## Summary
This model was built to encode the Gene Ontology (GO) annotation of a protein as a vector representation.
It was trained on a collection of Gene Ontology terms from model organisms.
For each protein, the GO terms were sorted by ID number and combined with their annotation qualifiers (e.g. `is_a`, `enables`, `located_in`).
The tokenizer treats each qualifier and each GO term as a single token.
This model is intended to serve as one half of a translation model between PROT-BERT and GO-Language.
That type of translation model will be useful for predicting the function of novel genes.
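As a quick check of the tokenization scheme, the snippet below (a minimal sketch, assuming the tokenizer files are hosted in the `damlab/GO-language` repository) shows how an annotation string breaks into tokens:
```python
from transformers import AutoTokenizer

# Load the tokenizer hosted alongside this model.
tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")

text = "involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470"

# Each annotation qualifier and each GO term is expected to surface
# as a single token in the vocabulary.
print(tokenizer.tokenize(text))
```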
## Model Description
This model was trained on the `go` field of the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset using 256-token chunks and a 15% masking rate.
## Intended Uses & Limitations
This model provides a compact encapsulation of Gene Ontology functions.
It allows both exploration of gene-level similarities and comparisons between functional terms, as sketched below.
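One option for such comparisons is to embed a GO-annotation string and compare vectors with cosine similarity. The sketch below uses the `feature-extraction` pipeline with mean pooling over tokens; the pooling strategy is an assumption, not something this model card prescribes.
```python
import numpy as np
from transformers import pipeline

extractor = pipeline("feature-extraction", model="damlab/GO-language")

def embed(go_string: str) -> np.ndarray:
    # The pipeline returns a [1 x tokens x hidden] nested list; mean pooling
    # over tokens is one simple (assumed) way to get a fixed-length vector.
    return np.array(extractor(go_string)[0]).mean(axis=0)

a = embed("involved_in GO:0006468 involved_in GO:0007165")
b = embed("involved_in GO:0007165 located_in GO:0042470")

# Cosine similarity between the two annotation vectors.
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```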
## How to Use
As this is a BERT-style masked-language model, it can be used to determine the most likely token at a masked position.
```python
from transformers import pipeline
unmasker = pipeline("fill-mask", model="damlab/GO-language")
unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")
[{'score': 0.1040298342704773,
'token': 103,
'token_str': 'GO:0002250',
'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
{'score': 0.018045395612716675,
'token': 21,
'token_str': 'GO:0005576',
'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
{'score': 0.015035462565720081,
'token': 50,
'token_str': 'GO:0000139',
'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
{'score': 0.01181247178465128,
'token': 37,
'token_str': 'GO:0007165',
'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
{'score': 0.01000668853521347,
'token': 14,
'token_str': 'GO:0005737',
'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}
]
```
## Training Data
The model was trained on the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset starting from a randomly initialized model.
For each protein, the Gene Ontology terms were sorted by ID number and paired with their annotation qualifiers, as illustrated below.
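A minimal illustration of that sorting and joining step (a sketch with made-up annotations, not the actual preprocessing script):
```python
# Hypothetical per-protein annotation list: (qualifier, GO term) pairs.
annotations = [
    ("located_in", "GO:0042470"),
    ("involved_in", "GO:0006468"),
    ("involved_in", "GO:0007165"),
]

# Sort by the numeric portion of the GO ID, then interleave each
# qualifier with its term to form the training string.
annotations.sort(key=lambda pair: int(pair[1].split(":")[1]))
go_string = " ".join(f"{qualifier} {term}" for qualifier, term in annotations)

print(go_string)
# involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470
```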
## Training Procedure
### Preprocessing
All strings were concatenated and split into 256-token chunks for training. A random 20% of the chunks were held out for validation, along the lines of the sketch below.
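The chunking could be reproduced roughly as follows (a sketch that assumes the `go` field holds the annotation strings; helper names are illustrative):
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
dataset = load_dataset("damlab/uniprot", split="train")

def tokenize(batch):
    return tokenizer(batch["go"])

def group_texts(batch, block_size=256):
    # Concatenate every tokenized string, then cut into 256-token chunks.
    concatenated = sum(batch["input_ids"], [])
    total = (len(concatenated) // block_size) * block_size
    return {"input_ids": [concatenated[i:i + block_size]
                          for i in range(0, total, block_size)]}

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
chunked = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)

# Hold out a random 20% of the chunks for validation.
splits = chunked.train_test_split(test_size=0.2, seed=42)
```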
### Training
Training was performed with the Hugging Face Trainer using the masked-language-modeling data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a `cosine_with_restarts` learning-rate schedule; training continued until 3 consecutive epochs failed to improve the loss on the held-out dataset.
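A hedged sketch of that setup with the `Trainer` API is shown below; the learning-rate value is an interpretation of "E-5", and loading the published checkpoint stands in for the randomly initialized model used in the original run.
```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
model = AutoModelForMaskedLM.from_pretrained("damlab/GO-language")

# 15% masking rate, as described above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="go-language-mlm",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,                  # the card states "E-5"; 1e-5 is an assumption
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=100,                # upper bound; early stopping ends training
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],       # from the preprocessing sketch above
    eval_dataset=splits["test"],
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```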
## Evaluation Results
[More Information Needed]
## BibTeX Entry and Citation Info
[More Information Needed]