---
license: mit
---

# PseudoGenius

PseudoGenius is a BERT-based transformer model fine-tuned to classify gene sequences as either 'normal' or 'pseudogene'. It was trained on Mycobacterium leprae, chosen for its abundance of pseudogenes, and has shown consistent results on other Mycobacterium species.

## Model Description

This model was trained on a dataset extracted from Mycobacterium leprae, using DNA sequences concatenated with their respective protein sequences (separated by a tab) as inputs.

More information on training, development, and usage is available in the GitHub repository: https://github.com/jimnoneill/PseudoGenius
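
If you only have a coding DNA sequence, the input format can be assembled by translating it and joining the two strings with a tab. The sketch below is illustrative only: it assumes standard translation via Biopython, and the `build_input` helper is hypothetical and may differ from the preprocessing used in the PseudoGenius repository.

```python
from Bio.Seq import Seq

def build_input(dna: str) -> str:
    """Concatenate a DNA sequence with its translated protein, tab-separated (illustrative helper, not from the repo)."""
    protein = str(Seq(dna).translate(to_stop=True))
    return f"{dna}\t{protein}"

print(build_input("ATGGCTGCGTAA"))  # -> "ATGGCTGCGTAA\tMAA"
```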

## Intended Use

The model is intended for researchers and biologists who need to classify gene sequences quickly. While it performs well on Mycobacterium species, it has not been tested on species with lower GC content, such as E. coli, so users should exercise caution when applying it outside Mycobacterium.
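
As a quick sanity check before applying the model to a new organism, you can compare its GC content to that of the training organism (Mycobacterium leprae sits at roughly 58% GC). The helper below is a simple illustrative sketch and is not part of the model or its repository.

```python
def gc_content(dna: str) -> float:
    """Fraction of G and C bases in a DNA sequence (illustrative helper)."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)

print(f"GC content: {gc_content('ATGCGTGGCCGA'):.1%}")  # e.g. 66.7%
```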

## How to Use

To use the model, concatenate the DNA sequence and protein sequence of a gene, separated by a tab character, and feed the resulting string to the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and fine-tuned classification model
tokenizer = AutoTokenizer.from_pretrained("PseudoGenius")
model = AutoModelForSequenceClassification.from_pretrained("PseudoGenius")

# Example DNA and protein sequence, separated by a tab
sequence = "ATGCGT\tMVKVYAPASSANMSVGFDVLGAAVTPVD"

# Tokenize the input and run a forward pass
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)

# The outputs are raw logits; apply a softmax to obtain class probabilities
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
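
To turn the probabilities into a class label, you can take the argmax and map it through the model's label dictionary. The snippet below is a sketch: the id-to-label ordering for 'normal' vs. 'pseudogene' is an assumption, so check `model.config.id2label` or the PseudoGenius repository for the actual mapping.

```python
# Continues from the example above; the label mapping is assumed, not confirmed.
predicted_id = int(probabilities.argmax(dim=-1).item())
label = model.config.id2label.get(predicted_id, str(predicted_id))
print(f"Predicted class: {label} (p={probabilities[0, predicted_id].item():.3f})")
```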

## Limitations and Bias

The model was trained on a specific dataset with particular characteristics. It might not generalize well to organisms with different genomic properties, such as a significantly different GC content.

## Training Data

The model was trained on a dataset of DNA and protein sequences from Mycobacterium leprae, with each DNA sequence and its corresponding protein sequence concatenated using a tab character.

## Training Procedure

The model was fine-tuned from a DNA BERT (bert-base-uncased) base model for 3 epochs, with a batch size of 8 and a learning rate of 2e-5.
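
For reference, the reported hyperparameters could be expressed with the Hugging Face `Trainer` API as sketched below. This is illustrative only: the actual training script lives in the PseudoGenius GitHub repository, and the `output_dir` name here is hypothetical.

```python
from transformers import TrainingArguments

# Hyperparameters reported in this card: 3 epochs, batch size 8, lr 2e-5
training_args = TrainingArguments(
    output_dir="pseudogenius-finetune",  # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)
```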

## Evaluation Results

The model achieved a precision, recall, and F1 score of 1.0 on the test set, indicating that it classified the gene sequences with high accuracy. However, these results should be validated with additional testing, particularly on more diverse datasets.
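
If you want to validate the model on your own labelled data, standard classification metrics can be computed with scikit-learn as sketched below. The label lists are placeholders, and the 0/1 encoding of 'normal' vs. 'pseudogene' is an assumption to be checked against the model's configuration.

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder labels: ground truth vs. argmax of the model's logits
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 1, 0]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```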