PseudoGenius

PseudoGenius is a BERT-based transformer model fine-tuned to classify gene sequences as either 'normal' or 'pseudogene'. It was trained on Mycobacterium leprae, chosen for its abundance of pseudogenes, and has shown consistent results on other Mycobacterium species.

Model Description

This model was trained on a dataset extracted from Mycobacterium leprae. Each input is a gene's DNA sequence concatenated with its protein sequence, separated by a tab. More information on training, development and usage can be found in our GitHub repo: https://github.com/jimnoneill/PseudoGenius

Intended Use

The model is intended for researchers and biologists who wish to classify gene sequences quickly. While it performs well on Mycobacterium species, it has not been tested on species with a lower GC content, such as E. coli, and users should exercise caution.

How to Use

To use the model, concatenate the DNA sequence and protein sequence of a gene, separated by a tab, and feed this as input to the model.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("PseudoGenius")
model = AutoModelForSequenceClassification.from_pretrained("PseudoGenius")

# Example DNA and protein sequence, separated by a tab
sequence = "ATGCGT\tMVKVYAPASSANMSVGFDVLGAAVTPVD"

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)

# The outputs are raw logits; apply a softmax function to obtain probabilities
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
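
Sequences copied from FASTA files often carry line breaks or stray spaces, so it can help to normalize them before joining. The helper below is a minimal sketch of the tab-separated input format described above; the name `build_model_input` is hypothetical and not part of this repository, and stripping whitespace is an assumption about how inputs should be cleaned rather than a documented preprocessing step.

```python
def build_model_input(dna: str, protein: str) -> str:
    """Join a gene's DNA and protein sequences with a single tab.

    Hypothetical helper: whitespace removal is an assumption, not a
    documented preprocessing step of the model.
    """
    # Remove spaces, tabs and line breaks (e.g. from wrapped FASTA records)
    dna = "".join(dna.split())
    protein = "".join(protein.split())
    if not dna or not protein:
        raise ValueError("both a DNA and a protein sequence are required")
    return f"{dna}\t{protein}"


sequence = build_model_input("ATG CGT\n", "MVKVYAPASSANMSVGFDVLGAAVTPVD")
```

The resulting string can be passed to the tokenizer in place of a hand-written one.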

Limitations and Bias

The model was trained only on sequences from Mycobacterium leprae. It might not generalize well to organisms with different genomic properties, such as a significantly different GC content.
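
Because GC content is the main caveat named here, one simple precaution is to measure it for your own sequences before relying on a prediction. The sketch below is not part of the model; the function names, the reference value and the tolerance are all illustrative assumptions that you would replace with figures appropriate to your organism.

```python
def gc_content(dna: str) -> float:
    """Return the fraction of A/C/G/T bases that are G or C."""
    seq = dna.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous bases such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / total


def outside_reference_range(dna: str, reference_gc: float, tolerance: float) -> bool:
    """Flag a sequence whose GC content differs from a user-supplied reference.

    reference_gc and tolerance are hypothetical parameters chosen by the user.
    """
    return abs(gc_content(dna) - reference_gc) > tolerance


print(gc_content("ATGCGT"))
```

A flagged sequence is not necessarily misclassified; the check only marks inputs that differ from the kind of data the model has been evaluated on.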

Training Data

The model was trained on a dataset consisting of DNA and protein sequences from Mycobacterium leprae. The sequences are concatenated using a tab character.

Training Procedure

The model was fine-tuned from a DNA BERT (bert-base-uncased) model for 3 epochs, with a batch size of 8 and a learning rate of 2e-5.

Evaluation Results

The model achieved a precision, recall, and F1 score of 1.0 on the test set, meaning it classified every test sequence correctly. However, these results should be validated with additional testing, particularly on diverse datasets.
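
For readers who want to run that additional validation on their own labeled sequences, the three metrics can be computed directly from true and predicted labels. This is a minimal sketch in plain Python, not the evaluation script used for the reported results; treating 'pseudogene' as the positive class is an assumption, and the example labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive="pseudogene"):
    """Compute precision, recall and F1 for one class.

    Assumes 'pseudogene' is the positive class; pass another label to change it.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels for illustration only
y_true = ["pseudogene", "normal", "pseudogene", "normal"]
y_pred = ["pseudogene", "pseudogene", "normal", "normal"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With one true positive, one false positive and one false negative, all three values in this toy example are 0.5.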