---
license: mit
---
# PseudoGenius

PseudoGenius is a BERT-based transformer model fine-tuned to classify gene sequences as either 'normal' or 'pseudogene'. It was trained on Mycobacterium leprae, chosen for its abundance of pseudogenes, and has shown consistent results on other Mycobacterium species.

## Model Description

This model was trained on a dataset extracted from Mycobacterium leprae. Each input is a gene's DNA sequence concatenated with its protein sequence, separated by a tab.
More information on training, development, and usage is available in the GitHub repository: https://github.com/jimnoneill/PseudoGenius

## Intended Use

The model is intended for researchers and biologists who wish to classify gene sequences quickly. While it performs well on Mycobacterium species, it has not been tested on species with a lower GC content, such as E. coli, and users should exercise caution.

## How to Use

To use the model, concatenate the DNA sequence and protein sequence of a gene, separated by a tab, and feed this as input to the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("PseudoGenius")
model = AutoModelForSequenceClassification.from_pretrained("PseudoGenius")

# Example input: DNA sequence and protein sequence, separated by a tab
sequence = "ATGCGT\tMVKVYAPASSANMSVGFDVLGAAVTPVD"

inputs = tokenizer(sequence, return_tensors="pt")

# Gradients are not needed for inference
with torch.no_grad():
    outputs = model(**inputs)

# The outputs are raw logits; apply softmax to obtain class probabilities
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
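
The tab-separated input format is easy to get wrong by hand, for example through stray whitespace or a missing sequence. The following is a minimal sketch of a helper that builds the input string; `build_input` is a hypothetical name used for illustration and is not part of the PseudoGenius code.

```python
def build_input(dna: str, protein: str) -> str:
    """Join a gene's DNA and protein sequences with a single tab.

    Hypothetical helper for illustration; not part of the PseudoGenius code.
    """
    dna = dna.strip()
    protein = protein.strip()
    if not dna or not protein:
        raise ValueError("both the DNA and the protein sequence are required")
    for name, seq in (("DNA", dna), ("protein", protein)):
        # Whitespace inside a sequence would break the single-tab format.
        if any(ch.isspace() for ch in seq):
            raise ValueError(f"{name} sequence contains whitespace")
    return f"{dna}\t{protein}"


sequence = build_input("ATGCGT", "MVKVYAPASSANMSVGFDVLGAAVTPVD")
```

The resulting `sequence` can be passed to the tokenizer in place of a hand-written string.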

## Limitations and Bias

The model was trained on a specific dataset with particular characteristics. It might not generalize well to organisms with different genomic properties, such as a significantly different GC content.
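
Since GC content is the genomic property singled out here, it can help to measure it for your own sequences before trusting a prediction. The sketch below computes the GC fraction of a DNA string; `gc_content` is a hypothetical helper, and this card does not specify any threshold at which predictions become unreliable.

```python
def gc_content(dna: str) -> float:
    """Return the fraction of G and C among the A/C/G/T bases in `dna`.

    Hypothetical helper for illustration; ambiguous bases such as N are ignored.
    """
    seq = dna.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / counted


print(gc_content("ATGCGT"))  # 0.5
```

Comparing this value against the GC content of Mycobacterium sequences gives a rough indication of how far an input lies from the training distribution.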

## Training Data

The model was trained on a dataset consisting of DNA and protein sequences from Mycobacterium leprae. The sequences are concatenated using a tab character.

## Training Procedure

The model was fine-tuned from a DNA BERT (bert-base-uncased) model for 3 epochs, with a batch size of 8 and a learning rate of 2e-5.
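
For readers who want to reproduce a comparable setup, the stated hyperparameters could be expressed with the Hugging Face `Trainer` API roughly as follows. This is a sketch under assumptions, not the original training script: the output directory is a placeholder, the batch size is assumed to be per device on a single device, and all other settings are left at library defaults.

```python
# Hyperparameters stated in this model card; everything else is an assumption.
HYPERPARAMS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,  # assumes training on a single device
    "learning_rate": 2e-5,
}


def make_training_args(output_dir: str = "pseudogenius-finetune"):
    """Build TrainingArguments from the stated hyperparameters.

    `output_dir` is a hypothetical path chosen for illustration.
    """
    # Imported lazily so the hyperparameters can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

The returned object can then be passed to `Trainer` together with a model and tokenized datasets.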

## Evaluation Results

The model achieved a precision, recall, and F1 score of 1.0 on the test set, meaning every test sequence was classified correctly. However, these results should be validated with additional testing, particularly on more diverse datasets.
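
When validating on additional data, the same three metrics can be computed directly from true and predicted labels. The sketch below is a plain-Python illustration; the function name and the label encoding (1 for 'pseudogene', 0 for 'normal') are assumptions, so check the model's label mapping before using it.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class.

    Assumes labels are encoded as integers; 1 = 'pseudogene' is an assumption.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only, not results from this model.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

A score of 1.0 on all three metrics corresponds to zero false positives and zero false negatives.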