---
license: cc-by-nc-sa-4.0
widget:
- text: ATTTTGGAAATCGGAAATCTTCTGTGTTTATCCGTACCAATCCCTGGATCAGTAGTTGGTACACACATACATAGTTGGATCACAACATATCACGAATGAATTTAGGCTAAAAGAGTTAAATACTTACATTAGGGCCAGGTAGGCCAACAACTATCAGAGAACAACAGCGGAAGACAAAATAATATAAGGGCCCGGTTAACATGCCACAAGCAGTCGACTGGGGAACGAGACCTAGAACAAGACCGCACTCCGATCATCTTGTGGGATACGCAAGCGTACCGACAAGGGCTTCTCTTCAACACTCTCCTAAAAGATATATAAATAGCAAGGGTGAGTACCAACCGTACTCAGCAAGCCACCACAACAACAATGCGTATGATAGAGGGTATTTCAAGGAATGGCTTCAGGTTCTTTTGCATAAAGCTAATTTTACAATTCTTTTCACAAGCCTAAAACCTAGCATAGACTGATCAAATTTTAGTACCAGTGTTCACTTTAAACAACGACGGTTCTGTCCACCATCCATTGTGATCCCAAGGATAGCTTCCCGCCATTGAATCGTCATGGTTTTCTAAGGATGTCCACCTTCCCTCCTCTCGGGAAGTGGCTCCATCAGCATAAAATTCATCATGCAATATCCCATCCCCCACAAGTTAAAAATTTAGAGTCTAGCCAAGTGTAATACATGTCCCGGTGCTCAATAACCGCGAGCACGGCTATTCGAATAGATTTGGTTTACTCACACTGCAGTGGATGTACACTTTACCCGCACTCCGCAACTGCCCAACACATGAGCCTCGTCCGAACACATGGGACGCGTCACGGCAAAGCTTTTCGATAACCTCGCATTGGTAGTACCCGCTCCATGAACTTAAATCCTCATGCACTCTAGGCGTCCATGTTTCTAGCAGTGAGAGGAGTTCTGGCGCTCCCGGGAAAGAGAAGTCTCACACGCATATTAAATTATGGTTCAAGTTAAGTTCTCTCTCTCACACACTCA
tags:
- DNA
- biology
- genomics
---
# Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All models are of comparable size, between 90 MB and 150 MB. Each uses a BPE tokenizer with a vocabulary of 8,000 tokens.
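
To illustrate how byte-pair encoding (BPE) builds multi-nucleotide tokens from DNA, here is a minimal pure-Python sketch. It is a toy illustration only, not the tokenizer shipped with these models: the function names `learn_bpe_merges`, `merge_pair`, and `tokenize` are hypothetical, and the tie-breaking rule is an assumption made for determinism.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules, starting from single nucleotides."""
    corpus = [list(seq.upper()) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties go to the lexicographically
        # smallest pair (an arbitrary choice made here for reproducibility).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def tokenize(seq, merges):
    """Apply the learned merge rules, in order, to a new sequence."""
    symbols = list(seq.upper())
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe_merges(["ATATATAT"], num_merges=2)
print(merges)                        # [('A', 'T'), ('AT', 'AT')]
print(tokenize("GGATATCC", merges))  # ['G', 'G', 'ATAT', 'C', 'C']
```

A production tokenizer repeats this merging until the vocabulary reaches its target size (8,000 tokens for these models), which is why frequent genomic motifs end up as single tokens.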

**Developed by:** zhangtaolab

### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [PDLLMs: A group of tailored DNA large language models for analyzing plant genomes]()

### Architecture

The model is based on the state-space Mamba-130m model, with a modified tokenizer specific to DNA sequences.

This model is fine-tuned for predicting sequence conservation.

### How to use

Install the runtime libraries first:
```bash
pip install transformers
pip install 'causal-conv1d<=1.2.0'
pip install 'mamba-ssm<2.0.0'
```

Since the `transformers` library (version < 4.43.0) does not provide a `MambaForSequenceClassification` class, we wrote a script to train the Mamba model for sequence classification.
Inference code can be found in our [GitHub repository](https://github.com/zhangtaolab/plant_DNA_LLMs).
Note that the Plant DNAMamba model requires an NVIDIA GPU to run.
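
Before running the inference code, it can help to confirm that the required packages are importable and that a CUDA GPU is visible. The snippet below is a minimal sketch of such a check, not part of the repository: `check_environment` is a hypothetical helper, the import names `causal_conv1d` and `mamba_ssm` are assumed to correspond to the pip packages above, and `torch` is assumed to be present as a dependency of `mamba-ssm`.

```python
import importlib.util

# Import names assumed for the pip packages listed above, plus torch.
REQUIRED_MODULES = ("transformers", "causal_conv1d", "mamba_ssm", "torch")


def check_environment():
    """Report which required modules are installed and whether CUDA is usable."""
    report = {
        name: importlib.util.find_spec(name) is not None
        for name in REQUIRED_MODULES
    }
    cuda_ok = False
    if report["torch"]:
        try:
            import torch

            cuda_ok = bool(torch.cuda.is_available())
        except Exception:
            # A broken torch install is treated the same as "no GPU".
            cuda_ok = False
    report["cuda_gpu"] = cuda_ok
    return report


for item, ok in check_environment().items():
    print(f"{item}: {'OK' if ok else 'missing'}")
```

If `cuda_gpu` is reported as missing, the model is not expected to run on that machine.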

### Training data
We use a custom `MambaForSequenceClassification` script to fine-tune the model.
The detailed training procedure can be found in our manuscript.

#### Hardware
The model was trained on an NVIDIA GeForce RTX 4090 GPU (24 GB).