---
license: cc-by-nc-sa-4.0
widget:
- text: GCGACTCCGCCGCCCCGATCTCCCCGTCGTCCTACAGTGCTCTCCACATCGTAGGCGACCTGGTTGGACTCCTCGACGCCTTGTCCCTACCGCAGGTGTTTGTGGTGGGACAAGGCTGGGGAGCCCTGCTGGCGTGGAACCTCTGCATGTTCCGCCCCGAGCGGGTGCGCGCGCTGGTCAACATGAGCGTCGCCTTCATGCCGCGCAACCCCTCCGTGAAGCCACTTGAGTTGTTTCGGCGGCTCTACGGCGACGGATACTACCTCCTCCGGCTGCAGGAAC
tags:
- DNA
- biology
- genomics
---

# Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All models are of comparable size (between 90 MB and 150 MB) and use a BPE tokenizer with a vocabulary of 8,000 tokens.

**Developed by:** zhangtaolab

### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA language models in plant genomes]()

### Architecture

The model is based on the state-space Mamba-130m model, with a modified tokenizer specific to DNA sequences.

This model is fine-tuned to predict the H3K4me3 histone modification.

### How to use

Install the runtime libraries first:

```bash
# Quote the version specifiers so the shell does not treat "<" as a redirection
pip install transformers
pip install 'causal-conv1d<=1.2.0'
pip install 'mamba-ssm<2.0.0'
```

Since the `transformers` library (version < 4.43.0) does not provide a `MambaForSequenceClassification` class, we wrote a script to train the Mamba model for sequence classification.
Inference code can be found in our [GitHub](https://github.com/zhangtaolab/plant_DNA_LLMs) repository.

Note that the Plant DNAMamba model requires an NVIDIA GPU to run.

### Training data

We use a custom `MambaForSequenceClassification` script to fine-tune the model.
The detailed training procedure can be found in our manuscript.

#### Hardware

The model was trained on an NVIDIA RTX 4090 GPU (24 GB).
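
### Input and output handling sketch

As a complement to the "How to use" section, the snippet below is a minimal, dependency-free sketch of two steps that usually surround a call to a sequence classification model: cleaning a raw DNA string before tokenization, and turning raw logits into probabilities. It is not taken from the repository's inference script. The helper names `clean_dna` and `softmax` are hypothetical, and the accepted alphabet (uppercase `A`, `C`, `G`, `T`, plus `N` for unknown bases) is an assumption based on the uppercase widget example above. The number and meaning of the output classes are not specified here, so check the repository before interpreting the probabilities.

```python
import math

# Assumption: the tokenizer expects uppercase bases; "N" is tolerated as an unknown base.
VALID_BASES = set("ACGTN")


def clean_dna(seq: str) -> str:
    """Remove whitespace, convert to uppercase, and check the alphabet."""
    cleaned = "".join(seq.split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    unexpected = sorted(set(cleaned) - VALID_BASES)
    if unexpected:
        raise ValueError(f"unexpected characters: {unexpected}")
    return cleaned


def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical input: a short fragment with mixed case and line breaks.
    print(clean_dna("gcgactcc\nGCCGCCCC gatc"))
    # Hypothetical logits, standing in for a model's raw output.
    print(softmax([0.0, 0.0]))
```

The cleaned string can then be passed to the tokenizer and model loaded by the inference code in the GitHub repository.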