	---
	license: cc-by-nc-sa-4.0
	widget:
	- text: AGTCGCCGCAACCCACACACGGACGGCTCGACGTGGCGATCTTAGCGGCTCATCCGCCCGGCCTCCCTCGCGCTCGATCGCTACGCAGCCTACGCTCGTTTCGCTCGGTTCGGTGGGTCGCCGATCTGGCGCCACGGCGGCTACCAACGACACCGCGATTGAGAAGGGTGCGTGGCCGTGGAGTCGTGGAGAAACGCCCGCGCGCGCGGGTGCGGCGAGGGACGACGACCGCGTCGTGCGGATCGATTGGCGGGGCAGCTCGGCGCCCCG
	tags:
	- DNA
	- biology
	- genomics
	---
	# Plant foundation DNA large language models

	The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
	All the models are of comparable size (between 90 MB and 150 MB) and use a BPE tokenizer with a vocabulary of 8000 tokens.


	Developed by: zhangtaolab

	### Model Sources

	- Repository: [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
	- Manuscript: [PDLLMs: A group of tailored DNA large language models for analyzing plant genomes]()

	### Architecture

	The model is based on the state-space Mamba-130m model, with a modified tokenizer specific to DNA sequences.

	This model is fine-tuned for predicting open chromatin.

	### How to use

	Install the runtime libraries first:
	```bash
	pip install transformers
	# Quote the version specifiers so the shell does not treat "<" as input redirection
	pip install 'causal-conv1d<=1.2.0'
	pip install 'mamba-ssm<2.0.0'
	```

	Because the `transformers` library (version < 4.43.0) does not provide a `MambaForSequenceClassification` class, we wrote a script to train the Mamba model for sequence classification.
	Inference code is available in our [GitHub repository](https://github.com/zhangtaolab/plant_DNA_LLMs).
	Note that the Plant DNAMamba model requires an NVIDIA GPU to run.
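
Before running the inference code, it can help to confirm that the installed packages match the version limits listed above. The following is a minimal sketch using only the Python standard library. The helper names (`parse_version`, `satisfies`, `check_environment`, `needs_custom_classifier`) are hypothetical and not part of the repository, and the version parsing is a simplification that looks only at the leading numeric components and ignores pre-release and post-release suffixes.

```python
import re
from importlib import metadata

# Version limits copied from the install commands in this README.
LIMITS = {
    "causal-conv1d": ("<=", (1, 2, 0)),
    "mamba-ssm": ("<", (2, 0, 0)),
}


def parse_version(text):
    """Return the first three numeric release components as a tuple.

    Simplification: suffixes such as '.post2' or 'rc1' are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    parts = [int(p) for p in match.group(0).split(".")][:3]
    parts += [0] * (3 - len(parts))  # pad '2.0' to (2, 0, 0)
    return tuple(parts)


def satisfies(installed, op, limit):
    """Check an installed version string against a '<=' or '<' limit."""
    version = parse_version(installed)
    return version <= limit if op == "<=" else version < limit


def needs_custom_classifier(transformers_version):
    """True for the versions this README describes as lacking the class."""
    return parse_version(transformers_version) < (4, 43, 0)


def check_environment():
    """Map each package to True/False (limit met or not) or None (not installed)."""
    report = {}
    for name, (op, limit) in LIMITS.items():
        try:
            report[name] = satisfies(metadata.version(name), op, limit)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    print(check_environment())
```

A value of `None` in the report means the package is not installed. This check does not verify that an NVIDIA GPU is present.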


	### Training data
	We use a custom `MambaForSequenceClassification` script to fine-tune the model.
	The detailed training procedure is described in our manuscript.


	#### Hardware
	The model was trained on an NVIDIA RTX 4090 GPU (24 GB).