---
license: cc-by-nc-sa-4.0
widget:
- text: AAAAGCGACATGACCAAACTGCCCCTCACCCGCCGCACTGATGACCGA
tags:
- DNA
- biology
- genomics
datasets:
- zhangtaolab/plant_reference_genomes
---
# Plant foundation DNA large language models
The plant DNA large language models (LLMs) comprise a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All models are of comparable size, between 90 MB and 150 MB. A BPE tokenizer is used for tokenization, with a vocabulary of 8000 tokens.
**Developed by:** zhangtaolab
### Model Sources
- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA language models in plant genomes]()
### Architecture
The model is based on the OpenAI GPT-2 architecture, with a modified tokenizer specific to DNA sequences.
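
To give a feel for how BPE tokenization behaves on DNA, the toy sketch below learns a few merge rules from two short sequences by repeatedly merging the most frequent adjacent pair. It is only an illustration of the general BPE idea: the function names `apply_merge` and `learn_bpe_merges` are hypothetical, and the released tokenizer (8000 tokens, trained on plant genomes) was not necessarily built this way.

```python
from collections import Counter


def apply_merge(tokens, a, b):
    """Replace every adjacent occurrence of (a, b) in `tokens` with the merged token a+b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from DNA strings (toy version)."""
    # Start with every sequence split into single nucleotides.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        # Merge the most frequent pair into a single new token.
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        corpus = [apply_merge(toks, a, b) for toks in corpus]
    return merges, corpus


merges, tokenized = learn_bpe_merges(["ATATACGG", "ATATGGCC"], num_merges=3)
print(merges)
print(tokenized)
```

Frequent k-mers end up as single tokens, so a fixed token budget (such as the 512-token limit used here) can cover more nucleotides than single-base tokenization would.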
### How to use
Install the runtime library first:
```bash
pip install transformers torch
```
Here is a simple example for inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = 'plant-dnagpt'
# load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
# example sequence and tokenization
sequences = ['ATATACGGCCGNC','GGGTATCGCTTCCGAC']
tokens = tokenizer(sequences,padding="longest")['input_ids']
print(f"Tokenized sequence: {tokenizer.batch_decode(tokens)}")
# inference
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
inputs = tokenizer(sequences, truncation=True, padding='max_length', max_length=512,
                   return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
outs = model(
    **inputs,
    output_hidden_states=True
)
# get the final layer embeddings and prediction logits
# (move tensors to the CPU first; .numpy() fails on CUDA tensors)
embeddings = outs['hidden_states'][-1].detach().cpu().numpy()
logits = outs['logits'].detach().cpu().numpy()
```
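
The `embeddings` array above has one vector per token, including padded positions. If a single vector per sequence is needed, one common option is to average the token embeddings while ignoring padding. The sketch below shows this with synthetic NumPy arrays standing in for `embeddings` and `inputs['attention_mask']`; the helper name `masked_mean_pool` is hypothetical, and mean pooling is an assumption here, not necessarily what the authors use downstream.

```python
import numpy as np


def masked_mean_pool(hidden, attention_mask):
    """Average token embeddings per sequence, ignoring padded positions.

    hidden:         (batch, seq_len, dim) array, e.g. the last hidden layer
    attention_mask: (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)          # guard against all-padding rows
    return summed / counts


# Synthetic stand-ins for the model outputs (assumption: 2 sequences, 3 tokens, dim 2).
hidden = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)
```

With the real outputs, the call would be `masked_mean_pool(embeddings, inputs['attention_mask'].cpu().numpy())`.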
### Training data
The model is pre-trained with the causal language modeling (CausalLM) objective; tokenized sequences have a maximum length of 512 tokens.
The detailed training procedure can be found in our manuscript.
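
As a rough illustration of the causal language modeling objective, the sketch below computes the next-token cross-entropy loss in NumPy: the logits at position t are scored against the token at position t+1, and padded positions are excluded. The function name `causal_lm_loss` and the mask-based handling of padding are assumptions for illustration; the actual training code is described in the manuscript and repository.

```python
import numpy as np


def causal_lm_loss(logits, input_ids, attention_mask):
    """Mean next-token cross-entropy over non-padded target positions.

    logits:         (batch, seq_len, vocab)
    input_ids:      (batch, seq_len) integer token ids
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Position t predicts token t+1, so drop the last logit and the first label.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = attention_mask[:, 1:].astype(np.float64)

    # Numerically stable log-softmax over the vocabulary axis.
    z = shift_logits - shift_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    # Pick the log-probability assigned to each true next token.
    picked = np.take_along_axis(log_probs, shift_labels[..., None], axis=-1)[..., 0]
    return float(-(picked * shift_mask).sum() / shift_mask.sum())


# Synthetic check: with uniform logits the loss equals ln(vocab_size).
rng = np.random.default_rng(0)
vocab_size = 8000
uniform_logits = np.zeros((2, 5, vocab_size))
ids = rng.integers(0, vocab_size, size=(2, 5))
mask = np.ones((2, 5), dtype=np.int64)
loss = causal_lm_loss(uniform_logits, ids, mask)
print(loss)
```

An untrained model that spreads probability evenly over the 8000-token vocabulary would score about ln(8000) on this loss; pre-training drives the value below that baseline.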
#### Hardware
The model was pre-trained on an NVIDIA RTX 4090 GPU (24 GB).