---

license: apache-2.0
---


## Model Overview

PlantCaduceus is a DNA language model pre-trained on 16 angiosperm genomes. It builds on the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and is trained with a masked language modeling objective, so that it learns evolutionary conservation and DNA sequence grammar from 16 species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models of varying sizes:

- **[PlantCaduceus_l20](https://huggingface.co/kuleshov-group/PlantCaduceus_l20)**: 20 layers, 384 hidden size, 20M parameters
- **[PlantCaduceus_l24](https://huggingface.co/kuleshov-group/PlantCaduceus_l24)**: 24 layers, 512 hidden size, 40M parameters
- **[PlantCaduceus_l28](https://huggingface.co/kuleshov-group/PlantCaduceus_l28)**: 28 layers, 768 hidden size, 112M parameters
- **[PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)**: 32 layers, 1024 hidden size, 225M parameters
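
As a rough illustration of choosing among the checkpoints listed above, the sketch below stores their specifications in a dictionary and returns the largest model that fits a parameter budget. The `CHECKPOINTS` table and the `pick_checkpoint` helper are hypothetical names introduced here for illustration; they are not part of the PlantCaduceus code. The numbers are copied from the list above.

```python
# Checkpoint specs copied from the list above; parameter counts are in millions.
# CHECKPOINTS and pick_checkpoint are illustrative names, not part of any library.
CHECKPOINTS = {
    "kuleshov-group/PlantCaduceus_l20": {"layers": 20, "hidden_size": 384, "params_m": 20},
    "kuleshov-group/PlantCaduceus_l24": {"layers": 24, "hidden_size": 512, "params_m": 40},
    "kuleshov-group/PlantCaduceus_l28": {"layers": 28, "hidden_size": 768, "params_m": 112},
    "kuleshov-group/PlantCaduceus_l32": {"layers": 32, "hidden_size": 1024, "params_m": 225},
}


def pick_checkpoint(max_params_m: float) -> str:
    """Return the largest listed checkpoint whose parameter count fits the budget."""
    fitting = [name for name, spec in CHECKPOINTS.items() if spec["params_m"] <= max_params_m]
    if not fitting:
        raise ValueError(f"no checkpoint has at most {max_params_m}M parameters")
    return max(fitting, key=lambda name: CHECKPOINTS[name]["params_m"])


# Example: with a budget of 50M parameters, the 40M-parameter model is selected.
model_path = pick_checkpoint(50)
```

The returned string can be passed as `model_path` to `from_pretrained`.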

## How to use
```python
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer
import torch

model_path = 'kuleshov-group/PlantCaduceus_l24'
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the masked-LM checkpoint and its tokenizer (the custom model code requires trust_remote_code)
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize a DNA sequence
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False,
)
input_ids = encoding["input_ids"].to(device)

# Forward pass without gradient tracking; also request the per-layer hidden states
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
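
One common way to use the hidden states requested above is to average them over the sequence axis to get a single embedding per sequence. The snippet below is a minimal sketch of that idea, not the authors' documented procedure. It assumes that `outputs.hidden_states[-1]` is a tensor of shape `(batch, sequence_length, hidden_dim)`. `mean_pool` is a hypothetical helper, and a random tensor stands in for the model output so the example runs without downloading a checkpoint.

```python
import torch


def mean_pool(hidden: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence axis.

    hidden: tensor of shape (batch, seq_len, hidden_dim).
    Returns a tensor of shape (batch, hidden_dim).
    """
    if hidden.dim() != 3:
        raise ValueError("expected a (batch, seq_len, hidden_dim) tensor")
    return hidden.mean(dim=1)


# Stand-in for outputs.hidden_states[-1]; the shape here is illustrative only
# and does not reflect the real hidden size of any PlantCaduceus checkpoint.
last_hidden = torch.randn(1, 16, 8)
embedding = mean_pool(last_hidden)
print(embedding.shape)  # torch.Size([1, 8])
```

With the real model, the call would be `mean_pool(outputs.hidden_states[-1])`, assuming the shape described above. If the tokenizer adds special tokens, they are included in the average unless they are sliced off first.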

## Citation
```bibtex
@article{Zhai2024.06.04.596709,
	author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
	title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
	elocation-id = {2024.06.04.596709},
	year = {2024},
	doi = {10.1101/2024.06.04.596709},
	URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
	eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
	journal = {bioRxiv}
}
```

## Contact
Jingjing Zhai (jz963@cornell.edu)