---
license: apache-2.0
---

## Model Overview

PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Built on the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures with a masked language modeling objective, PlantCaduceus is designed to learn evolutionary conservation and DNA sequence grammar from 16 species spanning a history of 160 million years. We have trained a series of PlantCaduceus models with varying parameter sizes:

- **PlantCaduceus_l20**: 20 layers, 384 hidden size, 20M parameters
- **PlantCaduceus_l24**: 24 layers, 512 hidden size, 40M parameters

## How to use

```python
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer
import torch

model_path = 'kuleshov-group/PlantCaduceus_l24'
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pre-trained model and its tokenizer (custom code shipped with the checkpoint).
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True).to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Placeholder DNA sequence; replace with your own input.
sequence = "ATGCGTACGATCAGTCGATCG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False,
)
input_ids = encoding["input_ids"].to(device)
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
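
Because the forward pass above requests `output_hidden_states=True`, the returned object carries both per-token hidden states and masked-language-modeling logits. A minimal sketch of reading them out (the last-layer choice and the softmax readout here are illustrative, not prescribed by this model card):

```python
# Per-nucleotide embeddings from the final hidden layer: (batch, seq_len, hidden_size).
embeddings = outputs.hidden_states[-1]

# Masked-language-modeling scores over the tokenizer vocabulary: (batch, seq_len, vocab_size).
probs = torch.softmax(outputs.logits, dim=-1)
print(embeddings.shape, probs.shape)
```

The hidden states can serve as sequence embeddings for downstream tasks, while the softmax over the logits gives per-position nucleotide probabilities.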

## Citation
```bibtex
@article{Zhai2024.06.04.596709,
    author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
    title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
    elocation-id = {2024.06.04.596709},
    year = {2024},
    doi = {10.1101/2024.06.04.596709},
    URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
    eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
    journal = {bioRxiv}
}
```

## Contact
Jingjing Zhai (jz963@cornell.edu)