Jingjing Zhai committed on
Commit
792da0e
1 Parent(s): 71df09f

Update README

Files changed (1)
  1. README.md +23 -3
README.md CHANGED
@@ -4,7 +4,7 @@ license: apache-2.0
 
  ## Model Overview
 
- PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Utilizing the Caduceus architecture and a masked language modeling objective, PlantCaduceus is designed to pre-train genomic sequences from 16 species spanning a history of 160 million years. We have trained a series of PlantCaduceus models with varying parameter sizes:
+ PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Utilizing the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and a masked language modeling objective, PlantCaduceus is designed to pre-train genomic sequences from 16 species spanning a history of 160 million years. We have trained a series of PlantCaduceus models with varying parameter sizes:
 
  - **PlantCaduceus_l20**: 20 layers, 384 hidden size, 20M parameters
  - **PlantCaduceus_l24**: 24 layers, 512 hidden size, 40M parameters
@@ -14,7 +14,7 @@ PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Util
  ## How to use
  ```python
  from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer
- model_path = 'maize-genetics/PlantCaduceus_l32'
+ model_path = 'kuleshov-group/PlantCaduceus_l32'
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
  model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True).to(device)
  model.eval()
@@ -30,4 +30,24 @@ encoding = tokenizer.encode_plus(
  input_ids = encoding["input_ids"].to(device)
  with torch.inference_mode():
      outputs = model(input_ids=input_ids, output_hidden_states=True)
- ```
+ ```
+
+ ## Citation
+ ```bibtex
+ @article {Zhai2024.06.04.596709,
+ author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
+ title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
+ elocation-id = {2024.06.04.596709},
+ year = {2024},
+ doi = {10.1101/2024.06.04.596709},
+ publisher = {Cold Spring Harbor Laboratory},
+ abstract = {Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation, thus expected to offer better cross-species prediction through fine-tuning on limited labeled data compared to supervised deep learning models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset consisting of 16 diverse Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize that diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide deleterious mutation identification without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.Competing Interest StatementThe authors have declared no competing interest.},
+ URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
+ eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
+ journal = {bioRxiv}
+ }
+
+ ```
+
+ ## Contact
+ Jingjing Zhai (jz963@cornell.edu)
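
The hunks above show only fragments of the README's "How to use" snippet: the `torch` import, the tokenizer setup, and the arguments to `tokenizer.encode_plus(` fall outside the diff context. The following is a minimal sketch of how the pieces might fit together, not the README's actual code. The tokenizer being loaded from the same repository, the `encode_plus` keyword arguments, the A/C/G/T/N alphabet check, and the helper names `check_sequence` and `run_plantcaduceus` are all assumptions made for illustration.

```python
from typing import Any

# Assumption: inputs are plain nucleotide strings; this alphabet is not stated in the diff.
VALID_BASES = set("ACGTN")


def check_sequence(sequence: str) -> str:
    """Strip surrounding whitespace and reject characters outside A/C/G/T/N.

    Hypothetical helper. Letter case is left untouched because the diff does
    not say how the tokenizer treats lowercase bases.
    """
    cleaned = sequence.strip()
    unexpected = set(cleaned.upper()) - VALID_BASES
    if not cleaned or unexpected:
        raise ValueError(f"not a plain DNA sequence: {sorted(unexpected)}")
    return cleaned


def run_plantcaduceus(sequence: str,
                      model_path: str = "kuleshov-group/PlantCaduceus_l32") -> Any:
    """Run one forward pass and return the raw model outputs (hypothetical wrapper)."""
    # Heavy dependencies are imported here so check_sequence works without them.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = AutoModelForMaskedLM.from_pretrained(
        model_path, trust_remote_code=True
    ).to(device)
    model.eval()

    # Assumption: the tokenizer is published in the same repository as the model.
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Assumption: these keyword arguments are illustrative; the diff elides them.
    encoding = tokenizer.encode_plus(
        check_sequence(sequence),
        return_tensors="pt",
        return_attention_mask=False,
        return_token_type_ids=False,
    )
    input_ids = encoding["input_ids"].to(device)
    with torch.inference_mode():
        outputs = model(input_ids=input_ids, output_hidden_states=True)
    return outputs
```

Calling `run_plantcaduceus("ATGCGTACGATCGTAG")` would download the checkpoint on first use, so it needs network access and the `torch` and `transformers` packages. `trust_remote_code=True` executes code from the model repository, so it is worth reviewing that repository before running the sketch.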