kuleshov-group/PlantCaduceus_l24

Feature Extraction

Jingjing Zhai committed on Jun 7

Commit: c0420f1 (1 parent: d62653f)

Update README

Files changed (1)

README.md +21 -3

@@ -4,7 +4,7 @@ license: apache-2.0
 ## Model Overview
-PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Utilizing the Caduceus architecture and a masked language modeling objective, PlantCaduceus is designed to pre-train genomic sequences from 16 species spanning a history of 160 million years. We have trained a series of PlantCaduceus models with varying parameter sizes:
+PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Utilizing the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and a masked language modeling objective, PlantCaduceus is designed to learn evolutionary conservation and DNA sequence grammar from 16 species spanning a history of 160 million years. We have trained a series of PlantCaduceus models with varying parameter sizes:
 - **PlantCaduceus_l20**: 20 layers, 384 hidden size, 20M parameters
 - **PlantCaduceus_l24**: 24 layers, 512 hidden size, 40M parameters
@@ -14,7 +14,8 @@ PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Util
 ## How to use
 ```python
 from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer
-model_path = 'maize-genetics/PlantCaduceus_l24'
+import torch
+model_path = 'kuleshov-group/PlantCaduceus_l24'
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True).to(device)
 model.eval()
@@ -30,4 +31,21 @@ encoding = tokenizer.encode_plus(
 input_ids = encoding["input_ids"].to(device)
 with torch.inference_mode():
     outputs = model(input_ids=input_ids, output_hidden_states=True)
-```
+```
+## Citation
+```bibtex
+@article {Zhai2024.06.04.596709,
+	author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
+	title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
+	elocation-id = {2024.06.04.596709},
+	year = {2024},
+	doi = {10.1101/2024.06.04.596709},
+	URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
+	eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
+	journal = {bioRxiv}
+}
+```
+## Contact
+Jingjing Zhai (jz963@cornell.edu)