Jingjing Zhai committed on
Commit
f3a7ed0
1 Parent(s): 792da0e

Update README

Files changed (1)
  1. README.md +1 -4
README.md CHANGED
@@ -4,7 +4,7 @@ license: apache-2.0
 
  ## Model Overview
 
- PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Utilizing the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and a masked language modeling objective, PlantCaduceus is designed to pre-train genomic sequences from 16 species spanning a history of 160 million years. We have trained a series of PlantCaduceus models with varying parameter sizes:
+ PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Utilizing the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and a masked language modeling objective, PlantCaduceus is designed to learn evolutionary conservation and DNA sequence grammar from 16 species spanning a history of 160 million years. We have trained a series of PlantCaduceus models with varying parameter sizes:
 
  - **PlantCaduceus_l20**: 20 layers, 384 hidden size, 20M parameters
  - **PlantCaduceus_l24**: 24 layers, 512 hidden size, 40M parameters
@@ -40,13 +40,10 @@ with torch.inference_mode():
  elocation-id = {2024.06.04.596709},
  year = {2024},
  doi = {10.1101/2024.06.04.596709},
- publisher = {Cold Spring Harbor Laboratory},
- abstract = {Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation, thus expected to offer better cross-species prediction through fine-tuning on limited labeled data compared to supervised deep learning models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset consisting of 16 diverse Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize that diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide deleterious mutation identification without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.Competing Interest StatementThe authors have declared no competing interest.},
  URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
  eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
  journal = {bioRxiv}
  }
-
  ```
 
  ## Contact