Jingjing Zhai committed
Commit d62653f
1 Parent(s): e6b01d1

Brief description of PlantCaduceus

Files changed (1): README.md (+30, -0)
README.md CHANGED
---
license: apache-2.0
---

## Model Overview

PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes spanning 160 million years of evolutionary history. It uses the Caduceus architecture with a masked language modeling objective. We have trained a series of PlantCaduceus models with varying parameter sizes:
- **PlantCaduceus_l20**: 20 layers, 384 hidden size, 20M parameters
- **PlantCaduceus_l24**: 24 layers, 512 hidden size, 40M parameters
- **PlantCaduceus_l28**: 28 layers, 768 hidden size, 112M parameters
- **PlantCaduceus_l32**: 32 layers, 1024 hidden size, 225M parameters
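The masked language modeling objective mentioned above can be illustrated with a small, torch-only sketch. Everything here is illustrative: the vocabulary size, mask token id, and random logits stand in for PlantCaduceus's actual tokenizer and predictions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 8, 16        # illustrative sizes, not the real vocabulary
mask_token_id = 4                  # hypothetical [MASK] token id

input_ids = torch.randint(0, 4, (1, seq_len))  # nucleotides encoded as 0..3
mask = torch.rand(1, seq_len) < 0.15           # mask ~15% of positions
mask[0, 0] = True                              # ensure at least one masked position

masked_ids = input_ids.clone()
masked_ids[mask] = mask_token_id

labels = input_ids.clone()
labels[~mask] = -100               # loss is computed only at masked positions

# Random logits stand in for model(masked_ids).logits
logits = torch.randn(1, seq_len, vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```

The model is trained to reconstruct the original nucleotides at the masked positions, which is what makes the learned representations useful for downstream genomic tasks.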

## How to use
```python
import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer

model_path = 'maize-genetics/PlantCaduceus_l24'
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pre-trained model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True).to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize a DNA sequence
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)

# Forward pass, requesting hidden states for downstream use
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
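The hidden states returned by the example above can be pooled into a fixed-length embedding for downstream tasks. A minimal sketch, using a random tensor in place of `outputs.hidden_states[-1]` (the 512 hidden size matches the PlantCaduceus_l24 entry in the table above; the actual tensor shape may differ depending on the model configuration):

```python
import torch

# Stand-in for outputs.hidden_states[-1]: (batch, seq_len, hidden_size)
hidden = torch.randn(1, 16, 512)

# Mean-pool over sequence positions to get one embedding per sequence
embedding = hidden.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 512])
```

Mean pooling is one common choice; depending on the task, per-position hidden states can also be used directly, e.g. for nucleotide-level classification.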