KeeeeepGoing committed on
Commit
624e5f8
1 Parent(s): d54f6cc

Upload 8 files

README.md CHANGED
@@ -1,3 +1,74 @@
- ---
- license: cc-by-nc-sa-4.0
- ---
+ ---
+ license: cc-by-nc-sa-4.0
+ widget:
+ - text: AAAAGCGACATGACCAAACTGCCCCTCACCCGCCGCACTGATGACCGA
+ tags:
+ - DNA
+ - biology
+ - genomics
+ datasets:
+ - zhangtaolab/plant_reference_genomes
+ ---
+ # Plant foundation DNA large language models
+
+ The plant DNA large language models (LLMs) comprise a series of foundation models based on different model architectures, pre-trained on various plant reference genomes.
+ All the models have a comparable size, between 90 MB and 150 MB; a BPE tokenizer with a vocabulary of 8,000 tokens is used for tokenization.
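+
+ Note that the tokenizer shipped with this particular checkpoint is a 4-mer `EsmTokenizer` with a 267-entry vocabulary (see `tokenizer_config.json` and `vocab.txt` in this commit). A quick, illustrative way to inspect it (the repository id follows the naming used in the inference example below):
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('zhangtaolab/plant-dnamamba-4mer', trust_remote_code=True)
+ print(type(tokenizer).__name__)      # EsmTokenizer
+ print(tokenizer.model_max_length)    # 512
+ print(len(tokenizer))                # 267, matching vocab.txt
+ print(tokenizer.all_special_tokens)  # includes <unk>, <pad>, <mask>, <cls>
+ ```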
+
+ **Developed by:** zhangtaolab
+
+ ### Model Sources
+
+ - **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
+ - **Manuscript:** [Versatile applications of foundation DNA language models in plant genomes]()
+
+ ### Architecture
+
+ The model is trained based on the Mamba architecture (`MambaForCausalLM`, see `config.json` in this repository), with a modified config and a tokenizer specific to DNA sequences.
+
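+ The key hyperparameters (hidden size 768, 24 layers, state size 16, vocabulary size 267) are listed in `config.json`. As an illustrative sanity check, they can be inspected without downloading the weights:
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained('zhangtaolab/plant-dnamamba-4mer', trust_remote_code=True)
+ print(config.model_type)         # mamba
+ print(config.hidden_size)        # 768
+ print(config.num_hidden_layers)  # 24
+ print(config.vocab_size)         # 267
+ ```
+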
+ ### How to use
+
+ Install the runtime library first:
+ ```bash
+ pip install transformers
+ ```
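+
+ Optionally, on CUDA GPUs the fused Mamba kernels can be installed for faster inference; without them `transformers` falls back to a slower pure-PyTorch implementation (these packages are third-party kernel libraries, not part of this repository):
+ ```bash
+ pip install mamba-ssm causal-conv1d
+ ```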
+
+ Here is a simple code example for inference:
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_name = 'plant-dnamamba-4mer'
+ # load the model and tokenizer
+ model = AutoModelForCausalLM.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
+
+ # example sequences and tokenization
+ sequences = ['ATATACGGCCGNC', 'GGGTATCGCTTCCGAC']
+ tokens = tokenizer(sequences, padding="longest")['input_ids']
+ print(f"Tokenized sequences: {tokenizer.batch_decode(tokens)}")
+
+ # inference
+ device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
+ model.to(device)
+ inputs = tokenizer(sequences, truncation=True, padding='max_length', max_length=512,
+                    return_tensors="pt")
+ inputs = {k: v.to(device) for k, v in inputs.items()}
+ outs = model(
+     **inputs,
+     output_hidden_states=True
+ )
+
+ # final-layer hidden states and prediction logits (move to CPU before converting to NumPy)
+ embeddings = outs['hidden_states'][-1].detach().cpu().numpy()
+ logits = outs['logits'].detach().cpu().numpy()
+ ```
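+
+ If a single fixed-size embedding per sequence is needed (for example, as input to a downstream classifier), one common option is to mean-pool the final hidden states over the non-padding positions. This is only a minimal sketch continuing from the snippet above, not a method prescribed by the model card:
+ ```python
+ # continues from the snippet above (reuses `inputs` and `outs`)
+ mask = inputs['attention_mask'].unsqueeze(-1).float()          # (batch, seq_len, 1); 1 for real tokens, 0 for padding
+ hidden = outs['hidden_states'][-1]                             # (batch, seq_len, hidden_size)
+ seq_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)
+ seq_embeddings = seq_embeddings.detach().cpu().numpy()
+ print(seq_embeddings.shape)                                    # (2, 768) for this model
+ ```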
+
+ ### Training data
+ We use the causal language modeling (CausalLM) method to pre-train the model; the tokenized sequences have a maximum length of 512.
+ The detailed training procedure can be found in our manuscript.
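+
+ Because pre-training uses a causal LM objective, the model can also score how well a DNA sequence fits the training distribution. The following is only an illustrative sketch (reusing `model`, `tokenizer` and `device` from the inference example above), not the evaluation protocol of the manuscript:
+ ```python
+ # score a sequence under the causal LM objective (labels = input_ids yields the LM loss)
+ seq = 'AAAAGCGACATGACCAAACTGCCCCTCACCCGCCGCACTGATGACCGA'
+ input_ids = tokenizer(seq, truncation=True, max_length=512, return_tensors='pt')['input_ids'].to(device)
+ with torch.no_grad():
+     out = model(input_ids=input_ids, labels=input_ids)
+ print(f"LM loss: {out.loss.item():.4f}, perplexity: {torch.exp(out.loss).item():.2f}")
+ ```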
+
+ #### Hardware
+ The model was pre-trained on an NVIDIA RTX 4090 GPU (24 GB).
config.json ADDED
@@ -0,0 +1,40 @@
+ {
+   "_name_or_path": "../model/PlantDna_Mamba_4mer",
+   "architectures": [
+     "MambaForCausalLM"
+   ],
+   "bos_token_id": 0,
+   "conv_kernel": 4,
+   "d_inner": 1536,
+   "d_model": 768,
+   "eos_token_id": 0,
+   "expand": 2,
+   "fused_add_norm": true,
+   "hidden_act": "silu",
+   "hidden_size": 768,
+   "initializer_range": 0.1,
+   "intermediate_size": 1536,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "mamba",
+   "n_layer": 24,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "pad_vocab_size_multiple": 8,
+   "rescale_prenorm_residual": false,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "ssm_cfg": {},
+   "state_size": 16,
+   "time_step_floor": 0.0001,
+   "time_step_init_scheme": "random",
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 48,
+   "time_step_scale": 1.0,
+   "torch_dtype": "float32",
+   "transformers_version": "4.39.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 267
+ }
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "eos_token_id": 0,
+   "pad_token_id": 0,
+   "transformers_version": "4.39.1"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a025b5c37e02def2cfdd9dff692362d117859da760d7917c20bc30acdb7d54b3
+ size 362927464
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:215ecb2901fced644806a14f6f405c3fecdfb4a301a7ef012ea7c2d60f506919
+ size 362978706
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "cls_token": {
+     "content": "<cls>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<mask>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<cls>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<cls>",
+   "eos_token": null,
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "tokenizer_class": "EsmTokenizer",
+   "unk_token": "<unk>"
+ }
vocab.txt ADDED
@@ -0,0 +1,267 @@
1
+ <unk>
2
+ <pad>
3
+ <mask>
4
+ <cls>
5
+ AAAA
6
+ AAAT
7
+ AAAC
8
+ AAAG
9
+ AATA
10
+ AATT
11
+ AATC
12
+ AATG
13
+ AACA
14
+ AACT
15
+ AACC
16
+ AACG
17
+ AAGA
18
+ AAGT
19
+ AAGC
20
+ AAGG
21
+ ATAA
22
+ ATAT
23
+ ATAC
24
+ ATAG
25
+ ATTA
26
+ ATTT
27
+ ATTC
28
+ ATTG
29
+ ATCA
30
+ ATCT
31
+ ATCC
32
+ ATCG
33
+ ATGA
34
+ ATGT
35
+ ATGC
36
+ ATGG
37
+ ACAA
38
+ ACAT
39
+ ACAC
40
+ ACAG
41
+ ACTA
42
+ ACTT
43
+ ACTC
44
+ ACTG
45
+ ACCA
46
+ ACCT
47
+ ACCC
48
+ ACCG
49
+ ACGA
50
+ ACGT
51
+ ACGC
52
+ ACGG
53
+ AGAA
54
+ AGAT
55
+ AGAC
56
+ AGAG
57
+ AGTA
58
+ AGTT
59
+ AGTC
60
+ AGTG
61
+ AGCA
62
+ AGCT
63
+ AGCC
64
+ AGCG
65
+ AGGA
66
+ AGGT
67
+ AGGC
68
+ AGGG
69
+ TAAA
70
+ TAAT
71
+ TAAC
72
+ TAAG
73
+ TATA
74
+ TATT
75
+ TATC
76
+ TATG
77
+ TACA
78
+ TACT
79
+ TACC
80
+ TACG
81
+ TAGA
82
+ TAGT
83
+ TAGC
84
+ TAGG
85
+ TTAA
86
+ TTAT
87
+ TTAC
88
+ TTAG
89
+ TTTA
90
+ TTTT
91
+ TTTC
92
+ TTTG
93
+ TTCA
94
+ TTCT
95
+ TTCC
96
+ TTCG
97
+ TTGA
98
+ TTGT
99
+ TTGC
100
+ TTGG
101
+ TCAA
102
+ TCAT
103
+ TCAC
104
+ TCAG
105
+ TCTA
106
+ TCTT
107
+ TCTC
108
+ TCTG
109
+ TCCA
110
+ TCCT
111
+ TCCC
112
+ TCCG
113
+ TCGA
114
+ TCGT
115
+ TCGC
116
+ TCGG
117
+ TGAA
118
+ TGAT
119
+ TGAC
120
+ TGAG
121
+ TGTA
122
+ TGTT
123
+ TGTC
124
+ TGTG
125
+ TGCA
126
+ TGCT
127
+ TGCC
128
+ TGCG
129
+ TGGA
130
+ TGGT
131
+ TGGC
132
+ TGGG
133
+ CAAA
134
+ CAAT
135
+ CAAC
136
+ CAAG
137
+ CATA
138
+ CATT
139
+ CATC
140
+ CATG
141
+ CACA
142
+ CACT
143
+ CACC
144
+ CACG
145
+ CAGA
146
+ CAGT
147
+ CAGC
148
+ CAGG
149
+ CTAA
150
+ CTAT
151
+ CTAC
152
+ CTAG
153
+ CTTA
154
+ CTTT
155
+ CTTC
156
+ CTTG
157
+ CTCA
158
+ CTCT
159
+ CTCC
160
+ CTCG
161
+ CTGA
162
+ CTGT
163
+ CTGC
164
+ CTGG
165
+ CCAA
166
+ CCAT
167
+ CCAC
168
+ CCAG
169
+ CCTA
170
+ CCTT
171
+ CCTC
172
+ CCTG
173
+ CCCA
174
+ CCCT
175
+ CCCC
176
+ CCCG
177
+ CCGA
178
+ CCGT
179
+ CCGC
180
+ CCGG
181
+ CGAA
182
+ CGAT
183
+ CGAC
184
+ CGAG
185
+ CGTA
186
+ CGTT
187
+ CGTC
188
+ CGTG
189
+ CGCA
190
+ CGCT
191
+ CGCC
192
+ CGCG
193
+ CGGA
194
+ CGGT
195
+ CGGC
196
+ CGGG
197
+ GAAA
198
+ GAAT
199
+ GAAC
200
+ GAAG
201
+ GATA
202
+ GATT
203
+ GATC
204
+ GATG
205
+ GACA
206
+ GACT
207
+ GACC
208
+ GACG
209
+ GAGA
210
+ GAGT
211
+ GAGC
212
+ GAGG
213
+ GTAA
214
+ GTAT
215
+ GTAC
216
+ GTAG
217
+ GTTA
218
+ GTTT
219
+ GTTC
220
+ GTTG
221
+ GTCA
222
+ GTCT
223
+ GTCC
224
+ GTCG
225
+ GTGA
226
+ GTGT
227
+ GTGC
228
+ GTGG
229
+ GCAA
230
+ GCAT
231
+ GCAC
232
+ GCAG
233
+ GCTA
234
+ GCTT
235
+ GCTC
236
+ GCTG
237
+ GCCA
238
+ GCCT
239
+ GCCC
240
+ GCCG
241
+ GCGA
242
+ GCGT
243
+ GCGC
244
+ GCGG
245
+ GGAA
246
+ GGAT
247
+ GGAC
248
+ GGAG
249
+ GGTA
250
+ GGTT
251
+ GGTC
252
+ GGTG
253
+ GGCA
254
+ GGCT
255
+ GGCC
256
+ GGCG
257
+ GGGA
258
+ GGGT
259
+ GGGC
260
+ GGGG
261
+ A
262
+ T
263
+ C
264
+ G
265
+ N
266
+ <eos>
267
+ <bos>