RaphaelMourad committed cf46af6 (verified) · 1 Parent(s): 47ae4be

Update README.md

Files changed (1): README.md (+60 -3)

---
license: apache-2.0
tags:
- pretrained
- mistral
- DNA
- plasmid
- biology
- genomics
---

# Model Card for Mistral-DNA-v1-138M-plasmid (Mistral for DNA)

The Mistral-DNA-v1-138M-plasmid Large Language Model (LLM) is a pretrained generative DNA text model with 17.31M parameters x 8 experts = 138.5M parameters.
It is derived from the Mistral-7B-v0.1 model, simplified for DNA: the number of layers and the hidden size were reduced.
The model was pretrained on 82,367 plasmid genomes longer than 10 kb.

For full details of this model, please read our [GitHub repo](https://github.com/raphaelmourad/Mistral-DNA).

## Model Architecture

Like Mistral-7B-v0.1, it is a transformer model with the following architecture choices (a configuration-inspection sketch follows the list):
- Grouped-Query Attention
- Sliding-Window Attention
- Byte-fallback BPE tokenizer
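
To see how far these hyperparameters were reduced relative to Mistral-7B-v0.1, you can inspect the configuration shipped with the checkpoint. This is a minimal sketch, assuming the remote config exposes the usual Mistral/Mixtral-style field names; any field that does not exist is simply reported as missing.

```python
# Illustrative only: print a few architecture fields from the checkpoint's config.
# The field names below are assumptions based on standard Mistral/Mixtral configs.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "RaphaelMourad/Mistral-DNA-v1-138M-plasmid", trust_remote_code=True
)
for field in ["num_hidden_layers", "hidden_size", "num_attention_heads",
              "num_key_value_heads", "sliding_window", "num_local_experts"]:
    print(field, "=", getattr(config, field, "not present in this config"))
```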

## Load the model from Hugging Face

```python
import torch
from transformers import AutoTokenizer, AutoModel

# The tokenizer is the same as the one used by DNABERT-2.
tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-plasmid", trust_remote_code=True)
model = AutoModel.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-plasmid", trust_remote_code=True)
```
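
As a quick sanity check on the parameter count quoted above, you can sum the parameters of the loaded model; the total should come out around 138.5M. A minimal sketch, reusing the `model` object from the block above:

```python
# Count parameters; expect roughly 138.5M.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```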

## Calculate the embedding of a DNA sequence

```python
dna = "TGATGATTGGCGCGGCTAGGATCGGCT"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
hidden_states = model(inputs)[0]  # shape: [1, sequence_length, 256]

# Embedding with max pooling over the token dimension
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expect torch.Size([256])
```
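
Max pooling is only one way to collapse the per-token hidden states into a single sequence embedding; mean pooling is a common alternative worth trying for downstream tasks. A minimal sketch, reusing `hidden_states` from the block above:

```python
# Alternative: mean pooling over the token dimension.
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # torch.Size([256])
```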

## Troubleshooting

Ensure you are using a stable release of Transformers, version 4.34.0 or newer.
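
To confirm your environment meets this requirement, a quick check (a sketch assuming the `packaging` package, which Transformers itself depends on, is installed):

```python
# Fail early if the installed Transformers is older than 4.34.0.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.34.0"), \
    f"Transformers {transformers.__version__} is too old; upgrade with: pip install -U transformers"
```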

## Notice

Mistral-DNA-v1-138M-plasmid is a pretrained base model for plasmid genomes.

## Contact

Raphaël Mourad. raphael.mourad@univ-tlse3.fr