---
license: apache-2.0
tags:
- pretrained
- mistral
- DNA
- biology
- genomics
---

# Model Card for mixtral-dna-yeast-v0.2 (mistral for DNA)

The mixtral-dna-yeast-v0.2 Large Language Model (LLM) is a pretrained generative DNA text model with 17.31M parameters x 8 experts = 138.5M parameters in total.
It is derived from the Mistral-7B-v0.1 model, which was simplified for DNA: the number of layers and the hidden size were reduced.
The model was pretrained on around 1,000 yeast genomes, split into 10 kb DNA sequences.

The yeast genomes are from: https://www.nature.com/articles/s41586-018-0030-5

For full details of this model, please read our [GitHub repo](https://github.com/raphaelmourad/Mistral-DNA).

## Model Architecture

Like Mistral-7B-v0.1, it is a transformer model with the following architecture choices:
- Grouped-Query Attention
- Sliding-Window Attention
- Byte-fallback BPE tokenizer

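The exact layer count, hidden size, and attention hyperparameters live in the checkpoint's configuration. As a minimal sketch for inspecting them (assuming the config exposes the usual Mistral/Mixtral field names such as `num_hidden_layers` and `sliding_window`):

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint.
config = AutoConfig.from_pretrained("RaphaelMourad/mixtral-dna-yeast-v0.2", trust_remote_code=True)

# Field names assume standard Mistral/Mixtral conventions; print(config) if any are missing.
print(config.num_hidden_layers)    # number of transformer layers (reduced vs Mistral-7B)
print(config.hidden_size)          # hidden size (256, judging from the embedding example below)
print(config.num_key_value_heads)  # fewer KV heads than attention heads = Grouped-Query Attention
print(config.sliding_window)       # attention window size for Sliding-Window Attention
```
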
## Load the model from Hugging Face

```python
import torch
from transformers import AutoTokenizer, AutoModel

# The tokenizer is the same as DNABERT2's.
tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/mixtral-dna-yeast-v0.2", trust_remote_code=True)
model = AutoModel.from_pretrained("RaphaelMourad/mixtral-dna-yeast-v0.2", trust_remote_code=True)
```

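Because the tokenizer is byte-fallback BPE rather than a fixed k-mer scheme, it can be instructive to see how it segments a sequence. A minimal sketch (the exact pieces depend on the learned vocabulary, so the output is illustrative):

```python
# Inspect how the BPE tokenizer splits a DNA string into variable-length pieces.
dna = "TGATGATTGGCGCGGCTAGGATCGGCT"
print(tokenizer.tokenize(dna))      # subword pieces, not fixed k-mers
print(tokenizer(dna)["input_ids"])  # corresponding token ids
```
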
## Calculate the embedding of a DNA sequence

```python
dna = "TGATGATTGGCGCGGCTAGGATCGGCT"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
hidden_states = model(inputs)[0]  # shape [1, sequence_length, 256]

# Embedding with max pooling over the sequence dimension.
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expected: torch.Size([256])
```

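Max pooling is one option; mean pooling over the tokens is a common alternative that weighs every position equally (this variant is a suggestion, not part of the original recipe):

```python
# Embedding with mean pooling over the sequence dimension.
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # torch.Size([256])
```
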
## Troubleshooting

Ensure you are using a stable version of Transformers, 4.34.0 or newer.

## Notice

Mistral-DNA is a pretrained base model for DNA.

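As a base model it ships without task heads; one way to use it for a downstream task is to train a small classifier on the pooled embeddings. A minimal sketch (the 256-dim input matches the embedding above; the two-class head and its sizes are hypothetical placeholders):

```python
import torch.nn as nn

# Hypothetical classifier on top of the pooled 256-dim embedding from above.
classifier = nn.Sequential(
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, 2),  # two classes as an example; adjust to your task
)
logits = classifier(embedding_max.unsqueeze(0))  # shape [1, 2]
```
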
## Contact

Raphaël Mourad. raphael.mourad@univ-tlse3.fr