zhihan1996 committed on
Commit
5fd206e
1 Parent(s): 6041066

Create README.md

Files changed (1)
  1. README.md +33 -0
README.md ADDED
@@ -0,0 +1,33 @@
+ ---
+ metrics:
+ - matthews_correlation
+ - f1
+ tags:
+ - biology
+ - medical
+ ---
+ DNABERT-2 is a transformer-based genome foundation model trained on multi-species genomes.
+
+ To load the model from Hugging Face:
+ ```
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
+ model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
+ ```
+
+ To calculate the embedding of a DNA sequence:
+ ```
+ dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
+ inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
+ hidden_states = model(inputs)[0]  # [1, sequence_length, 768]
+
+ # embedding with mean pooling over the token dimension
+ embedding_mean = torch.mean(hidden_states[0], dim=0)
+ print(embedding_mean.shape)  # expected: torch.Size([768])
+
+ # embedding with max pooling over the token dimension
+ embedding_max = torch.max(hidden_states[0], dim=0)[0]
+ print(embedding_max.shape)  # expected: torch.Size([768])
+ ```
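
The README's pooling example handles a single sequence, so every position is a real token. If several sequences of different lengths were embedded in one batch, the shorter ones would be padded, and a plain `torch.mean` or `torch.max` would mix padding vectors into the result. The following is a minimal sketch of mask-aware pooling, not part of the committed README. The helper names `masked_mean_pool` and `masked_max_pool` are hypothetical, and random tensors stand in for the model output so the snippet runs without downloading weights. With the real model, one would presumably obtain `attention_mask` from `tokenizer(list_of_sequences, return_tensors="pt", padding=True)`; whether the model's remote code accepts that mask in its forward call is an assumption to verify.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states:  [batch, seq_len, hidden]
    attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [batch, seq_len, 1]
    summed = (hidden_states * mask).sum(dim=1)                   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1.0)                      # guard against division by zero
    return summed / counts


def masked_max_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Take the per-dimension maximum over real tokens only."""
    mask = attention_mask.unsqueeze(-1).bool()                   # [batch, seq_len, 1]
    # Padded positions are set to -inf so they can never win the max.
    filled = hidden_states.masked_fill(~mask, float("-inf"))
    return filled.max(dim=1).values                              # [batch, hidden]


# Stand-in for model output: 2 sequences, 5 positions, hidden size 768.
torch.manual_seed(0)
hidden_states = torch.randn(2, 5, 768)
# The second sequence has 3 real tokens followed by 2 padding positions.
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

embedding_mean = masked_mean_pool(hidden_states, attention_mask)
embedding_max = masked_max_pool(hidden_states, attention_mask)
print(embedding_mean.shape)  # torch.Size([2, 768])
print(embedding_max.shape)   # torch.Size([2, 768])
```

For a single unpadded sequence, both helpers reduce to the `torch.mean` and `torch.max` calls in the README. A sequence whose mask is all zeros would yield `-inf` from `masked_max_pool`, so such inputs are best filtered out beforehand.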