melissasanabria committed
Commit f6ed259 · 1 Parent(s): 6efc18c
Update README.md
README.md CHANGED
@@ -9,7 +9,6 @@ This is the official pre-trained model introduced in [DNA language model GROVER
 
 
 from transformers import AutoTokenizer, AutoModelForMaskedLM
-import torch
 
 # Import the tokenizer and the model
 tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
@@ -17,7 +16,7 @@ This is the official pre-trained model introduced in [DNA language model GROVER
 
 
 Some preliminary analysis shows that sequence re-tokenization using Byte Pair Encoding (BPE) changes significantly if the sequence is less than 50 nucleotides long. Longer than 50 nucleotides, you should still be careful with sequence edges.
-We advice to add 100 nucleotides at the beginning and end of every sequence in order to
+We advice to add 100 nucleotides at the beginning and end of every sequence in order to guarantee that your sequence is represented with the same tokens as the original tokenization.
 We also provide the tokenized chromosomes with their respective nucleotide mappers (They are available in the folder tokenized chromosomes).
 
 ### BibTeX entry and citation info
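
The README line completed by this commit recommends adding 100 nucleotides of context on each side of a sequence before tokenizing it. A minimal sketch of that padding step follows, assuming the surrounding chromosome is available as a plain Python string. The helper name `extract_with_flanks` and its parameters are hypothetical and not part of the GROVER repository; only the flank size of 100 comes from the README.

```python
from typing import Tuple


def extract_with_flanks(
    chromosome: str, start: int, end: int, flank: int = 100
) -> Tuple[str, int, int]:
    """Return chromosome[start:end] with up to `flank` nucleotides on each side.

    Hypothetical helper for illustration. The flanks are clipped at the
    chromosome ends, so fewer than `flank` nucleotides may be added there.
    Also returns the start and end offsets of the region of interest
    inside the padded string.
    """
    if not 0 <= start <= end <= len(chromosome):
        raise ValueError("expected 0 <= start <= end <= len(chromosome)")
    left = max(0, start - flank)
    right = min(len(chromosome), end + flank)
    padded = chromosome[left:right]
    return padded, start - left, end - left


# Toy 400-nucleotide "chromosome" used only to demonstrate the slicing.
toy_chromosome = "ACGT" * 100
padded, roi_start, roi_end = extract_with_flanks(toy_chromosome, 150, 200)

# The padded string could then be passed to the tokenizer loaded in the
# README snippet, e.g. tokenizer(padded); that call is left out here so
# the sketch runs without downloading the model.
```

The returned offsets are in nucleotides, not tokens. Working out which BPE tokens cover the region of interest is a separate step that this sketch does not attempt.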