Commit 9b6c4ab by hdallatorre
1 Parent(s): 941645b
Update README.md
README.md
CHANGED
@@ -8,9 +8,9 @@ tags:
 - genomics
 - segmentation
 ---
-# segment-nt
+# segment-nt
 
-Segment-NT
+Segment-NT is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
 elements in a sequence at a single nucleotide resolution. It was trained on 14 different classes of human genomics elements in input sequences up to 30kb. These
 include gene (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory (polyA signal, tissue-invariant and
 tissue-specific promoters and enhancers, and CTCF-bound sites) elements.
@@ -63,8 +63,8 @@ features = [
 "promoter_Tissue_invariant",
 ]
 
-tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/
-model = AutoModel.from_pretrained("InstaDeepAI/
+tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
+model = AutoModel.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
 
 # Choose the length to which the input sequences are padded. By default, the
 # model max length is chosen, but feel free to decrease it as the time taken to
@@ -106,7 +106,7 @@ print(f"Intron probabilities shape: {probabilities_intron.shape}")
 
 ## Training data
 
-The **segment-nt
+The **segment-nt** model was trained on all human chromosomes except for chromosomes 20 and 21, kept as test set, and chromosome 22, used as a validation set.
 During training, sequences are randomly sampled in the genome with associated annotations. However, we keep the sequences in the validation and test set fixed by
 using a sliding window of length 30,000 over the chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
 
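
The "Training data" text describes keeping the evaluation sequences fixed by sliding a 30,000-nucleotide window over held-out chromosomes. A minimal sketch of how such fixed windows could be enumerated is shown here. The README does not state the stride or how a trailing partial window is handled, so this sketch assumes non-overlapping windows (stride equal to the window length) and drops any remainder shorter than the window. The function name `sliding_windows` and the example chromosome length are hypothetical and are not taken from the model's code.

```python
from typing import Iterator, Optional, Tuple


def sliding_windows(
    chrom_length: int,
    window: int = 30_000,
    stride: Optional[int] = None,
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) coordinates of fixed windows.

    Assumption: stride defaults to the window length (no overlap) and a
    trailing region shorter than `window` is skipped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if stride is None:
        stride = window
    if stride <= 0:
        raise ValueError("stride must be positive")
    # The last valid start is chrom_length - window, inclusive.
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window


# Hypothetical 100,000-nucleotide sequence, not a real chromosome length.
windows = list(sliding_windows(100_000))
print(windows)
```

Because the coordinates depend only on the sequence length, window, and stride, the same evaluation windows are produced on every run, which is what makes the validation and test sequences fixed while training sequences are sampled at random.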