update readme
README.md
CHANGED
@@ -1,3 +1,138 @@
|
|
1 |
---
|
2 |
-
license:
|
3 |
---
|
1 |
---
|
2 |
+
license: bsd-3-clause
|
3 |
---
|
4 |
+
|
5 |
+
# HyenaDNA
|
6 |
+
|
7 |
+
Welcome! HyenaDNA is a genomic foundation model pretrained on the human reference genome with sequence lengths of up to **1 million tokens** at **single nucleotide resolution**.
|
8 |
+
|
9 |
+
See below for an [overview](#model) of the model and training. Better yet, check out these resources.
|
10 |
+
|
11 |
+
**Check out our other resources:**
|
12 |
+
|
13 |
+
- [arxiv](https://arxiv.org/abs/2302.10866) (placeholder)
|
14 |
+
- [blog](https://hazyresearch.stanford.edu/blog/2023-03-07-hyena) (placeholder)
|
15 |
+
- [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL#scrollTo=dgkZggOetkbR)
|
16 |
+
- [github](https://github.com/HazyResearch/hyena-dna)
|
17 |
+
|
18 |
+
|
19 |
+
**Links to all HuggingFace models:**
|
20 |
+
|
21 |
+
- [tiny-1k](https://huggingface.co/LongSafari/hyenadna-tiny-1k-seqlen)
|
22 |
+
- [small-32k](https://huggingface.co/LongSafari/hyenadna-small-32k-seqlen/tree/main)
|
23 |
+
- [medium-160k](https://huggingface.co/LongSafari/hyenadna-medium-160k-seqlen/tree/main)
|
24 |
+
- [medium-450k](https://huggingface.co/LongSafari/hyenadna-medium-450k-seqlen/tree/main)
|
25 |
+
- [large-1m](https://huggingface.co/LongSafari/hyenadna-large-1m-seqlen)
|
26 |
+
|
27 |
+
|
28 |
+
### Sample snippet
|
29 |
+
|
30 |
+
|
31 |
+
This code example lets you select which pretrained model to load from HuggingFace and do inference to get embeddings.
|
32 |
+
|
33 |
+
See the `huggingface.py` script in the main [github](https://github.com/HazyResearch/hyena-dna), or the [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL#scrollTo=dgkZggOetkbR) for these classes.
|
34 |
+
|
35 |
+
|
36 |
+
```python
|
37 |
+
|
38 |
+
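import torch

# HyenaDNAPreTrainedModel and CharacterTokenizer come from `huggingface.py`
# in the GitHub repo (or from the colab); define or import them first.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# instantiate pretrained model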
|
39 |
+
pretrained_model_name = 'hyenadna-medium-450k-seqlen'
|
40 |
+
max_length = 450_000
|
41 |
+
|
42 |
+
model = HyenaDNAPreTrainedModel.from_pretrained(
|
43 |
+
    './checkpoints',
|
44 |
+
    pretrained_model_name,
|
45 |
+
)
|
46 |
+
|
47 |
+
# create tokenizer
|
48 |
+
tokenizer = CharacterTokenizer(
|
49 |
+
    characters=['A', 'C', 'G', 'T', 'N'],  # add DNA characters
|
50 |
+
    model_max_length=max_length,
|
51 |
+
)
|
52 |
+
|
53 |
+
# create a sample
|
54 |
+
sequence = 'ACTG' * int(max_length/4)
|
55 |
+
tok_seq = tokenizer(sequence)["input_ids"]
|
56 |
+
|
57 |
+
# place on device, convert to tensor
|
58 |
+
tok_seq = torch.LongTensor(tok_seq).unsqueeze(0).to(device) # unsqueeze for batch dim
|
59 |
+
|
60 |
+
# prep model and forward
|
61 |
+
model.to(device)
|
62 |
+
|
63 |
+
with torch.inference_mode():
|
64 |
+
    embeddings = model(tok_seq)
|
65 |
+
|
66 |
+
print(embeddings.shape) # embeddings here!
|
67 |
+
|
68 |
+
|
69 |
+
```
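
The snippet above hard-codes `max_length` next to the model name. One way to keep the two in sync is a small lookup table, sketched below. The helper `max_length_for` is hypothetical, and the lengths are assumptions inferred from the checkpoint names, so check them against the colab before relying on them.

```python
# Assumed context lengths, inferred from the checkpoint names (not confirmed here).
MAX_LENGTHS = {
    "hyenadna-tiny-1k-seqlen": 1_024,
    "hyenadna-small-32k-seqlen": 32_768,
    "hyenadna-medium-160k-seqlen": 160_000,
    "hyenadna-medium-450k-seqlen": 450_000,
    "hyenadna-large-1m-seqlen": 1_000_000,
}


def max_length_for(pretrained_model_name: str) -> int:
    """Return the assumed maximum sequence length for a checkpoint name."""
    try:
        return MAX_LENGTHS[pretrained_model_name]
    except KeyError:
        known = ", ".join(sorted(MAX_LENGTHS))
        raise ValueError(
            f"unknown model {pretrained_model_name!r}; expected one of: {known}"
        ) from None


pretrained_model_name = "hyenadna-medium-450k-seqlen"
max_length = max_length_for(pretrained_model_name)
```

Failing loudly on an unknown name avoids silently building a tokenizer with the wrong `model_max_length`.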
|
70 |
+
|
71 |
+
### How to use pretrained weights
|
72 |
+
|
73 |
+
- [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL#scrollTo=dgkZggOetkbR)
|
74 |
+
|
75 |
+
The colab is the easiest entry point. You can fine-tune a small model and run inference on DNA sequences of up to 450k nucleotides on the free tier (T4 GPU), or up to 1 million on the paid tier (A100). It handles all the HuggingFace integration for you, so it's a helpful reference for how that works.
|
76 |
+
|
77 |
+
- [github](https://github.com/HazyResearch/hyena-dna)
|
78 |
+
|
79 |
+
Otherwise, check out the main HyenaDNA repo for how to load weights into PyTorch Lightning. We use PyTorch Lightning for pretraining and fine-tuning most of our models. If you want to use our actual pretraining code, you can clone this HuggingFace repo to download the actual `weights.ckpt`, and then pass it to that code.
|
80 |
+
|
81 |
+
|
82 |
+
If you want a standalone version that's easy to port into your own code, we have that and a HuggingFace example in the repo too, under `huggingface.py`.
|
83 |
+
|
84 |
+
|
85 |
+
## Model & Training Overview
|
86 |
+
<a name="model"></a>
|
87 |
+
|
88 |
+
HyenaDNA uses a simple stack of [Hyena](https://arxiv.org/abs/2302.10866) operators, which are a subquadratic drop-in replacement for attention in Transformers. The Hyena operator matches attention's quality in language modeling by using modified input projections, implicit convolutions, and gating, all of which are subquadratic operations.
|
89 |
+
|
90 |
+
This enables HyenaDNA to reach context lengths of up to 500x longer than previous genomic Transformer models using dense attention, and train 160x faster at sequence length 1M (compared to Flash Attention).
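
To make the subquadratic claim concrete, here is a minimal sketch of the primitive behind long convolutions: a causal convolution computed with an FFT in O(L log L) instead of the O(L²) double loop. This is not the Hyena operator itself (it omits the input projections, the implicit filter parameterization, and the gating), and the function names are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two (unnormalized)."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def causal_conv_fft(signal, kernel):
    """y[t] = sum_{s<=t} kernel[s] * signal[t-s]; assumes len(kernel) <= len(signal)."""
    length = len(signal)
    size = 1
    while size < 2 * length:  # zero-pad so circular convolution equals linear
        size *= 2
    f_signal = fft(list(signal) + [0.0] * (size - length))
    f_kernel = fft(list(kernel) + [0.0] * (size - len(kernel)))
    product = [a * b for a, b in zip(f_signal, f_kernel)]
    full = fft(product, inverse=True)
    return [(value / size).real for value in full[:length]]


def causal_conv_direct(signal, kernel):
    """Same result with the O(L^2) double loop, for comparison."""
    return [
        sum(kernel[s] * signal[t - s] for s in range(min(t, len(kernel) - 1) + 1))
        for t in range(len(signal))
    ]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.125, 0.0625]  # as long as the input: a global receptive field
fast = causal_conv_fft(signal, kernel)
slow = causal_conv_direct(signal, kernel)
```

Both functions return the same values up to floating-point error; only the cost differs as the sequence length grows.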
|
91 |
+
|
92 |
+
We use a single-character tokenizer with a primary vocab of 4 nucleotides (plus special tokens), which gives the model single nucleotide resolution, a first in genomic foundation models. In addition, the implicit long convolution enables a **global receptive field** at each layer.
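
As a rough illustration of single nucleotide resolution, the sketch below maps every character to exactly one token id. The vocabulary and ids are hypothetical and do not match the real `CharacterTokenizer`, which also adds its own special tokens.

```python
# Hypothetical vocabulary: ids are illustrative, not the real CharacterTokenizer's.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "A": 2, "C": 3, "G": 4, "T": 5, "N": 6}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(sequence: str) -> list[int]:
    """Map each nucleotide to one id; unknown characters become [UNK]."""
    return [VOCAB.get(base, VOCAB["[UNK]"]) for base in sequence.upper()]


def decode(token_ids: list[int]) -> str:
    """Invert encode() for ids that are in the vocabulary."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


ids = encode("ACGTN")
```

Because there is no k-mer or byte-pair merging, the token count equals the sequence length, which is why long context matters so much here.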
|
93 |
+
|
94 |
+
We pretrain using next token (nucleotide) prediction on the human reference genome (HG38).
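
Next nucleotide prediction can be sketched as shifting the token sequence by one position and scoring the model's predicted distribution against the shifted targets. The code below is an illustrative sketch with hypothetical token ids and a uniform stand-in for the model; it is not the training code from the repo.

```python
import math


def next_token_pairs(token_ids):
    """Split one tokenized sequence into inputs and targets shifted by one."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens")
    return token_ids[:-1], token_ids[1:]


def mean_cross_entropy(predicted_probs, targets):
    """predicted_probs[t][v] is the probability given to token id v at position t."""
    losses = [-math.log(probs[target]) for probs, target in zip(predicted_probs, targets)]
    return sum(losses) / len(losses)


tokens = [0, 1, 2, 3, 0, 1]  # hypothetical ids for A, C, G, T, A, C
inputs, targets = next_token_pairs(tokens)

# Stand-in for a model: a uniform guess over the 4 nucleotides at every position.
uniform = [[0.25] * 4 for _ in inputs]
baseline_loss = mean_cross_entropy(uniform, targets)
```

The uniform guess gives a loss of ln 4 per position, a useful reference point: a pretrained model should score below it.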
|
95 |
+
|
96 |
+
HyenaDNA sets a new SotA on 23 downstream tasks, including predicting regulatory elements, chromatin profiles, and species classification. We also explore what new capabilities open up with long context in genomics, including the first use of in-context learning with tunable soft prompt tokens and instruction fine-tuning.
|
97 |
+
|
98 |
+
|
99 |
+
Check out our [blog](https://hazyresearch.stanford.edu/blog/2023-03-07-hyena) for more details on HyenaDNA!
|
100 |
+
|
101 |
+
### Authors
|
102 |
+
|
103 |
+
Eric Nguyen*, Michael Poli*, Marjan Faizi*, Armin Thomas, Callum Birch Sykes, Michael Wornow, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, Stefano Ermon, Stephen Baccus, Chris Re.
|
104 |
+
|
105 |
+
**Contact**
|
106 |
+
|
107 |
+
Eric Nguyen, etnguyen@stanford.edu
|
108 |
+
Michael Poli, poli@stanford.edu
|
109 |
+
Marjan Faizi, Marjan_Faizi@hms.harvard.edu
|
110 |
+
|
111 |
+
|
112 |
+
## Citation
|
113 |
+
|
114 |
+
|
115 |
+
If you use HyenaDNA in your work, feel free to cite us :)
|
116 |
+
|
117 |
+
```
|
118 |
+
|
119 |
+
|
120 |
+
|
121 |
+
|
122 |
+
|
123 |
+
```