Rocketknight1 HF staff committed on
Commit
5990e4b
1 Parent(s): fc3d617

Create README.md

  1. README.md +146 -0
README.md ADDED
@@ -0,0 +1,146 @@
---
license: bsd-3-clause
tags:
- dna
- biology
- genomics
- hyena
---

# HyenaDNA

Welcome! HyenaDNA is a long-range genomic foundation model pretrained on context lengths of up to **1 million tokens** at **single nucleotide resolution**.

See below for an [overview](#model) of the model and training. Better yet, check out these resources.

**Resources:**

- [arxiv](https://arxiv.org/abs/2306.15794)
- [blog](https://hazyresearch.stanford.edu/blog/2023-06-29-hyena-dna)
- [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)
- [github](https://github.com/HazyResearch/hyena-dna)


**Links to all HuggingFace models:**

We've uploaded a [collection](https://huggingface.co/collections/LongSafari/hyenadna-models-654d0cbbe113b04ba5a0f638) of all the pretrained HyenaDNA checkpoints.

You'll see models of different sizes and sequence lengths. There are also original weights-only versions of each model in the [LongSafari organization](https://huggingface.co/LongSafari), which are designed to be loaded with the original [github](https://github.com/HazyResearch/hyena-dna) repo. These models produce the same outputs as the models in the collection above; only the interfaces differ.

See [GPU requirements](#hardware) for each model.

### Using HyenaDNA


In this brief code sample, we demonstrate fine-tuning HyenaDNA on a sequence classification task. This sample uses the `medium` checkpoint, with a maximum sequence length of 160k nucleotides. Note that training will fail if you use a sequence length longer than the maximum supported length for your chosen checkpoint.

In testing, we have been able to train at sequence lengths of up to about 250k nucleotides on a Colab T4 GPU (16GB VRAM). Longer sequences will require more memory.

```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer
import torch

# Instantiate the pretrained model and its tokenizer
checkpoint = 'LongSafari/hyenadna-medium-160k-seqlen-hf'
max_length = 160_000

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# bfloat16 for better speed and reduced memory usage
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Generate dummy sequences and labels
# If you're copying this code, replace the sequences and labels
# here with your own data!
sequence = 'ACTG' * (max_length // 4)
sequence = [sequence] * 8  # Create 8 identical samples
tokenized = tokenizer(sequence)["input_ids"]
labels = [0, 1] * 4

# Create a dataset for training
ds = Dataset.from_dict({"input_ids": tokenized, "labels": labels})
ds.set_format("pt")

# Initialize Trainer
# Note that we're using extremely small batch sizes to maximize
# our ability to fit long sequences in memory!
args = {
    "output_dir": "tmp",
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": True,
    "learning_rate": 2e-5,
}
training_args = TrainingArguments(**args)

trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()

print(result)

# Now we can save_pretrained() or push_to_hub() to share the trained model!
```
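
Since training fails on sequences longer than the checkpoint's maximum, it can help to check your data before tokenizing. The following is a minimal sketch, not part of the HyenaDNA code: `find_too_long` is a hypothetical helper, and it counts raw nucleotides, so if your tokenizer adds special tokens you may want to compare token counts instead.

```python
def find_too_long(sequences, max_length):
    """Return the indices of sequences longer than max_length nucleotides."""
    return [i for i, seq in enumerate(sequences) if len(seq) > max_length]


# 160k matches the medium checkpoint used in the sample above.
max_length = 160_000
sequences = [
    "ACTG" * (max_length // 4),        # exactly at the limit
    "ACTG" * (max_length // 4) + "A",  # one nucleotide too long
]

too_long = find_too_long(sequences, max_length)
if too_long:
    print(f"Sequences exceeding {max_length} nucleotides: {too_long}")
```

Whether to truncate, split, or drop the flagged sequences depends on your task.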

You may also find these [notebooks](https://huggingface.co/docs/transformers/notebooks) useful. Although they're not specific to HyenaDNA, they contain additional examples of training DNA and sequence classification models.

- [How to fine-tune a Nucleotide Transformer model](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb)
- [How to fine-tune a model on text classification](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)

### GPU requirements (suggested)
<a name="hardware"></a>

Here are our suggestions for the hardware (preferred minimum) to use with each model.

Each entry lists the suggested GPU for (pretraining, fine-tuning, inference):

- [tiny-1k](https://huggingface.co/LongSafari/hyenadna-tiny-1k-seqlen/tree/main): (T4, T4, T4)
- [small-32k](https://huggingface.co/LongSafari/hyenadna-small-32k-seqlen/tree/main): (A100-40GB, T4, T4)
- [medium-160k](https://huggingface.co/LongSafari/hyenadna-medium-160k-seqlen/tree/main): (A100-40GB, T4, T4)
- [medium-450k](https://huggingface.co/LongSafari/hyenadna-medium-450k-seqlen/tree/main): (A100-40GB, A100-40GB, T4)
- [large-1m](https://huggingface.co/LongSafari/hyenadna-large-1m-seqlen/tree/main): (A100-80GB, A100-80GB, A100-40GB)


## Model & Training Overview
<a name="model"></a>

HyenaDNA uses a simple stack of [Hyena](https://arxiv.org/abs/2302.10866) operators, which are a subquadratic drop-in replacement for attention in Transformers. The Hyena operator is able to match attention's quality in language modeling by using modified input projections, implicit convolutions, and gating, all of which are subquadratic operations.

This enables HyenaDNA to reach context lengths up to 500x longer than previous genomic Transformer models that use dense attention, and to train 160x faster at sequence length 1M (compared to Flash Attention).
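
To give a feel for why a long convolution can be subquadratic, here is a minimal pure-Python sketch of a causal convolution evaluated with the FFT in O(L log L), followed by elementwise gating. It is an illustration under stated assumptions, not the HyenaDNA implementation: the function names are hypothetical, the filter is given explicitly rather than parameterized implicitly, and a real model would use an optimized tensor library.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def causal_fft_conv(h, x):
    """Causal convolution y[t] = sum_{s<=t} h[s] * x[t-s], assuming len(h) <= len(x)."""
    length = len(x)
    n = 1
    while n < 2 * length:  # zero-pad so circular wrap-around cannot occur
        n *= 2
    h_f = fft(list(h) + [0.0] * (n - len(h)))
    x_f = fft(list(x) + [0.0] * (n - length))
    y = fft([p * q for p, q in zip(h_f, x_f)], invert=True)
    return [v.real / n for v in y[:length]]


def gated_long_conv(gate, h, v):
    """Elementwise gate applied to the long convolution of h with v."""
    return [g * c for g, c in zip(gate, causal_fft_conv(h, v))]


x = [1.0, 2.0, 3.0, 4.0]
h = [1.0, 0.5, 0.25, 0.125]  # filter as long as the input: global receptive field
y = causal_fft_conv(h, x)
print([round(val, 6) for val in y])  # approximately [1.0, 2.5, 4.25, 6.125]
```

A direct evaluation of the same sum costs O(L^2); the FFT route gives the same result while scaling much better at long sequence lengths.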

We use a single-character tokenizer with a primary vocabulary of 4 nucleotides (plus special tokens), which enables single-nucleotide resolution, a first in genomic foundation models. In addition, the implicit long convolution gives each layer a **global receptive field**.
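
As a rough illustration of single-character tokenization, the sketch below maps each nucleotide to one token id. The vocabulary, the special-token names, and the id values are hypothetical and chosen only for the example; the real tokenizer's ids and special tokens may differ.

```python
# Hypothetical vocabulary for illustration only; the ids used by the
# actual HyenaDNA tokenizer may differ.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "A": 2, "C": 3, "G": 4, "T": 5}
ID_TO_TOKEN = {i: tok for tok, i in VOCAB.items()}


def encode(sequence):
    """Map each character to one token id; unknown characters become [UNK]."""
    unk = VOCAB["[UNK]"]
    return [VOCAB.get(ch, unk) for ch in sequence.upper()]


def decode(ids):
    """Map token ids back to their string tokens and join them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("ACGTN")
print(ids)  # [2, 3, 4, 5, 1]
```

Because every nucleotide gets its own token, the token sequence is as long as the DNA sequence, which is why long context lengths matter at this resolution.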

We pretrain using next-token (nucleotide) prediction on the human reference genome (HG38).
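
Next-token prediction can be sketched as follows: the targets are the inputs shifted by one position, and the loss is the average negative log-probability the model assigns to each true next nucleotide. This is a simplified illustration with hypothetical helper names and token ids, using a uniform guess in place of a real model.

```python
import math


def next_token_pairs(ids):
    """Inputs are all tokens but the last; targets are the same tokens shifted by one."""
    return ids[:-1], ids[1:]


def mean_cross_entropy(probs, targets):
    """probs[t][k] is the predicted probability of token k at position t."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


ids = [0, 1, 2, 3, 0, 1]  # hypothetical ids for "ACGTAC"
inputs, targets = next_token_pairs(ids)

# A model that guesses uniformly over 4 nucleotides scores log(4) per token.
uniform = [[0.25] * 4 for _ in inputs]
loss = mean_cross_entropy(uniform, targets)
print(inputs, targets, loss)
```

A trained model should reach a loss below this uniform baseline of log(4).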

HyenaDNA sets new SotA on 23 downstream tasks, including predicting regulatory elements, chromatin profiles, and species classification. We also explore what new capabilities open up with long context in genomics, including the first use of in-context learning with soft-prompt tunable tokens and instruction fine-tuning.

Check out our [blog](https://hazyresearch.stanford.edu/blog/2023-06-29-hyena-dna) for more details on HyenaDNA!

### Authors

Eric Nguyen*, Michael Poli*, Marjan Faizi*, Armin Thomas, Callum Birch-Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli, Yoshua Bengio, Stefano Ermon, Stephen Baccus, Chris Ré.

**Contact**

- Eric Nguyen, etnguyen@stanford.edu
- Michael Poli, poli@stanford.edu
- Marjan Faizi, Marjan_Faizi@hms.harvard.edu


## Citation


Feel free to cite us :)

```
@article{nguyen2023hyenadna,
      title={HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution},
      author={Eric Nguyen and Michael Poli and Marjan Faizi and Armin Thomas and Callum Birch-Sykes and Michael Wornow and Aman Patel and Clayton Rabideau and Stefano Massaroli and Yoshua Bengio and Stefano Ermon and Stephen A. Baccus and Chris Ré},
      year={2023},
      eprint={2306.15794},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```