Commit 8a1e62e by exnx (1 parent: 59a6457)

update readme

Files changed (1): README.md +23 -19
README.md CHANGED
````diff
@@ -4,15 +4,15 @@ license: bsd-3-clause
 
 # HyenaDNA
 
-Welcome! HyenaDNA is a genomic foundation model pretrained on the human reference genome with sequence lengths of up to **1 million tokens** at **single nucleotide resolution**.
+Welcome! HyenaDNA is a long-range genomic foundation model pretrained on context lengths of up to **1 million tokens** at **single nucleotide resolution**.
 
 See below for an [overview](#model) of the model and training. Better yet, check out these resources.
 
-**Checkout our other resources:**
+**Resources:**
 
-- [arxiv](https://arxiv.org/abs/2302.10866) (placeholder)
-- [blog](https://hazyresearch.stanford.edu/blog/2023-03-07-hyena) (placeholder)
-- [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL#scrollTo=dgkZggOetkbR)
+- [arxiv](https://arxiv.org/abs/2306.15794)
+- [blog](https://hazyresearch.stanford.edu/blog/2023-06-29-hyena-dna)
+- [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)
 - [github](https://github.com/HazyResearch/hyena-dna)
 
 
@@ -28,9 +28,9 @@ See below for an [overview](#model) of the model and training. Better yet, check
 ### Sample snippet
 
 
-This code example lets you select which pretrained model to load from HuggingFace and do inference to get embeddings.
+This code example lets you select which pretrained model to load from HuggingFace, perform inference and get embeddings.
 
-See the `huggingface.py` script in the main [github](https://github.com/HazyResearch/hyena-dna), or the [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL#scrollTo=dgkZggOetkbR) for these classes.
+See the [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing) for these classes, or the ['huggingface.py'](https://github.com/HazyResearch/hyena-dna/blob/main/huggingface.py) script in the main [github](https://github.com/HazyResearch/hyena-dna).
 
 
 ```python
@@ -44,7 +44,7 @@ model = HyenaDNAPreTrainedModel.from_pretrained(
     pretrained_model_name,
 )
 
-# create tokenizer
+# create tokenizer, no training involved :)
 tokenizer = CharacterTokenizer(
     characters=['A', 'C', 'G', 'T', 'N'],  # add DNA characters
     model_max_length=max_length,
@@ -70,16 +70,16 @@ print(embeddings.shape) # embeddings here!
 
 ### How to use pretrained weights
 
-- [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL#scrollTo=dgkZggOetkbR)
+- [colab](https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)
 
-The colab is the easiest entry point, you can finetune a small model, and do inference on DNA sequences up to 450k on the free tier (T4 GPU), and up to 1 million on the paid tier (A100). It handles all the HuggingFace integration for you, so it's helpful to see how.
+The colab is the easiest entry point, you can finetune a small model, and do inference on DNA sequences up to 450k on the free tier (T4 GPU), and up to 1 million on the paid tier (A100). It handles all the HuggingFace integration for you, so it's helpful to see this example first.
 
 - [github](https://github.com/HazyResearch/hyena-dna)
 
-Otherwise, checkout of the main HyenaDNA repo for how to load weights into Pytorch Lightning. We use Pytorch Lightning for pretraining and fine-tuning most of our models. If you want to use our actual pretraining code, you can clone this HuggingFace repo to download the actual weights.ckpt, and then pass it.
+Otherwise, checkout of the main HyenaDNA repo for how to load weights into Pytorch Lightning. We use Pytorch Lightning for pretraining and fine-tuning all of our models. If you want to use our actual pretraining code, you can clone this HuggingFace repo to download the actual weights.ckpt, and then pass it to Pytorch Lightning via command line or config. See the [github](https://github.com/HazyResearch/hyena-dna) README for how to do all that.
 
 
-If you want a standalone version that's easy to port into your own code, we have that and a HuggingFace example in the repo too, under `huggingface.py`.
+If you want a standalone version that's easy to port into your own code (and not tied to our repo or Pytorch Lightning), we have that and a HuggingFace example in ['huggingface.py'](https://github.com/HazyResearch/hyena-dna/blob/main/huggingface.py) too.
 
 
 ## Model & Training Overview
@@ -96,11 +96,11 @@ We pretrain using next token (nucleotide) prediction on the human reference geno
 HyenaDNA sets new SotA on 23 downstream tasks including predicting regulatory elements, chromatin profiles, and species classification. We also explore what new capabilities open up with long context in genomics, including the first use of in-context learning with soft prompt tuneable tokens and instruction fine-tuning.
 
 
-Check out our [blog](https://hazyresearch.stanford.edu/blog/2023-03-07-hyena) for more details on HyenaDNA!
+Check out our [blog](https://hazyresearch.stanford.edu/blog/2023-06-29-hyena-dna) for more details on HyenaDNA!
 
 ### Authors
 
-Eric Nguyen*, Michael Poli*, Marjan Faizi*, Armin Thomas, Callum Birch Sykes, Michael Wornow, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, Stefano Ermon, Stephen Baccus, Chris Re.
+Eric Nguyen*, Michael Poli*, Marjan Faizi*, Armin Thomas, Callum Birch-Sykes, Michael Wornow, Aman Patel, Stefano Massaroli, Clayton Rabideau, Yoshua Bengio, Stefano Ermon, Stephen Baccus, Chris Re.
 
 **Contact**
 
@@ -112,13 +112,17 @@ Marjan Faizi, Marjan_Faizi@hms.harvard.edu
 ## Citation
 
 
-If you use HyenaDNA in your work, feel free to cite us :)
+Feel free to cite us :)
 
 ```
-
-
-
-
+@article{nguyen2023hyenadna,
+      title={HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution},
+      author={Eric Nguyen and Michael Poli and Marjan Faizi and Armin Thomas and Callum Birch-Sykes and Michael Wornow and Aman Patel and Clayton Rabideau and Stefano Massaroli and Yoshua Bengio and Stefano Ermon and Stephen A. Baccus and Chris Ré},
+      year={2023},
+      eprint={2306.15794},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
 
 ```
````
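The `CharacterTokenizer` in the diff's snippet maps each nucleotide to its own token id. As a rough mental model, here is a hypothetical minimal sketch of that idea; `TinyCharTokenizer` and its methods are illustrative names, not the actual class from `huggingface.py`, which follows the HuggingFace tokenizer API and handles special tokens, padding, and truncation.

```python
# Hypothetical minimal sketch of a character-level DNA tokenizer.
# Illustrates the single-nucleotide vocabulary idea behind
# CharacterTokenizer; not the real implementation.

class TinyCharTokenizer:
    def __init__(self, characters, model_max_length):
        self.model_max_length = model_max_length
        # reserve id 0 for unknown characters; one id per nucleotide after that
        self.vocab = {"[UNK]": 0}
        for ch in characters:
            self.vocab[ch] = len(self.vocab)
        self.inv_vocab = {i: ch for ch, i in self.vocab.items()}

    def encode(self, sequence):
        if len(sequence) > self.model_max_length:
            raise ValueError("sequence longer than model_max_length")
        return [self.vocab.get(ch, 0) for ch in sequence.upper()]

    def decode(self, ids):
        return "".join(self.inv_vocab[i] for i in ids)


tokenizer = TinyCharTokenizer(["A", "C", "G", "T", "N"], model_max_length=1_000_000)
ids = tokenizer.encode("ACGTN")
print(ids)                    # [1, 2, 3, 4, 5]
print(tokenizer.decode(ids))  # ACGTN
```

Because the vocabulary is just five nucleotides plus special tokens, sequences are tokenized one base per token, which is what makes single-nucleotide resolution possible.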
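The overview in the README says HyenaDNA is pretrained with next-token (nucleotide) prediction. As a sketch of how inputs and labels line up under that objective, here is the standard shift-by-one setup used in language-model pretraining; the function name is illustrative and the actual training loop lives in the main repo under Pytorch Lightning.

```python
# Sketch of next-token-prediction data preparation: the label at
# position t is the token at position t + 1. Illustrative only.

def shift_for_next_token(ids):
    """Split a token-id sequence into model inputs and targets."""
    inputs = ids[:-1]   # model sees tokens 0..n-2
    targets = ids[1:]   # and is trained to predict tokens 1..n-1
    return inputs, targets


# toy ids for the sequence A C G T (A=1, C=2, G=3, T=4)
inputs, targets = shift_for_next_token([1, 2, 3, 4])
print(inputs)   # [1, 2, 3]
print(targets)  # [2, 3, 4]
```

At each position the model predicts the next nucleotide from everything before it, which is why a long context window lets the model condition on up to a million preceding bases.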