togethercomputer/evo-1-8k-base

@@ -39,6 +39,8 @@ As part of our commitment to open science, we release **weights of 15 intermedia
 StripedHyena is a deep signal processing, hybrid architecture composed of multi-head attention and gated convolutions arranged in [Hyena](https://arxiv.org/abs/2302.10866) blocks, improving over decoder-only Transformers.
 Some highlights of the architecture:
 - **Efficient autoregressive generation** via a recurrent mode (>500k generation with a single 80GB GPU)
 - **Significantly faster training and finetuning** at long context (>3x at 131k)
@@ -54,10 +56,10 @@ Some highlights of the architecture:
 One of the advantages of deep signal processing models is their flexibility. Different parametrizations of convolutions can be used depending on the memory, expressivity and causality requirements of pretraining, finetuning or inference workloads.
 The main classes are:
-- Modal: unconstrained poles ([reference](https://arxiv.org/pdf/2203.14343.pdf), [reference](https://arxiv.org/abs/2310.18780)), or constrained poles ([reference](https://arxiv.org/abs/2206.11893), [reference](https://arxiv.org/pdf/2303.06349.pdf))
-- Canonical / Rational: TBA
 - Hypernetworks: hypernetwork ([reference](https://arxiv.org/abs/2102.02611)), modulated hypernetwork ([reference](https://arxiv.org/abs/2302.10866)).
-- Explicit: modulated explicit ([reference](https://arxiv.org/pdf/2210.09298.pdf))
 StripedHyena is a mixed precision model. Make sure to keep your `poles` and `residues` in `float32` precision, especially for longer prompts or training.

 StripedHyena is a deep signal processing, hybrid architecture composed of multi-head attention and gated convolutions arranged in [Hyena](https://arxiv.org/abs/2302.10866) blocks, improving over decoder-only Transformers.
+StripedHyena is designed to leverage the specialization of each of its layer classes, with Hyena layers implementing the bulk of the computation required for sequence processing and attention layers supplementing the ability to perform targeted pattern recall.
 Some highlights of the architecture:
 - **Efficient autoregressive generation** via a recurrent mode (>500k generation with a single 80GB GPU)
 - **Significantly faster training and finetuning** at long context (>3x at 131k)
 One of the advantages of deep signal processing models is their flexibility. Different parametrizations of convolutions can be used depending on the memory, expressivity and causality requirements of pretraining, finetuning or inference workloads.
 The main classes are:
+- Modal: unconstrained poles ([reference](https://arxiv.org/pdf/2203.14343.pdf), [reference](https://arxiv.org/abs/2310.18780)), or constrained poles ([reference](https://arxiv.org/abs/2206.11893), [reference](https://arxiv.org/pdf/2303.06349.pdf)).
+- Canonical / Rational: TBA.
 - Hypernetworks: hypernetwork ([reference](https://arxiv.org/abs/2102.02611)), modulated hypernetwork ([reference](https://arxiv.org/abs/2302.10866)).
+- Explicit: modulated explicit ([reference](https://arxiv.org/pdf/2210.09298.pdf)).
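To make the modal class above more concrete, here is a minimal pure-Python sketch of a single-channel modal filter. It assumes the common form h[t] = Re(sum_n residues[n] * poles[n]^t); the function names and numeric values are hypothetical and are not taken from the StripedHyena codebase. It shows why a modal parametrization admits both a convolution mode and a constant-memory recurrent mode.

```python
# Illustrative sketch only, not StripedHyena's implementation.
# `poles` and `residues` mirror the names used in the text; values are made up.

def modal_filter(poles, residues, length):
    """Materialize h[t] = Re(sum_n residues[n] * poles[n] ** t) for t < length."""
    return [
        sum(r * p ** t for p, r in zip(poles, residues)).real
        for t in range(length)
    ]

def causal_conv(h, u):
    """Convolution mode: y[t] = sum_{s <= t} h[t - s] * u[s]."""
    return [sum(h[t - s] * u[s] for s in range(t + 1)) for t in range(len(u))]

def recurrent(poles, residues, u):
    """Recurrent mode: one state per pole, so memory does not grow with length."""
    state = [0j] * len(poles)
    out = []
    for u_t in u:
        # x_n <- p_n * x_n + u_t, then y_t = Re(sum_n r_n * x_n)
        state = [p * x + u_t for p, x in zip(poles, state)]
        out.append(sum(r * x for r, x in zip(residues, state)).real)
    return out

# Hypothetical example: one complex-conjugate pole pair plus one real pole.
poles = [0.9 + 0.2j, 0.9 - 0.2j, 0.5]
residues = [0.5 - 0.1j, 0.5 + 0.1j, 1.0]
u = [1.0, 0.0, -2.0, 0.5, 3.0]

y_conv = causal_conv(modal_filter(poles, residues, len(u)), u)
y_rec = recurrent(poles, residues, u)
```

Both modes produce the same outputs up to floating-point error. The recurrent form only keeps one state value per pole, which is the kind of property that makes long autoregressive generation cheap.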
 StripedHyena is a mixed precision model. Make sure to keep your `poles` and `residues` in `float32` precision, especially for longer prompts or training.
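The precision advice above can be illustrated with a small standard-library sketch. It assumes the lower-precision format is bfloat16 (the text only says "mixed precision") and uses a hypothetical pole value of 0.999. Only the storage rounding is emulated; the powers themselves are computed in Python floats.

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bfloat16(x):
    """Emulate bfloat16 storage: keep the top 16 bits of the float32 encoding,
    rounding to nearest even. Intended for finite values only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on dropped bits
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

pole = 0.999    # hypothetical pole close to the unit circle
steps = 4096    # hypothetical sequence length

pole_f32 = to_float32(pole)
pole_bf16 = to_bfloat16(pole)

# The filter contribution of a pole after `steps` positions is pole ** steps.
decay_f32 = pole_f32 ** steps
decay_bf16 = pole_bf16 ** steps

print(f"float32 pole:  {pole_f32!r}, decay after {steps} steps: {decay_f32:.6f}")
print(f"bfloat16 pole: {pole_bf16!r}, decay after {steps} steps: {decay_bf16:.6f}")
```

Under these assumptions, 0.999 rounds to exactly 1.0 in bfloat16, so the pole no longer decays at all, while the float32 copy still does. Small rounding errors in a pole are raised to the power of the sequence length, which is why they matter most for long prompts.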