Evo1-1-7B-8K
A clean, minimal HuggingFace port of Evo 1 (8k), the original ~7B-parameter StripedHyena DNA foundation model. Native support for layer-by-layer hidden state extraction, attention-weight extraction, and a runtime-switchable attention backend.
Why this port?
togethercomputer/evo-1-8k-base ships a trust_remote_code HF implementation but it has four gaps that force every downstream user to monkey-patch the model:
- output_hidden_states=True is hardcoded to None (intermediate embeddings require forward hooks).
- output_attentions=True is unsupported (flash-attn discards the (B, H, T, T) matrix; users must patch the attention module).
- attn_implementation cannot be switched at load time - flash_attn is mandatory at every attention layer.
- The bare backbone is not exposed via AutoModel.from_pretrained; only the LM-head wrapper exists.
This repo fixes all four. The math is bit-exact with the togethercomputer reference (max_abs_diff = 0.000e+00 at every layer; see Parity Verification).
Architecture
| Parameter | Value |
|---|---|
| Total parameters | ~7B |
| Layers | 32 |
| Attention heads | 32 |
| Embedding dimension | 4096 |
| Inner MLP size | 10928 |
| Vocabulary size | 512 (UTF-8 byte-level) |
| Attention layer indices | [8, 16, 24] |
| Hyena layer indices | all others |
| Hyena state size | 8 |
| Positional encoding | RoPE (base = 10000) |
| Architecture | StripedHyena (Hyena blocks interleaved with MHA blocks) |
| Max sequence length | 8 192 |
| Training dtype | bfloat16 (Hyena modal-form poles / residues kept in fp32) |
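The block layout implied by the table can be written out explicitly. This is a minimal sketch, assuming the attention layer indices are 0-based block indices; the names NUM_LAYERS, ATTN_LAYER_IDXS and block_types are illustrative and are not taken from the repo's config.

```python
# Values copied from the architecture table above.
NUM_LAYERS = 32
ATTN_LAYER_IDXS = [8, 16, 24]  # assumed to be 0-based block indices
HEAD_DIM = 4096 // 32          # embedding dimension / attention heads

# One label per block: "attention" at the listed indices, "hyena" elsewhere.
block_types = [
    "attention" if i in ATTN_LAYER_IDXS else "hyena"
    for i in range(NUM_LAYERS)
]

num_attention = block_types.count("attention")
num_hyena = block_types.count("hyena")
print(num_attention, num_hyena, HEAD_DIM)  # 3 29 128
```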
Pretraining
- Objective: causal byte-level next-token prediction.
- Data: OpenGenome, ~300B tokens of prokaryotic whole-genome DNA.
- Source checkpoint: togethercomputer/evo-1-8k-base@1.1_fix.
Parity Verification
Hidden-state representations were verified bit-exact (max_abs_diff = 0.000e+00) against the togethercomputer reference at all 34 representation levels (token embedding + each of the 32 blocks + final RMSNorm), using attn_implementation="flash_attention_2" in bf16 (matches the reference's backend choice and the trained dtype). Logits from Evo1ForCausalLM were also verified bit-exact (top-1 agreement: 128/128 positions). Verified on H100 with PyTorch 2.7.1 / CUDA 12.9.
Numerical equivalence across attention backends
flash_attention_2 is bit-exact with the original togethercomputer / evo-design implementations (same CUDA kernel). The sdpa and eager backends use different kernels (PyTorch's bundled flash kernel and pure-PyTorch matmul, respectively); these compute mathematically equivalent attention but accumulate floating-point operations in slightly different orders, producing per-block diffs at the bf16 noise floor (relative error roughly 1e-4 to 1e-2).
Unlike a standard transformer, where attention is softmax-bounded and per-block diffs stay small through the stack, StripedHyena's Hyena layers use an unbounded-gain IIR filter (no softmax) - so any small per-attention-block diff gets amplified by Hyena's filter gain. Across 32 layers this compounds to ~1% relative error in the intermediate residual stream, though the final post-RMSNorm output is bounded. Use flash_attention_2 if you need to match the reference's activations bit-for-bit.
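The backend differences described above come from floating-point addition not being associative. The effect is easy to reproduce with plain Python floats (fp64 here rather than bf16, purely for illustration). The helpers max_abs_diff and max_rel_err are hypothetical comparison functions of the kind one might use on two activation vectors; they are not part of this repo.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def max_rel_err(a, ref, eps=1e-12):
    """Largest element-wise error relative to the reference values."""
    return max(abs(x - y) / (abs(y) + eps) for x, y in zip(a, ref))


# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(max_abs_diff([left_to_right], [right_to_left]))
print(max_rel_err([left_to_right], [right_to_left]))
```

In bf16 the same reordering produces much larger differences, because bf16 keeps far fewer mantissa bits than fp64.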
Related Models
See the full Evo1 collection on the Hub.
| Model | Context | Notes |
|---|---|---|
| Taykhoom/Evo1-1-7B-8K | 8 192 | Original Evo 1 base model (8k context). |
| Taykhoom/Evo1-1-7B-131K | 131 072 | Long-context Evo 1 with linearly-scaled RoPE (131k context). |
| Taykhoom/Evo1-1.5-7B-8K | 8 192 | Evo 1.5: Evo 1 (8k) further trained on ~50% more pretraining tokens. |
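For readers unfamiliar with linear RoPE scaling (position interpolation): positions are divided by a constant before the rotary angles are computed, so a longer context maps onto the position range seen at the shorter training length. The sketch below assumes a scale factor of 131072 / 8192 = 16 and a head dimension of 4096 / 32 = 128, both derived from the tables in this card. The function name rope_angles and the exact factor used by the 131K checkpoint are assumptions, not values read from that model's config.

```python
ROPE_BASE = 10000.0      # from the architecture table
HEAD_DIM = 4096 // 32    # embedding dimension / attention heads = 128
SCALE = 131072 / 8192    # assumed linear scaling factor = 16.0


def rope_angles(position, scale=1.0, head_dim=HEAD_DIM, base=ROPE_BASE):
    """Rotary angles for one position: one angle per pair of channels."""
    inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    return [(position / scale) * f for f in inv_freq]


# With linear scaling, position 16 * p in the long-context model receives the
# same angles as position p in the unscaled model.
same = rope_angles(160, scale=SCALE) == rope_angles(10)
print(same)  # True
```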
Usage
Note on dtype. Evo1 was trained in bfloat16, with the Hyena poles/residues (modal-form filter parameters) kept in fp32 for numerical stability. Passing dtype=... to from_pretrained only affects the initial load precision (peak memory during loading) and does not change the inference dtype - Evo1Model.__init__ and Evo1ForCausalLM.__init__ unconditionally call to_bfloat16_except_poles_residues(), so the model always runs in bf16 with poles/residues in fp32. This is intentional: the trained activations are bf16-stable and fp16-unstable, and the modal-form filter requires fp32 for numerical stability - a single mixed config is the only valid one.
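The mixed-precision rule in the dtype note can be stated as a small decision function. This is only a sketch of the rule: it assumes parameters are selected by the substrings "poles" and "residues" in their names, and the example parameter names are hypothetical. The actual to_bfloat16_except_poles_residues() in this repo may select parameters differently.

```python
def target_dtype(param_name: str) -> str:
    """Dtype a parameter is assumed to end up in after model construction."""
    if "poles" in param_name or "residues" in param_name:
        return "float32"   # modal-form filter parameters stay in fp32
    return "bfloat16"      # everything else is cast to bf16


# Hypothetical parameter names, for illustration only.
example_names = [
    "blocks.0.filter.poles",
    "blocks.0.filter.residues",
    "blocks.8.inner_mha_cls.Wqkv.weight",
]
for name in example_names:
    print(name, "->", target_dtype(name))
```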
Note on attention backend. By default, from_pretrained selects attn_implementation="sdpa" (PyTorch's bundled scaled-dot-product-attention) - this works out of the box without flash_attn installed. The original togethercomputer / evo-design implementations use flash_attn unconditionally; for bit-exact reproduction of reference outputs, explicitly pass attn_implementation="flash_attention_2" (and pip install flash-attn). See Numerical equivalence across attention backends for the magnitude of the difference.
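One way to choose a backend programmatically is to check whether flash_attn is importable and fall back to sdpa otherwise. The helper below, pick_attn_implementation, is a hypothetical convenience function and not part of this repo. It only checks that the package is installed, not that a compatible GPU is present.

```python
import importlib.util


def pick_attn_implementation(need_attention_weights: bool = False) -> str:
    """Return a value suitable for the attn_implementation argument."""
    if need_attention_weights:
        # The eager path materializes the full attention matrix.
        return "eager"
    if importlib.util.find_spec("flash_attn") is not None:
        # Same kernel family as the reference implementation.
        return "flash_attention_2"
    # PyTorch's bundled scaled-dot-product attention; no extra install needed.
    return "sdpa"


attn_impl = pick_attn_implementation()
print(attn_impl)
```

The returned string can be passed as attn_implementation=attn_impl to from_pretrained.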
Embedding generation (no LM head)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # bit-exact with reference; or omit to default to "sdpa"
).cuda().eval()

seqs = ["ACGTACGTACGT", "GGGTTTAAACCC"]
inputs = tokenizer(seqs, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

last_hidden = out.last_hidden_state  # (B, T, 4096)
all_layers = out.hidden_states       # tuple of (B, T, 4096), len = 34 (embed + 32 blocks + post-norm)
layer_12_emb = all_layers[12]        # often used as a "middle" representation
LM logits
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).cuda().eval()

inputs = tokenizer(["ACGT"], return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits  # (1, T, 512)
Generation
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).cuda().eval()

inputs = tokenizer(["ACGT"], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_k=4, temperature=1.0)
print(tokenizer.decode(out[0]))
generation_config.json ships with eos_token_id = 0 (the EOD byte) and pad_token_id = 1 so model.generate() stops naturally at the trained end-of-document token without needing extra kwargs. Note that the tokenizer itself does not add an EOS at encoding time - this matches the original Evo1 inference pipeline (only generation stops on EOS; embedding/scoring uses raw byte input).
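When generating for a batch, rows that finish early are typically followed by padding, so it can be useful to cut each row at the first EOS before decoding. The sketch below uses the eos_token_id and pad_token_id values stated above. The helper strip_after_eos is hypothetical, and decoding with bytes(...) assumes each token id equals its byte value, which follows from the byte-level tokenizer description but is not verified against the tokenizer code.

```python
EOS_TOKEN_ID = 0  # from generation_config.json, per the text above
PAD_TOKEN_ID = 1


def strip_after_eos(token_ids, eos_id=EOS_TOKEN_ID):
    """Keep tokens up to (not including) the first EOS; keep all if none."""
    kept = []
    for token in token_ids:
        if token == eos_id:
            break
        kept.append(token)
    return kept


generated = [65, 67, 71, 84, EOS_TOKEN_ID, PAD_TOKEN_ID, PAD_TOKEN_ID]
kept = strip_after_eos(generated)
print(bytes(kept).decode("utf-8"))  # ACGT
```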
Attention weights
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="eager",  # required for output_attentions to populate
).cuda().eval()

inputs = tokenizer(["ACGTACGT"], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple of length 32. Entries at indices not in [8, 16, 24]
# are None (Hyena blocks have no attention matrix). Entries at [8, 16, 24] are
# (B, num_heads, T, T) tensors.
attn_block_8 = out.attentions[8]
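Because most entries of out.attentions are None, downstream code usually wants only the populated ones. A minimal sketch, using strings as stand-ins for the attention tensors so that it runs without the model; attention_by_layer is a hypothetical helper, not part of this repo.

```python
ATTN_LAYER_IDXS = [8, 16, 24]  # attention block indices from the architecture table


def attention_by_layer(attentions):
    """Map block index -> attention entry, skipping the None (Hyena) entries."""
    return {i: a for i, a in enumerate(attentions) if a is not None}


# Stand-in for out.attentions: a length-32 tuple with None at Hyena blocks.
fake_attentions = tuple(
    f"attn_{i}" if i in ATTN_LAYER_IDXS else None for i in range(32)
)

per_layer = attention_by_layer(fake_attentions)
print(sorted(per_layer))  # [8, 16, 24]
```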
Multi-GPU loading (optional)
Loading via accelerate's device_map is supported (_no_split_modules is set so each AttentionBlock / ParallelGatedConvBlock stays atomic on one device, with hidden state automatically transferred across device boundaries):
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",  # auto-shard across all visible GPUs; falls back to single GPU if only one is present
).eval()
Requires pip install accelerate.
Fine-tuning
Standard HuggingFace conventions. For sequence-level tasks, take the final last_hidden_state (or any intermediate hidden_states[i]) and feed it into a downstream head.
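For sequence-level tasks, one common way to turn per-token hidden states into a single vector is a masked mean over the non-padding positions. The sketch below shows the arithmetic on plain nested lists for a single sequence so that it runs without the model; in practice the same operation would be done on the (B, T, 4096) tensor with the tokenizer's attention_mask. The function masked_mean_pool is illustrative, and mean pooling is one option among several, not a recommendation from the model authors.

```python
def masked_mean_pool(hidden, attention_mask):
    """Average the token vectors whose mask entry is 1.

    hidden: list of T token vectors (each a list of H floats).
    attention_mask: list of T ints, 1 = real token, 0 = padding.
    """
    kept = [vec for vec, m in zip(hidden, attention_mask) if m]
    if not kept:
        raise ValueError("sequence has no unmasked tokens")
    n = len(kept)
    # zip(*kept) iterates over channels; average each channel over tokens.
    return [sum(channel) / n for channel in zip(*kept)]


hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [2.0, 3.0]
```

The pooled vector is what would be fed into a downstream classification or regression head.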
Implementation Notes
- Custom attention module (attention.py). Replaces flash_attn.modules.mha.MHA with a small in-repo MHA class that supports attn_implementation="eager" / "sdpa" / "flash_attention_2". Parameter names (Wqkv, out_proj, rotary_emb.inv_freq) are preserved so existing checkpoints load unchanged. When output_attentions=True, the sdpa and flash paths automatically fall back to eager so the attention matrix is materialized.
- Custom rotary embedding (rotary.py). When flash_attn is installed we delegate to its Triton kernel (faster on long sequences). The pure-PyTorch fallback does the rotary multiply in fp32 internally (then casts back) so that its results are bit-identical to the Triton kernel's; a bf16 multiply here introduces ~3e-2 error per layer that compounds to ~1% relative across 32 layers.
- Hyena engine (engine.py). Copied verbatim from the togethercomputer reference (FFT-based long convolution, modal-form prefill).
- Cache subclass (cache.py). Evo1Cache(transformers.cache_utils.Cache) wraps the two block-type-specific inference param dataclasses (InferenceParams for attention KV cache, RecurrentInferenceParams for Hyena FIR window + IIR modal state). Exposes get_seq_length() / get_max_cache_shape() so HF's model.generate() can introspect cache state; falls through to cache["mha"] / cache["hyena"] for the model internals.
- Tokenizer (tokenization_evo1.py). Byte-level UTF-8 with vocab_size = 512. Pad token is byte \x01. No CLS, no EOS appended at encoding time (matches original Evo1 inference). The _decode method is numpy-2.x compatible (the original np.uint8.clip(min=32, max=512) overflows on numpy 2).
- Dependencies. torch, transformers, numpy, safetensors, huggingface_hub (only for from_pretrained downloads). flash_attn is only required if you pass attn_implementation="flash_attention_2".
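The tokenizer behaviour described in the list above can be approximated in a few lines. This is a sketch under stated assumptions: it takes each token id to be the UTF-8 byte value, and it pads on the right, which may not match the padding side configured in tokenization_evo1.py. The functions encode and pad_batch are illustrative and are not the repo's API.

```python
PAD_TOKEN_ID = 1  # byte \x01, per the tokenizer note above


def encode(seq: str) -> list:
    """Byte-level encoding: one token id per UTF-8 byte, no CLS or EOS added."""
    return list(seq.encode("utf-8"))


def pad_batch(batch, pad_id=PAD_TOKEN_ID):
    """Right-pad to the longest sequence; returns (input_ids, attention_mask)."""
    width = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return input_ids, attention_mask


ids, mask = pad_batch([encode("ACGT"), encode("AC")])
print(ids)   # [[65, 67, 71, 84], [65, 67, 1, 1]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```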
Citation
@article{nguyen2024_evo,
title = {Sequence modeling and design from molecular to genome scale with {Evo}},
author = {Nguyen, Eric and Poli, Michael and Durrant, Matthew G. and Kang, Brian and Katrekar, Dhruva and Li, David B. and Bartie, Liam J. and Thomas, Armin W. and King, Samuel H. and Brixi, Garyk and Sullivan, Jeremy and Ng, Madelena Y. and Lewis, Ashley and Lou, Aaron and Ermon, Stefano and Baccus, Stephen A. and Hernandez-Boussard, Tina and {R{\'e}}, Christopher and Hsu, Patrick D. and Hie, Brian L.},
journal = {Science},
volume = {386},
number = {6723},
pages = {eado9336},
year = {2024},
doi = {10.1126/science.ado9336}
}
Credits
Original Evo1 model and code by Nguyen et al. Source repo: evo-design/evo. Source checkpoint: togethercomputer/evo-1-8k-base.
The HuggingFace conversion code in this repo was authored primarily by Claude and reviewed manually by Taykhoom Dalal.
License
Apache 2.0, following the original Evo1 release.