BacHyenaDNA
BacHyenaDNA is a collection of foundation models pretrained on bacterial DNA at single-nucleotide resolution. The models are based on the HyenaDNA [1] model, which was originally trained on the hg38 human genome. See our GitHub repository for more information.
Currently available species:
- Pseudomonas aeruginosa (seqlen = 32k, model_dim = 768)
BacHyenaDNA-Paeruginosa-32k-d768 overview
BacHyenaDNA-Paeruginosa-32k-d768 uses 10 HyenaDNA block layers with inner_dim=3072, model_dim=768, and max_seq_len=32768. It was pretrained on complete P. aeruginosa genomes from the RefSeq database (802), all assembled P. aeruginosa genomes from GenBank (18466), and all P. aeruginosa plasmid sequences found in PLSDB (373). Pretraining used next-token prediction with a vocabulary of 4 nucleotides plus special tokens.
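To make the pretraining objective concrete, here is a minimal sketch of how next-token prediction pairs can be formed from a nucleotide sequence. The mapping `VOCAB` and the helper `next_token_pairs` are hypothetical illustrations; they do not reproduce the special tokens or the token IDs used by the actual tokenizer.

```python
# Hypothetical nucleotide-to-ID mapping, for illustration only; the real
# tokenizer also defines special tokens and may assign different IDs.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}


def next_token_pairs(sequence):
    """Return (inputs, targets), where targets are the inputs shifted left by one."""
    ids = [VOCAB[base] for base in sequence.upper()]
    # At position i the model sees ids[: i + 1] and is trained to predict ids[i + 1].
    return ids[:-1], ids[1:]


inputs, targets = next_token_pairs("ACTG")
print(inputs)   # [0, 1, 3]
print(targets)  # [1, 3, 2]
```

Each target is simply the nucleotide that follows the corresponding input position, so no labels beyond the raw sequence are needed.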
Get model embeddings
The following code shows how to get embeddings from a selected model. The required classes can be found in our GitHub repository, BacHyenaDNA.
from huggingface import HyenaDNAPreTrainedModel
from standalone_hyenadna import CharacterTokenizer
import torch
# instantiate pretrained model
pretrained_model_name = 'BacHyenaDNA-Paeruginosa-32k-d768'
max_length = 32768
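# use the first GPU if available, otherwise fall back to CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"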
model = HyenaDNAPreTrainedModel.from_pretrained(
    'downloaded_models/',
    pretrained_model_name,
)
# create tokenizer
tokenizer = CharacterTokenizer(
    characters=['A', 'C', 'G', 'T', 'N'],
    model_max_length=max_length,
)
# create a sample
sequence = 'ACTG'
tok_seq = tokenizer(sequence)["input_ids"]
# place on device, convert to tensor
tok_seq = torch.LongTensor(tok_seq).unsqueeze(0).to(device) # unsqueeze for batch dim
# model
model.to(device)
model.eval()
# forward
with torch.inference_mode():
    embeddings = model(tok_seq)
print(embeddings)
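Complete bacterial genomes are much longer than the 32768-token context of this model, so a genome has to be split into windows before it can be embedded. The following is one possible sketch and is not part of the BacHyenaDNA repository: the helper `split_into_windows` and its `overlap` parameter are hypothetical, and the sketch ignores any special tokens the tokenizer may add, which would reduce the usable window length.

```python
def split_into_windows(sequence, window=32768, overlap=0):
    """Split a sequence into consecutive windows of at most `window` characters.

    Hypothetical helper for illustration; `overlap` is the number of
    characters shared between neighbouring windows.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    for start in range(0, len(sequence), step):
        windows.append(sequence[start:start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(sequence):
            break
    return windows


chunks = split_into_windows("ACGTACGTAC", window=4)
print(chunks)  # ['ACGT', 'ACGT', 'AC']
```

Each window could then be tokenized and passed through the model as in the example above. Whether to overlap windows, and how to combine the per-window embeddings, depends on the downstream task.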
GPU requirements (suggested)
GPU requirements for pretraining, fine-tuning, and inference: coming soon.
Reference
[1] Nguyen, E., Poli, M., Faizi, M., et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems, 2023, vol. 36, pp. 43177-43201.