A DNA language model with the GPT-2 architecture, pre-trained on human genome data.

Key features of our dnagpt models:

  1. BPE tokenization instead of k-mers (DNABERT uses k-mers; DNABERT2 also uses BPE)
  2. A GPT (decoder-only) model rather than a BERT-style model (DNABERT, GENA_LM)
  3. Pre-training on the latest T2T human genome assembly
  4. Details, including the training and BPE code: https://github.com/maris205/dnagpt
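
To make the k-mer versus BPE distinction in item 1 concrete, here is a minimal, illustrative sketch in plain Python. The function names `kmer_tokenize` and `learn_bpe_merges` are hypothetical and are not part of dnagpt or of any library. The toy BPE loop only demonstrates the idea of repeatedly merging the most frequent adjacent pair; it is not the training code from the repository above, which should be consulted for the actual procedure.

```python
from collections import Counter
from typing import List, Tuple


def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> List[str]:
    """Fixed-length tokens: stride=1 gives overlapping k-mers, stride=k non-overlapping ones."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def learn_bpe_merges(seq: str, num_merges: int) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Toy BPE: start from single bases and greedily merge the most frequent adjacent pair."""
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing is worth merging
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # replace the pair with one longer token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


# Arbitrary demo sequence (an assumption for illustration, not training data)
seq = "GAGCACATTCGCCGAGCACATTCGCC"
print(kmer_tokenize(seq, k=6)[:3])   # every token has length 6
merges, tokens = learn_bpe_merges(seq, num_merges=8)
print(tokens)                        # variable-length tokens
```

k-mer tokens always have the same length, whereas BPE tokens grow to cover frequently repeated subsequences, so the same sequence is usually represented with fewer tokens.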

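import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and split an example DNA sequence into BPE tokens
tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
dna = "GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG"  # example input
print(tokenizer.tokenize(dna))
# e.g. ['G', 'AGCAC', 'ATTCGCC', ...]

# Load the model and extract the last-layer hidden states
model = AutoModel.from_pretrained('dnagpt/human_gpt2-v1')
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
with torch.no_grad():  # inference only, no gradients needed
    hidden_states = model(inputs)[0]  # [1, sequence_length, 768]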

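# Sequence embedding via mean pooling over the token dimension
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # torch.Size([768])

# Sequence embedding via max pooling over the token dimension
# (torch.max with dim returns (values, indices); [0] keeps the values)
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # torch.Size([768])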
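
The pooling above handles a single sequence. When several sequences of different lengths are embedded in one padded batch, a plain mean would also average over the padding positions. The sketch below shows one common way to avoid that, assuming the batch comes with an attention mask (1 for real tokens, 0 for padding) such as the one returned by a Hugging Face tokenizer called with `padding=True`. The helper name `masked_mean_pool` is hypothetical, and random tensors stand in for real model outputs so the example runs without downloading the model.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [batch, seq_len, 1]
    summed = (hidden_states * mask).sum(dim=1)                   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1.0)                      # [batch, 1], avoids division by zero
    return summed / counts


# Stand-in data (an assumption): 2 sequences, 5 positions, hidden size 768
hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])  # second sequence has 2 padding positions
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # one 768-dimensional vector per sequence
```

With real inputs, `hidden` would be the model output for the padded batch and `mask` the tokenizer's `attention_mask`. Whether this tokenizer defines a padding token is not stated here, so that would need to be checked before batching.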

