
A DNA language model based on the GPT-2 architecture, pre-trained on human genome data.

Key features of our dnagpt models:

  1. BPE tokenization instead of k-mers (DNABERT and DNABERT-2 also use BPE); a training sketch follows this list
  2. GPT architecture rather than BERT (unlike DNABERT and GENA_LM)
  3. Pre-training on the latest T2T human genome assembly
  4. Details: https://github.com/maris205/dnagpt (includes the training and BPE tokenizer code)
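The repository linked above includes the BPE tokenizer training code; the snippet below is only a minimal sketch of that step using the Hugging Face tokenizers library. The corpus file name (human_t2t.txt), vocabulary size, and special tokens are illustrative assumptions, not the exact settings used for this model.

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# byte-pair-encoding tokenizer trained on plain-text genome chunks (one sequence per line)
dna_tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
dna_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["<unk>", "<eos>"])
dna_tokenizer.train(files=["human_t2t.txt"], trainer=trainer)  # hypothetical corpus file

print(dna_tokenizer.encode("GAGCACATTCGCCTGCGTGCGCACTCACACACACG").tokens)

The released tokenizer and model can be loaded directly from the Hub: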

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ...]

import torch

model = AutoModel.from_pretrained('dnagpt/human_gpt2-v1')

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # last hidden states, shape [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # torch.Size([768])

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # torch.Size([768])
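
As an illustrative follow-up (not part of the original card), the pooled embeddings can be compared directly, for example with cosine similarity. The sketch below reuses the tokenizer and model loaded above; the two sequences are arbitrary examples.

import torch
import torch.nn.functional as F

def embed(seq):
    # mean-pooled last-hidden-state embedding, shape [768]
    ids = tokenizer(seq, return_tensors='pt')["input_ids"]
    with torch.no_grad():
        hidden = model(ids)[0]
    return torch.mean(hidden[0], dim=0)

a = embed("ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC")
b = embed("ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGG")
print(F.cosine_similarity(a, b, dim=0).item())  # value in [-1, 1]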

