---
license: apache-2.0
datasets:
- dnagpt/human_genome_GCF_009914755.1
language:
- en
metrics:
- perplexity
library_name: transformers
tags:
- biology
---

A DNA language model based on GPT-2, trained on human genome data.

Key features of our dnagpt models:

1. BPE tokenization instead of k-mers (DNABERT and DNABERT-2 also use BPE)
2. A GPT model rather than a BERT-style model (DNABERT, GENA_LM)
3. Pre-trained on the latest T2T human genome assembly
4. Details, including the training and BPE code: https://github.com/maris205/dnagpt

Basic usage (a perplexity sketch follows the example below):

```
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ...]

model = AutoModel.from_pretrained('dnagpt/human_gpt2-v1')

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expect torch.Size([768])

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expect torch.Size([768])
```
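
Because the model is a GPT-2 causal language model and perplexity is the reported metric, the same checkpoint can also be scored as a language model. A minimal sketch, assuming the published weights include the GPT-2 LM head so that `AutoModelForCausalLM` restores it rather than initializing a new one:

```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
# assumption: the repo ships LM-head weights; if only the base model is
# published, the head would be randomly initialized and the score meaningless
lm_model = AutoModelForCausalLM.from_pretrained('dnagpt/human_gpt2-v1')
lm_model.eval()

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')

with torch.no_grad():
    # passing labels = input_ids returns the average next-token cross-entropy loss
    outputs = lm_model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"perplexity: {perplexity.item():.2f}")
```

Lower perplexity on held-out genomic sequences indicates a better fit of the language model to the data.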