---
license: mit
---

### Continuous Pre-training of DNA Sequence Data Based on the LLaMA Model

We perform continuous pre-training on DNA sequence data based on the LLaMA model, using a comprehensive and diverse dataset to further enhance the model's understanding and representation of genomic information. Specifically:

- **DNA sequence data:** Following the pre-training data approach of DNABERT, we extract fragments of 300 to 1,000 base pairs (bp) from multiple model organisms, for a total of approximately 16 GB of DNA sequence data.

By continuously pre-training the LLaMA model on this DNA sequence data, we keep the model current with the latest genomic discoveries and preserve its ability to generalize across different genomics tasks. This continual learning process improves the model's accuracy and robustness when handling complex biological sequences.

The example below loads the continually pre-trained model, shows how the DNA-LLaMA tokenizer segments a mixed DNA/English prompt, and runs text generation on both an English and a DNA prompt.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline

# Load the tokenizer and model continually pre-trained on DNA sequences
tokenizer = LlamaTokenizer.from_pretrained("dnagpt/llama-dna")
tokenizer.pad_token = tokenizer.eos_token
model = LlamaForCausalLM.from_pretrained("dnagpt/llama-dna")

# The adapted tokenizer segments both DNA sequences and English text
text = '''GCTGACTCTGCCAGGATGGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCTAGTCAGACATAGTCACATAGGCAAGTAAGGGAACCTAAAATTGCTTGGAAT, The primary use of LLaMA is research on large language models, including'''
print(f"Tokenized by DNA-LLaMA tokenizer: {tokenizer.tokenize(text)}")

# Text generation with an English prompt and a DNA prompt
model_id = "dnagpt/llama-dna"
pipe = pipeline(
    "text-generation",
    model=model_id,
    # torch_dtype=torch.bfloat16,
    device_map="auto",
)

print(pipe("The key to life is"))
print(pipe("GGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCT"))
```
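
The card does not include the training script for the continuous pre-training step itself. The snippet below is only a minimal sketch of how such a step could be set up with the Hugging Face `Trainer` and `DataCollatorForLanguageModeling`, assuming the 300 to 1,000 bp DNA fragments are stored one per line in a plain-text file; the file path `dna_sequences.txt`, the output directory, and all hyperparameters are illustrative assumptions, not the exact recipe used for this model.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    LlamaForCausalLM,
    LlamaTokenizer,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus: one 300-1000 bp DNA fragment per line of a text file.
dataset = load_dataset("text", data_files={"train": "dna_sequences.txt"})

tokenizer = LlamaTokenizer.from_pretrained("dnagpt/llama-dna")
tokenizer.pad_token = tokenizer.eos_token
model = LlamaForCausalLM.from_pretrained("dnagpt/llama-dna")

def tokenize(batch):
    # Truncate fragments to a fixed context length for causal LM training.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False -> causal language modeling; the collator derives labels from input ids.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="llama-dna-continued",   # illustrative output directory
    per_device_train_batch_size=4,      # illustrative hyperparameters
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=1e-5,
    logging_steps=100,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)

trainer.train()
```

Because the adapted tokenizer treats nucleotide fragments as ordinary text, no task-specific head is needed; the standard causal language modeling objective is reused unchanged for the DNA corpus.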