---
license: llama3
library_name: peft
language:
- en
tags:
- trl
- sft
- unsloth
- dna
base_model: unsloth/llama-3-8b-bnb-4bit
model-index:
- name: llama3-biotoken3pretrain-kaniwa
  results: []
---

# llama3-biotoken3pretrain-kaniwa

This is a LoRA adapter.

The base model is Llama 3 quantized by Unsloth: `unsloth/llama-3-8b-bnb-4bit`.

The tokenizer adds four "biotokens": ∎A, ∎C, ∎G, and ∎T.
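
A quick sanity check (not from the original card) that each biotoken is a single entry in the adapter's tokenizer vocabulary rather than being split into pieces:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/llama3-biotoken3pretrain-kaniwa")

# Each biotoken should encode to exactly one token id
for biotoken in ["∎A", "∎C", "∎G", "∎T"]:
    ids = tokenizer(biotoken, add_special_tokens=False)["input_ids"]
    print(biotoken, ids)
```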

The dataset was ~20% of BYU's 2019 kaniwa (*Chenopodium pallidicaule*) genome, from https://genomevolution.org/coge/GenomeInfo.pl?gid=53872

The adapter was finetuned for several hours on an A100 GPU. The genome was split into snippets of roughly 6,000 nucleotides, each wrapped in an Alpaca-like message format.

Training notebook (before copying over to Lambda): https://colab.research.google.com/drive/1IrRBC2LKlU7_7zjzmmzslT0uDOacwyfO?usp=sharing

Sample message:
```
Write information about the nucleotide sequence.

### Sequence:
∎G∎C∎C∎T∎A∎T∎A∎G∎T∎G∎T∎G∎T∎A∎G...

### Annotation:
Information about location in the kaniwa chromosome: >lcl|Cp5
```
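
The exact preprocessing lives in the linked training notebook; the sketch below only illustrates how one such training example could be assembled from a raw sequence and a chromosome header (the helper names `to_biotokens` and `make_message` are illustrative, not from the notebook):

```python
# Illustrative sketch only; the real preprocessing is in the linked Colab notebook
def to_biotokens(seq: str) -> str:
    """Prefix every nucleotide with the ∎ marker, e.g. 'GCA' -> '∎G∎C∎A'."""
    return "".join("∎" + nt.upper() for nt in seq)

def make_message(snippet: str, location: str) -> str:
    """Wrap one nucleotide snippet in the Alpaca-like format shown above."""
    return (
        "Write information about the nucleotide sequence.\n\n"
        f"### Sequence:\n{to_biotokens(snippet)}\n\n"
        f"### Annotation:\nInformation about location in the kaniwa chromosome: {location}"
    )

print(make_message("GCCTATAGTGTGTAG", ">lcl|Cp5"))
```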

## Usage

### Inference with DNA sequence

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained("monsoon-nlp/llama3-biotoken3pretrain-kaniwa", load_in_4bit=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/llama3-biotoken3pretrain-kaniwa")
tokenizer.pad_token = tokenizer.eos_token # pad fix

qed = "∎" # from math symbols, used in pretraining
sequence = "".join([(qed + nt.upper()) for nt in "GCCTATAGTGTGTAGCTAATGAGCCTAGGTTATCGACCCTAATCT"])

# Prompt pieces matching the pretraining message format shown above
prefix = "Write information about the nucleotide sequence.\n\n### Sequence:\n"
annotation = "\n\n### Annotation:\n"

inputs = tokenizer(f"{prefix}{sequence}{annotation}", return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
sample = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
```
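
If you only want the newly generated annotation text rather than the full prompt plus continuation, one way to slice it off, continuing from the snippet above, is the following (an optional post-processing step, not part of the original card):

```python
# Optional post-processing (not from the original card): decode only the tokens
# generated after the prompt
generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
annotation_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(annotation_text)
```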

### LoRA finetuning on a new task

```python
from transformers import AutoTokenizer
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, _ = FastLanguageModel.from_pretrained(
    model_name = "monsoon-nlp/llama3-biotoken3pretrain-kaniwa",
    max_seq_length = 6_500, # max 6,000 bp for AgroNT tasks
    dtype = None,
    load_in_4bit = True,
    resize_model_vocab = 128260, # includes biotokens
)
tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/llama3-biotoken3pretrain-kaniwa")
tokenizer.pad_token = tokenizer.eos_token # pad fix

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    # ... your train_dataset and other arguments for the new task
)
```
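
A minimal sketch of what a fuller trainer setup might look like, continuing from the snippet above and assuming a dataset with a "text" column plus the TRL version current when this adapter was trained; the data file name and hyperparameters below are placeholders, not from the original card:

```python
from datasets import load_dataset
from transformers import TrainingArguments

# Placeholder data: any dataset with a "text" column holding the full
# Alpaca-style prompt + answer would work here
dataset = load_dataset("json", data_files="my_dna_task.jsonl", split="train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",   # column containing the formatted messages
    max_seq_length = 6_500,        # matches the adapter's pretraining length
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        output_dir = "outputs",
    ),
)
trainer.train()
```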

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

### Genome Citation

Mangelson H, et al. The genome of *Chenopodium pallidicaule*: an emerging Andean super grain. Appl. Plant Sci. 2019;7:e11300. doi: 10.1002/aps3.11300