monsoon-nlp committed · Commit 4a226bb · 1 Parent(s): 3bb7c4d
Update README.md

README.md CHANGED
@@ -41,6 +41,49 @@ Write information about the nucleotide sequence.
Information about location in the kaniwa chromosome: >lcl|Cp5
```

## Usage

### Basic inference

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained("monsoon-nlp/llama3-biotokenpretrain-kaniwa", load_in_4bit=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/llama3-biotokenpretrain-kaniwa")
tokenizer.pad_token = tokenizer.eos_token  # pad fix

qed = "∎"  # from math symbols; used in pretraining as the marker placed before each nucleotide
sequence = "".join([(qed + nt) for nt in "GCCTATAGTGTGTAGCTAATGAGCCTAGGTTATCGACCCTAATCT"])

# prefix is the instruction and annotation is the start of the expected response;
# example values here follow the prompt format shown earlier in this README
prefix = "Write information about the nucleotide sequence."
annotation = "Information about location in the kaniwa chromosome:"

inputs = tokenizer(f"{prefix}{sequence}{annotation}", return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
sample = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
```
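The decoded `sample` contains the prompt followed by the generated text. A small follow-up sketch (not part of the original card; it only assumes the `inputs` and `outputs` variables from the block above) for reading out just the newly generated annotation:

```python
# Slice off the prompt tokens so only the newly generated annotation remains.
prompt_length = inputs["input_ids"].shape[1]
generated = tokenizer.batch_decode(outputs[:, prompt_length:], skip_special_tokens=True)[0]
print(generated)
```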

### LoRA finetuning on a new task

```python
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "monsoon-nlp/llama3-biotokenpretrain-kaniwa",
    max_seq_length = 7_000,  # max 6,000 bp for AgroNT tasks
    dtype = None,
    load_in_4bit = True,
    resize_model_vocab = 128260,  # includes biotokens
)
tokenizer.pad_token = tokenizer.eos_token  # pad fix

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    ...
)
```
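The `...` above is left for your task-specific dataset and training arguments. A minimal sketch of how it might be filled in, following the usual TRL pattern; the dataset file, the `text` column name, and the hyperparameters are placeholders for illustration, not values from the original training run (depending on your TRL version, `dataset_text_field` and `max_seq_length` may instead belong in an `SFTConfig`):

```python
from datasets import load_dataset
from transformers import TrainingArguments

# Hypothetical task dataset with a "text" column holding full prompt + label strings.
dataset = load_dataset("csv", data_files="my_agront_task.csv", split="train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 7_000,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        output_dir = "outputs",
    ),
)
trainer.train()
```

After training, the updated LoRA adapter can be saved with `model.save_pretrained("outputs")`.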

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.