This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

# Usage

## Import
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
import torch
from tqdm import tqdm
import json
```
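Besides `torch`, these imports assume the `transformers`, `bitsandbytes`, and `accelerate` packages are installed; the latter two are required for the 4-bit loading used below.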

## Settings
```python
# Access token obtained from Hugging Face
HF_TOKEN = "{Your hugging face token}"

# Model ID
model_name = "nishimura999/llm-jp-3-13b-finetune-v101"
```
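Hardcoding a token in a script is easy to leak. As a minimal alternative sketch (assuming you have exported the token as an `HF_TOKEN` environment variable beforehand), you can read it at runtime instead:

```python
import os

# Hypothetical alternative: read the token from the environment
# (assumes you ran `export HF_TOKEN=...` before starting the script)
HF_TOKEN = os.environ["HF_TOKEN"]
```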

## Config
```python
# QLoRA config: 4-bit NF4 quantization with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)
```
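With `load_in_4bit=True` and `bnb_4bit_quant_type="nf4"`, the weights are stored in 4-bit NF4 while matrix multiplications run in `bfloat16`, cutting weight memory to roughly a quarter of an fp16 load; `bnb_4bit_use_double_quant=False` skips the additional quantization of the quantization constants.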

## Load
```python
# Load the model with the 4-bit quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, token=HF_TOKEN)
```
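`device_map="auto"` lets Accelerate place the quantized layers across the available GPUs (spilling to CPU memory if needed), and `trust_remote_code=True` allows the tokenizer to use any custom code shipped with the model repository.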

## Dataset
```python
# Read the evaluation dataset, accumulating lines until a complete
# JSON object (ending in "}") has been collected
datasets = []
with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
        item += line
        if item.endswith("}"):
            datasets.append(json.loads(item))
            item = ""
```
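The accumulate-until-`}` loop above also tolerates records that span multiple lines. If the file is strict JSONL (exactly one object per line), a simpler sketch under that assumption would be:

```python
# Simpler variant, assuming strict one-object-per-line JSONL
datasets = []
with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
    for line in f:
        line = line.strip()
        if line:
            datasets.append(json.loads(line))
```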

## Generate
```python
results = []
for data in tqdm(datasets):
    input = data["input"]

    # Prompt format used at fine-tuning time
    # ("指示" = instruction, "回答" = answer)
    prompt = f"""### 指示
{input}
### 回答:
"""

    tokenized_input = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            tokenized_input,
            max_new_tokens=100,
            do_sample=False,
            repetition_penalty=1.2,
        )[0]
    output = tokenizer.decode(outputs[tokenized_input.size(1):], skip_special_tokens=True)

    results.append({"task_id": data["task_id"], "input": input, "output": output})
```
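`do_sample=False` makes decoding deterministic (greedy), and `repetition_penalty=1.2` discourages the repetition loops greedy decoding can fall into; raise `max_new_tokens` if answers get cut off.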

## Output
```python
import re

# Strip the organization prefix from the model ID for the filename
model_name = re.sub(".*/", "", model_name)
with open(f"./{model_name}-outputs.jsonl", "w", encoding="utf-8") as f:
    for result in results:
        json.dump(result, f, ensure_ascii=False)  # keep non-ASCII characters readable
        f.write("\n")
```
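Each line of the resulting `llm-jp-3-13b-finetune-v101-outputs.jsonl` holds one JSON object with `task_id`, `input`, and `output` keys.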

# References
This model was fine-tuned on the dataset below. We are grateful to the data providers.
(https://liat-aip.sakura.ne.jp/wp/llmのための日本語インストラクションデータ作成/llmのための日本語インストラクションデータ-公開/)

関根聡, 安藤まや, 後藤美知子, 鈴木久美, 河原大輔, 井之上直也, 乾健太郎. ichikara-instruction: LLMのための日本語インストラクションデータの構築. 言語処理学会第30回年次大会 (2024).