---
datasets:
- IlyaGusev/ru_turbo_alpaca
- IlyaGusev/ru_turbo_saiga
- IlyaGusev/ru_sharegpt_cleaned
- IlyaGusev/oasst1_ru_main_branch
- IlyaGusev/ru_turbo_alpaca_evol_instruct
- lksy/ru_instruct_gpt4
language:
- ru
pipeline_tag: conversational
license: cc-by-4.0
---

# Saiga 2 7B, Russian LLaMA-2-based chatbot

Based on [LLaMA-2 7B HF](https://huggingface.co/meta-llama/Llama-2-7b-hf).

* This is an adapter-only version.

Training code: [link](https://github.com/IlyaGusev/rulm/tree/master/self_instruct)

WARNING: Run with the development versions of `transformers` and `peft`!

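One way to satisfy that requirement is to install both libraries from their main branches; this is a sketch, not the only option, and you may prefer to pin to the first stable releases that include LLaMA-2 support instead:

```shell
# Development versions of transformers and peft, straight from GitHub
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install --upgrade git+https://github.com/huggingface/peft.git
# bitsandbytes is needed for load_in_8bit, sentencepiece for the slow LLaMA tokenizer
pip install bitsandbytes sentencepiece
```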
```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

MODEL_NAME = "IlyaGusev/saiga2_7b_lora"
DEFAULT_MESSAGE_TEMPLATE = "<s>{role}\n{content}</s>\n"
DEFAULT_SYSTEM_PROMPT = "Ты — Сайга, русскоязычный автоматический ассистент. Ты разговариваешь с людьми и помогаешь им."


class Conversation:
    def __init__(
        self,
        message_template=DEFAULT_MESSAGE_TEMPLATE,
        system_prompt=DEFAULT_SYSTEM_PROMPT,
        start_token_id=1,
        bot_token_id=9225
    ):
        self.message_template = message_template
        self.start_token_id = start_token_id
        self.bot_token_id = bot_token_id
        self.messages = [{
            "role": "system",
            "content": system_prompt
        }]

    def get_start_token_id(self):
        return self.start_token_id

    def get_bot_token_id(self):
        return self.bot_token_id

    def add_user_message(self, message):
        self.messages.append({
            "role": "user",
            "content": message
        })

    def add_bot_message(self, message):
        self.messages.append({
            "role": "bot",
            "content": message
        })

    def get_prompt(self, tokenizer):
        final_text = ""
        for message in self.messages:
            message_text = self.message_template.format(**message)
            final_text += message_text
        # Open the bot turn so generation continues as the assistant's reply.
        final_text += tokenizer.decode([self.start_token_id, self.bot_token_id])
        return final_text.strip()


def generate(model, tokenizer, prompt, generation_config):
    data = tokenizer(prompt, return_tensors="pt")
    data = {k: v.to(model.device) for k, v in data.items()}
    output_ids = model.generate(
        **data,
        generation_config=generation_config
    )[0]
    # Keep only the newly generated tokens, dropping the prompt.
    output_ids = output_ids[len(data["input_ids"][0]):]
    output = tokenizer.decode(output_ids, skip_special_tokens=True)
    return output.strip()


# Load the base model in 8-bit and apply the LoRA adapter on top of it.
config = PeftConfig.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(
    model,
    MODEL_NAME,
    torch_dtype=torch.float16
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
print(generation_config)

inputs = ["Почему трава зеленая?", "Сочини длинный рассказ, обязательно упоминая следующие объекты. Дано: Таня, мяч"]
for inp in inputs:
    conversation = Conversation()
    conversation.add_user_message(inp)
    prompt = conversation.get_prompt(tokenizer)
    output = generate(model, tokenizer, prompt, generation_config)
    print(inp)
    print(output)
    print()
    print("==============================")
    print()
```
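To make the chat format concrete, here is a minimal sketch of the prompt layout that `get_prompt` produces. It uses only string formatting, so no model or tokenizer is needed; the literal `"<s>bot"` suffix is an assumption about what `tokenizer.decode([1, 9225])` yields for the LLaMA tokenizer.

```python
# Minimal sketch of the prompt layout produced by Conversation.get_prompt.
# The trailing "<s>bot" is written literally here; in the real code it comes
# from tokenizer.decode([start_token_id, bot_token_id]).
MESSAGE_TEMPLATE = "<s>{role}\n{content}</s>\n"

def render_prompt(messages):
    # Render each message through the template, then open the bot turn so
    # generation continues as the assistant's reply.
    text = "".join(MESSAGE_TEMPLATE.format(**m) for m in messages)
    return (text + "<s>bot").strip()

example = [
    {"role": "system", "content": "Ты — Сайга, русскоязычный автоматический ассистент."},
    {"role": "user", "content": "Почему трава зеленая?"},
]
print(render_prompt(example))
# <s>system
# Ты — Сайга, русскоязычный автоматический ассистент.</s>
# <s>user
# Почему трава зеленая?</s>
# <s>bot
```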

Examples:
```
User: Почему трава зеленая?
Saiga:
```

```
User: Сочини длинный рассказ, обязательно упоминая следующие объекты. Дано: Таня, мяч
Saiga:

```

v1:
- dataset code revision 7712a061d993f61c49b1e2d992e893c48acb3a87
- wandb [link](https://wandb.ai/ilyagusev/rulm_self_instruct/runs/innzu7g8)
- 7 datasets: ru_turbo_alpaca, ru_turbo_saiga, ru_sharegpt_cleaned, oasst1_ru_main_branch, gpt_roleplay_realm, ru_turbo_alpaca_evol_instruct (iteration 1/2), ru_instruct_gpt4
- Datasets merging script: [create_chat_set.py](https://github.com/IlyaGusev/rulm/blob/e4238fd9a196405b566a2d5838ab44b7a0f4dc31/self_instruct/src/data_processing/create_chat_set.py)
- saiga7b vs saiga2_7b: 78-8-90