---
license: other
license_name: llama-3
license_link: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/raw/main/LICENSE

base_model: meta-llama/Meta-Llama-3-8B-Instruct
tags:
- generated_from_trainer
model-index:
- name: lightblue/suzume-llama-3-8B-multilingual
  results: []
---

<p align="center">
<img width=400 src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/kg3QjQOde0X743csGJT-f.png" alt="Suzume - a Japanese tree sparrow"/>
</p>

# Suzume

This is Suzume 8B, a multilingual fine-tune of Llama 3.

Llama 3 has exhibited excellent performance on many English-language benchmarks.
However, it also appears to have been fine-tuned mostly on English data, meaning that it will respond in English even when prompted in other languages.

We have fine-tuned Llama 3 on almost 90,000 multilingual conversations, meaning that this model has the smarts of Llama 3 plus the added ability to chat in more languages.

Please feel free to comment on this model and give us feedback in the Community tab!

# How to use

The easiest way to use this model on your own computer is to use the [GGUF version of this model (lightblue/suzume-llama-3-8B-multilingual-gguf)](https://huggingface.co/lightblue/suzume-llama-3-8B-multilingual-gguf) with a program such as [jan.ai](https://jan.ai/) or [LM Studio](https://lmstudio.ai/).

If you want to use this model directly in Python, we recommend using vLLM for the fastest inference speeds.

```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.0, max_tokens=100)
llm = LLM(model="lightblue/suzume-llama-3-8B-multilingual")

messages = []
messages.append({"role": "user", "content": "Bonjour!"})
prompt = llm.llm_engine.tokenizer.tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=False)
prompts = [prompt]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
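
Alternatively, the model should also work with plain Hugging Face `transformers`. Below is a minimal sketch, assuming bf16 weights fit on a single GPU and using the chat template stored in the tokenizer; adjust `max_new_tokens` and the dtype/device settings to your setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lightblue/suzume-llama-3-8B-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build the prompt with the chat template stored in the tokenizer
messages = [{"role": "user", "content": "Bonjour!"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding, mirroring temperature=0.0 in the vLLM example above
outputs = model.generate(input_ids, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```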

# Evaluation scores

We achieve the following MT-Bench scores across 6 languages:

| **Language** | **meta-llama/Meta-Llama-3-8B-Instruct** | **lightblue/suzume-llama-3-8B-multilingual** | **Nexusflow/Starling-LM-7B-beta** | **gpt-3.5-turbo** |
|-----------------|-----------------------------------------|----------------------------------------------|-----------------------------------|-------------------|
| **German** 🇩🇪 | NaN | 7.26 | 6.99 | 7.68 |
| **French** 🇫🇷 | NaN | 7.66 | 7.29 | 7.74 |
| **Japanese** 🇯🇵 | NaN | 6.56 | 6.22 | 7.84 |
| **Russian** 🇷🇺 | NaN | 8.19 | 8.28 | 7.94 |
| **Chinese** 🇨🇳 | NaN | 7.11 | 6.97 | 7.55 |
| **English** 🇺🇸 | 7.98 | 7.73 | 7.92 | 8.26 |

We observe minimal degradation of Llama 3's English ability while achieving best-in-class multilingual abilities compared to the top-rated 7B model ([Nexusflow/Starling-LM-7B-beta](https://huggingface.co/Nexusflow/Starling-LM-7B-beta)) on the [Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard).

[Here is our evaluation script.](https://drive.google.com/file/d/15HPn7452t8LbTD9HKSl7ngYYWnsoOG08/view?usp=sharing)

# Training data

We train on three sources of data to create this model:

* [lightblue/tagengo-gpt4](https://huggingface.co/datasets/lightblue/tagengo-gpt4) - 76,338 conversations
  * A diverse dataset of initial inputs sampled from [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) and then used to prompt `gpt-4-0125-preview`
* [megagonlabs/instruction_ja](https://github.com/megagonlabs/instruction_ja) - 669 conversations
  * A hand-edited dataset of nearly 700 Japanese conversations, originally taken from translations of the [kunishou/hh-rlhf-49k-ja](https://huggingface.co/datasets/kunishou/hh-rlhf-49k-ja) dataset.
* [openchat/openchat_sharegpt4_dataset](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/resolve/main/sharegpt_gpt4.json) - 6,206 conversations
  * Multilingual conversations of humans talking to GPT-4.


<details><summary>We prepare our data like so:</summary>

```python
import pandas as pd
from datasets import Dataset, load_dataset, concatenate_datasets

### Tagengo
gpt4_dataset = load_dataset("lightblue/tagengo-gpt4", split="train")
gpt4_dataset = gpt4_dataset.filter(lambda x: x["response"][1] == "stop")
####

### Megagon
megagon_df = pd.read_json(
    "https://raw.githubusercontent.com/megagonlabs/instruction_ja/main/data/data.jsonl",
    lines=True,
    orient="records"
)
role_map = {"user": "human", "agent": "gpt"}
megagon_df["conversations"] = megagon_df.utterances.apply(lambda x: [{"from": role_map[y["name"]], "value": y["text"]} for y in x])
megagon_df["language"] = "Japanese"
megagon_df = megagon_df[["conversations", "language"]]
megagon_dataset = Dataset.from_pandas(megagon_df)
###

### Openchat
openchat_df = pd.read_json("https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/resolve/main/sharegpt_gpt4.json?download=true")
openchat_df["conversations"] = openchat_df["items"]
openchat_dataset = Dataset.from_pandas(openchat_df)
###


dataset = concatenate_datasets([gpt4_dataset, megagon_dataset, openchat_dataset])
dataset = dataset.filter(lambda x: not any([y["value"] is None for y in x["conversations"]]))
dataset.select_columns(["conversations"]).to_json("/workspace/llm_training/axolotl/llama3-multilingual/tagengo_openchat_megagon.json")
```

</details>
<br/>
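
As a quick sanity check, the exported JSON can be loaded back with `datasets`; a minimal sketch, assuming the same output path as in the preparation script above:

```python
from datasets import load_dataset

# Reload the exported ShareGPT-style JSON-lines file and count conversations
check = load_dataset(
    "json",
    data_files="/workspace/llm_training/axolotl/llama3-multilingual/tagengo_openchat_megagon.json",
    split="train",
)
print(len(check))                     # total number of conversations
print(check[0]["conversations"][:2])  # first turns of the first conversation
```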

# Training details

This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on the dataset described above.
It achieves the following results on the evaluation set:
- Loss: 0.6595


## Training procedure

[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.4.0`
```yaml
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer # PreTrainedTokenizerFast

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: /workspace/llm_training/axolotl/llama3-multilingual/tagengo_openchat_megagon.json
    ds_type: json # see other options below
    type: sharegpt
    conversation: llama-3
dataset_prepared_path: /workspace/llm_training/axolotl/llama3-multilingual/prepared_tagengo_openchat_megagon
val_set_size: 0.01
output_dir: /workspace/llm_training/axolotl/llama3-multilingual/output_tagengo_openchat_megagon_8B_llama3

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

use_wandb: true
wandb_project: wandb_project
wandb_entity: wandb_entity
wandb_name: wandb_name

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 1e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 5
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: /workspace/axolotl/deepspeed_configs/zero2.json
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
```

</details><br>

<details><summary>Note: we added this Llama 3 template to FastChat directly, as the Llama 3 chat template was not supported when we trained this model.</summary>

```python
from fastchat.conversation import Conversation
from fastchat.conversation import register_conv_template
from fastchat.conversation import SeparatorStyle

register_conv_template(
    Conversation(
        name="llama-3",
        system_template="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}",
        roles=("<|start_header_id|>user<|end_header_id|>\n", "<|start_header_id|>assistant<|end_header_id|>\n"),
        sep_style=SeparatorStyle.ADD_NEW_LINE_SINGLE,
        sep="<|eot_id|>",
        stop_token_ids=[128009],
        stop_str="<|eot_id|>",
    )
)
```
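
Once the template above is registered, FastChat's `get_conv_template` can be used to render a prompt with it; a minimal sketch:

```python
from fastchat.conversation import get_conv_template

# Render a single user turn with the "llama-3" template registered above
conv = get_conv_template("llama-3")
conv.append_message(conv.roles[0], "Bonjour!")  # user turn
conv.append_message(conv.roles[1], None)        # empty assistant slot, so the prompt ends ready for generation
print(conv.get_prompt())
```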

</details><br>


### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 1

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.1894 | 0.0 | 1 | 1.0110 |
| 0.8493 | 0.2 | 73 | 0.7057 |
| 0.8047 | 0.4 | 146 | 0.6835 |
| 0.7644 | 0.6 | 219 | 0.6687 |
| 0.7528 | 0.8 | 292 | 0.6615 |
| 0.7794 | 1.0 | 365 | 0.6595 |


### Framework versions

- Transformers 4.38.2
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.0