---
license: llama3
base_model: meta-llama/Meta-Llama-3-70B-Instruct
tags:
- generated_from_trainer
model-index:
- name: outputs/basemodel-llama3-70b.8e6
  results: []
datasets:
- augmxnt/ultra-orca-boros-en-ja-v1
---

# shisa-v2 Base Model ablation

*Per the Llama 3 Community License Agreement, the official name of this model is "Llama 3 shisa-v1-llama3-70b"*

This is a fine-tune of Llama 3 70B Instruct on the primary `shisa-v1` dataset, aimed at improving its Japanese language capabilities.

This model uses an LR of 8e-6, which slightly improves performance over the initial 2e-5 tune (based on, and validating the predictive power of, the Llama 3 8B LR ablation results).

It also uses NEFTune, although the expected impact is negligible for this dataset.
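
For reference, a minimal sketch of the NEFTune idea (not the trainer's actual implementation; axolotl/Transformers apply it internally via the `neftune_noise_alpha` setting): uniform noise is added to the input embeddings during training, scaled by `alpha / sqrt(seq_len * hidden_dim)`.

```python
import torch

def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune noise to a (batch, seq_len, hidden_dim) embedding tensor.

    Noise is sampled uniformly from [-mag, mag], with
    mag = alpha / sqrt(seq_len * hidden_dim), per the NEFTune paper.
    Applied only during training; inference uses clean embeddings.
    """
    seq_len, hidden_dim = embeds.shape[1], embeds.shape[2]
    mag = alpha / (seq_len * hidden_dim) ** 0.5
    return embeds + torch.empty_like(embeds).uniform_(-mag, mag)
```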

While the 2e-5 model matched gpt-3.5-turbo performance, this 8e-6 version consistently edges it out, so I think it's fair to say that this model "beats" it.

While this is merely a test ablation on the road to `shisa-v2`, as of its release (mid-May 2024) it is the strongest commercially usable open JA model benchmarked so far, so this model may be of general interest.
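
A minimal loading sketch with 🤗 Transformers (illustrative settings, not a tested recipe; the model id follows this card's naming, and a 70B model in bf16 requires multiple GPUs or offloading):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/shisa-v1-llama3-70b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bf16
    device_map="auto",           # shard across available GPUs
)

# The model uses the standard Llama 3 chat template.
chat = [{"role": "user", "content": "日本の四国地方にある県を教えてください。"}]
input_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```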


## Performance
Measured using a [fork](https://github.com/shisa-ai/shaberi) of [Lightblue's Shaberi benchmark framework](https://github.com/lightblue-tech/japanese_llm_eval):

| Model                                  | Average | ELYZA-tasks-100 | MT-Bench | Rakuda | Tengu-Bench |
|----------------------------------------|---------|-----------------|----------|--------|-------------|
| gpt-4-turbo-2024-04-09                 | 8.75    | 8.78            | 8.74     | 9.18   | 8.31        |
| gpt-4o-2024-05-13                      | 8.72    | 8.88            | 8.69     | 9.15   | 8.16        |
| gemini-1.5-pro                         | 8.58    | 8.58            | 8.93     | 9.20   | 7.61        |
| claude-3-opus-20240229                 | 8.55    | 8.64            | 8.58     | 8.75   | 8.23        |
| CohereForAI/c4ai-command-r-plus        | 7.69    | 7.50            | 7.43     | 9.05   | 6.79        |
| **shisa-ai/shisa-v1-llama3-70b**       | **7.30**| **7.34**        | **7.67** | **8.15** | **6.04**  |
| gpt-3.5-turbo-0125                     | 7.17    | 7.24            | 6.98     | 7.64   | 6.82        |
| **shisa-ai/shisa-v1-llama3-70b.2e5**   | **7.17**| **7.16**        | **7.45** | **7.98** | **6.09**  |
| karakuri-ai/karakuri-lm-8x7b-chat-v0.1 | 7.00    | 7.18            | 6.30     | 7.98   | 6.55        |
| karakuri-ai/karakuri-lm-70b-chat-v0.1  | 6.84    | 6.86            | 6.43     | 7.85   | 6.23        |
| lightblue/ao-karasu-72B                | 6.81    | 7.19            | 6.54     | 7.25   | 6.27        |
| **shisa-ai/shisa-v1-llama3-8b**        | **6.59**| **6.67**        | **6.95** | **7.05**| **5.68**   |
| microsoft/Phi-3-medium-128k-instruct   | 6.48    | 7.10            | 5.92     | 6.84   | 6.04        | 
| **shisa-ai/shisa-v1-swallowmx-13a47b** | **6.17**| **6.48**        | **6.07** | **7.11**| **5.03**   |
| lightblue/suzume-llama-3-8B-japanese   | 5.96    | 6.68            | 4.96     | 6.68   | 5.53        |
| augmxnt/shisa-gamma-7b-v1              | 5.82    | 5.96            | 5.02     | 6.85   | 5.47        |
| **shisa-ai/shisa-v1-phi3-14b**         | **5.77**| **6.28**        | **5.26** | **6.55**| **5.01**   |
| **shisa-ai/shisa-v1-gemma-8b**         | **5.64**| **6.50**        | **5.42** | **5.10**| **5.55**   |
| Rakuten/RakutenAI-7B-chat              | 5.58    | 5.92            | 4.60     | 6.58   | 5.24        |
| lightblue/qarasu-14B-chat-plus-unleashed | 5.20  | 5.58            | 4.74     | 5.46   | 5.01        |
| **shisa-ai/shisa-v1-mistral0.3-7b**    | **5.11**| **5.64**        | **6.10** | **3.83**|**4.86**    |
| cyberagent/calm2-7b-chat               | 4.76    | 4.90            | 3.58     | 5.75   | 4.81        |
| mistralai/Mistral-7B-Instruct-v0.2     | 4.69    | 5.78            | 4.65     | 3.80   | 4.53        |
| **shisa-ai/shisa-v1-yi1.5-9b**         | **4.63**| **5.98**        | **4.28** | **3.26**|**5.00**    |
| augmxnt/shisa-7b-v1                    | 4.50    | 4.63            | 3.95     | 4.89   | 4.53        |


[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.4.0`
```yaml
base_model: meta-llama/Meta-Llama-3-70B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

# doesn't work...
# hub_model_id: shisa-ai/shisa-llama3-70b-v1
# hub_strategy: end

use_wandb: true
wandb_project: shisa-v2
wandb_entity: augmxnt
wandb_name: shisa-llama3-70b-v1.8e6

chat_template: llama3
datasets:
  - path: augmxnt/ultra-orca-boros-en-ja-v1
    type: sharegpt
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/basemodel-llama3-70b.8e6

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

neftune_noise_alpha: 5

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: linear
learning_rate: 8e-6

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 0
debug:
deepspeed: axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.05
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>

```

</details><br>
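
For context, axolotl's `sharegpt` dataset type (used in the config above) consumes conversation records shaped roughly like the following. This is an illustrative record, not an actual row from the dataset:

```python
# Hedged sketch of a ShareGPT-style record as consumed by axolotl's
# `sharegpt` dataset type; the content here is made up for illustration.
example_record = {
    "conversations": [
        {"from": "human", "value": "富士山の高さはどれくらいですか？"},
        {"from": "gpt", "value": "富士山の標高は約3,776メートルです。"},
    ]
}
```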

# outputs/basemodel-llama3-70b.8e6

This model is a fine-tuned version of [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) on the [augmxnt/ultra-orca-boros-en-ja-v1](https://huggingface.co/datasets/augmxnt/ultra-orca-boros-en-ja-v1) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.4440

## Model description

A full fine-tune of [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) on the `shisa-v1` dataset, aimed at improving Japanese language capabilities. See the summary and benchmarks at the top of this card.

## Intended uses & limitations

Japanese and English chat and instruction following. Use is subject to the Llama 3 Community License Agreement. Note that this is a test ablation on the road to `shisa-v2`; see the notes at the top of this card.

## Training and evaluation data

Trained on [augmxnt/ultra-orca-boros-en-ja-v1](https://huggingface.co/datasets/augmxnt/ultra-orca-boros-en-ja-v1), with a 5% held-out validation split (`val_set_size: 0.05` in the axolotl config above).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 8e-6
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 16
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 87
- num_epochs: 3
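
The reported total batch sizes follow directly from the per-device settings above; a quick sanity check:

```python
micro_batch_size = 2             # train_batch_size per device
gradient_accumulation_steps = 2
num_devices = 16

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
total_eval_batch_size = micro_batch_size * num_devices  # no accumulation at eval

assert total_train_batch_size == 64
assert total_eval_batch_size == 32
```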

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.248         | 0.0033 | 1    | 0.7102          |
| 0.7497        | 0.5008 | 154  | 0.4374          |
| 0.7229        | 1.0016 | 308  | 0.3940          |
| 0.3772        | 1.4862 | 462  | 0.3962          |
| 0.3791        | 1.9870 | 616  | 0.3838          |
| 0.0943        | 2.4699 | 770  | 0.4440          |


### Framework versions

- Transformers 4.40.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1