shisa-v2 Base Model ablation

Benchmark scores, measured with a fork of Lightblue's Shaberi benchmark framework:

Model	Average	ELYZA-tasks-100	MT-Bench	Rakuda	Tengu-Bench
gpt-4-turbo-2024-04-09	8.75	8.78	8.74	9.18	8.31
CohereForAI/c4ai-command-r-plus	7.69	7.50	7.43	9.05	6.79
gpt-3.5-turbo-0125	7.17	7.24	6.98	7.64	6.82
shisa-ai/shisa-v1-llama3-70b	7.17	7.16	7.45	7.98	6.09
karakuri-ai/karakuri-lm-70b-chat-v0.1	6.84	6.86	6.43	7.85	6.23
lightblue/ao-karasu-72B	6.81	7.19	6.54	7.25	6.27
shisa-ai/shisa-v1-llama3-8b^	6.29	6.62	6.41	7.05	5.07
shisa-ai/shisa-swallowmx-13a47b-v1	6.17	6.48	6.07	7.11	5.03
shisa-ai/shisa-v1-llama3-8b	6.10	6.52	6.20	6.37	5.33
shisa-ai/shisa-v1-gemma-8b	5.64	6.50	5.42	5.10	5.55
Rakuten/RakutenAI-7B-chat	5.58	5.92	4.60	6.58	5.24
augmxnt/shisa-gamma-7b-v1	5.56	5.84	4.00	6.73	5.68
lightblue/qarasu-14B-chat-plus-unleashed	5.20	5.58	4.74	5.46	5.01
cyberagent/calm2-7b-chat	4.76	4.90	3.58	5.75	4.81
mistralai/Mistral-7B-Instruct-v0.2	4.69	5.78	4.65	3.80	4.53
shisa-ai/shisa-v1-yi1.5-9b	4.63	5.98	4.28	3.26	5.00

^ Sampler settings: temperature 0.2, min_p 0.1, frequency_penalty 0.5
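
As a rough illustration of what the three sampler settings in the footnote do, here is a minimal pure-Python sketch. The function name `filtered_distribution` and the order of operations (frequency penalty, then temperature, then min_p) are assumptions for illustration; the source does not say which inference engine was used, and engines differ in how they order these steps.

```python
import math

def filtered_distribution(logits, generated_ids,
                          temperature=0.2, min_p=0.1, frequency_penalty=0.5):
    """Return next-token probabilities after applying the three settings.

    Hypothetical helper: the step ordering here is an assumption.
    """
    # Frequency penalty: subtract penalty * (times the token was already generated).
    counts = {}
    for token_id in generated_ids:
        counts[token_id] = counts.get(token_id, 0) + 1
    penalized = [
        logit - frequency_penalty * counts.get(i, 0)
        for i, logit in enumerate(logits)
    ]

    # Temperature: dividing by a value below 1 sharpens the distribution.
    scaled = [logit / temperature for logit in penalized]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # min_p: drop tokens whose probability is below min_p * (top probability).
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalize the surviving tokens.
    norm = sum(kept)
    return [p / norm for p in kept]


if __name__ == "__main__":
    # Toy 3-token vocabulary; token 0 has already been generated twice.
    print(filtered_distribution([2.0, 1.9, 0.0], generated_ids=[0, 0]))
```

With a low temperature and min_p together, most of the tail is removed, and the frequency penalty pushes the choice away from tokens that were already repeated.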

See axolotl config

axolotl version: 0.4.0

base_model: tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: inst
datasets:
  - path: augmxnt/ultra-orca-boros-en-ja-v1
    type: sharegpt
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/basemodel-swallowmx-8x22b

model_config:
  output_router_logits: true

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

use_wandb: true
wandb_project: shisa-v2
wandb_entity: augmxnt
wandb_name: shisa-swallowmx-13a47b-v1

global_batch_size: 1
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 3
# https://github.com/huggingface/transformers/issues/22101
# https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L141
optimizer: paged_adamw_8bit
lr_scheduler: linear
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

outputs/basemodel-swallowmx-8x22b

This model is a fine-tuned version of tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 on the augmxnt/ultra-orca-boros-en-ja-v1 dataset (the dataset listed in the axolotl config). It achieves the following results on the evaluation set:

Loss: 0.4443

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 1
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 8
total_train_batch_size: 64
total_eval_batch_size: 8
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 119
num_epochs: 3
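
The total batch sizes listed above follow from the per-device values. A small sketch of the arithmetic, using only the numbers in this list (the variable names are illustrative):

```python
# Values copied from the hyperparameter list above.
train_batch_size = 1             # per-device micro batch
eval_batch_size = 1              # per-device eval batch
num_devices = 8
gradient_accumulation_steps = 8

# Samples contributing to one optimizer step across all devices.
total_train_batch_size = (
    train_batch_size * num_devices * gradient_accumulation_steps
)

# Evaluation does not accumulate gradients, so only the device count multiplies.
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size, total_eval_batch_size)
```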

Training results

Training Loss	Epoch	Step	Validation Loss
0.5705	0.0022	1	0.5065
0.505	0.4993	229	0.3910
0.5258	0.9986	458	0.3654
0.2964	1.4835	687	0.3786
0.2923	1.9828	916	0.3669
0.1462	2.4682	1145	0.4429
0.1156	2.9676	1374	0.4443
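
In the table above, validation loss is lowest near the end of the first epoch and is higher at the end of training, while training loss keeps falling. The sketch below picks out the lowest-validation-loss row from the logged values. It only reads the table; whether an earlier checkpoint would score better on the benchmarks is not something the source reports.

```python
# (training_loss, epoch, step, validation_loss), copied from the table above.
rows = [
    (0.5705, 0.0022, 1, 0.5065),
    (0.5050, 0.4993, 229, 0.3910),
    (0.5258, 0.9986, 458, 0.3654),
    (0.2964, 1.4835, 687, 0.3786),
    (0.2923, 1.9828, 916, 0.3669),
    (0.1462, 2.4682, 1145, 0.4429),
    (0.1156, 2.9676, 1374, 0.4443),
]

# Row with the lowest validation loss.
best = min(rows, key=lambda row: row[3])

# How much validation loss rose between that row and the final evaluation.
final = rows[-1]
rise = final[3] - best[3]

print(f"best step: {best[2]} (epoch {best[1]}), val loss {best[3]:.4f}")
print(f"final val loss {final[3]:.4f}, rise of {rise:.4f}")
```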

Framework versions

Transformers 4.40.2
Pytorch 2.3.0+cu121
Datasets 2.19.1
Tokenizers 0.19.1

shisa-ai/shisa-v1-swallowmx-13a47b
