shisa-v2 Base Model ablation

Benchmark results, produced with a fork of Lightblue's Shaberi benchmark framework:

| Model | Average | ELYZA-tasks-100 | MT-Bench | Rakuda | Tengu-Bench |
|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09 | 8.75 | 8.78 | 8.74 | 9.18 | 8.31 |
| CohereForAI/c4ai-command-r-plus | 7.69 | 7.50 | 7.43 | 9.05 | 6.79 |
| gpt-3.5-turbo-0125 | 7.17 | 7.24 | 6.98 | 7.64 | 6.82 |
| shisa-ai/shisa-v1-llama3-70b | 7.17 | 7.16 | 7.45 | 7.98 | 6.09 |
| karakuri-ai/karakuri-lm-70b-chat-v0.1 | 6.84 | 6.86 | 6.43 | 7.85 | 6.23 |
| lightblue/ao-karasu-72B | 6.81 | 7.19 | 6.54 | 7.25 | 6.27 |
| shisa-ai/shisa-v1-llama3-8b^ | 6.29 | 6.62 | 6.41 | 7.05 | 5.07 |
| shisa-ai/shisa-swallowmx-13a47b-v1 | 6.17 | 6.48 | 6.07 | 7.11 | 5.03 |
| shisa-ai/shisa-v1-llama3-8b | 6.10 | 6.52 | 6.20 | 6.37 | 5.33 |
| Rakuten/RakutenAI-7B-chat | 5.58 | 5.92 | 4.60 | 6.58 | 5.24 |
| shisa-ai/shisa-v1-gemma-8b | 5.64 | 6.50 | 5.42 | 5.10 | 5.55 |
| augmxnt/shisa-gamma-7b-v1 | 5.56 | 5.84 | 4.00 | 6.73 | 5.68 |
| lightblue/qarasu-14B-chat-plus-unleashed | 5.20 | 5.58 | 4.74 | 5.46 | 5.01 |
| cyberagent/calm2-7b-chat | 4.76 | 4.90 | 3.58 | 5.75 | 4.81 |
| mistralai/Mistral-7B-Instruct-v0.2 | 4.69 | 5.78 | 4.65 | 3.80 | 4.53 |
| shisa-ai/shisa-v1-yi1.5-9b | 4.63 | 5.98 | 4.28 | 3.26 | 5.00 |

^ Sampler settings: temperature 0.2, min_p 0.1, frequency_penalty 0.5
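
The Average column appears to be the unweighted mean of the four benchmark scores. The card does not state this, so treat it as an assumption; the sketch below simply recomputes it for three rows copied from the table above. The helper name `average_score` is hypothetical.

```python
# Per-benchmark scores copied from the table above, in the order
# (ELYZA-tasks-100, MT-Bench, Rakuda, Tengu-Bench).
scores = {
    "gpt-4-turbo-2024-04-09": (8.78, 8.74, 9.18, 8.31),
    "shisa-ai/shisa-v1-llama3-70b": (7.16, 7.45, 7.98, 6.09),
    "shisa-ai/shisa-swallowmx-13a47b-v1": (6.48, 6.07, 7.11, 5.03),
}


def average_score(values):
    """Unweighted mean of the benchmark scores (assumed aggregation)."""
    return sum(values) / len(values)


averages = {model: average_score(vals) for model, vals in scores.items()}

for model, avg in averages.items():
    # Two decimals, matching the precision used in the table.
    print(f"{model}: {avg:.2f}")
```

Under this assumption the recomputed values agree with the published Average column to two decimal places for these rows.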

Built with Axolotl

Axolotl config used for this run:

axolotl version: 0.4.0

base_model: tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: inst
datasets:
  - path: augmxnt/ultra-orca-boros-en-ja-v1
    type: sharegpt
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/basemodel-swallowmx-8x22b

model_config:
  output_router_logits: true

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

use_wandb: true
wandb_project: shisa-v2
wandb_entity: augmxnt
wandb_name: shisa-swallowmx-13a47b-v1

global_batch_size: 1
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 3
# https://github.com/huggingface/transformers/issues/22101
# https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L141
optimizer: paged_adamw_8bit
lr_scheduler: linear
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

outputs/basemodel-swallowmx-8x22b

This model is a fine-tuned version of tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 on the augmxnt/ultra-orca-boros-en-ja-v1 dataset. It achieves the following results on the evaluation set:

  • Loss: 0.4443

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 119
  • num_epochs: 3
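
The total batch sizes listed above follow from the per-device settings. A minimal sketch of that arithmetic, using only values reported in the list (the variable names are illustrative, not Trainer internals):

```python
# Values taken from the hyperparameter list above.
train_batch_size = 1             # per-device micro batch
eval_batch_size = 1              # per-device eval batch
gradient_accumulation_steps = 8
num_devices = 8

# Sequences contributing to one optimizer step, summed over all devices.
total_train_batch_size = (
    train_batch_size * gradient_accumulation_steps * num_devices
)

# Evaluation does not accumulate gradients, so only the device count matters.
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size, total_eval_batch_size)
```

Because sample_packing is enabled with sequence_len 4096, each of those sequences can hold several packed training examples, so the number of conversations per step may be higher than the sequence count.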

Training results

| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.5705 | 0.0022 | 1 | 0.5065 |
| 0.505 | 0.4993 | 229 | 0.3910 |
| 0.5258 | 0.9986 | 458 | 0.3654 |
| 0.2964 | 1.4835 | 687 | 0.3786 |
| 0.2923 | 1.9828 | 916 | 0.3669 |
| 0.1462 | 2.4682 | 1145 | 0.4429 |
| 0.1156 | 2.9676 | 1374 | 0.4443 |
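
One way to read the log above is to find the evaluation point with the lowest validation loss and compare it with the final checkpoint. The sketch below does that with the logged values; the `EvalPoint` type is a hypothetical helper, and this is only an illustration of reading the table, not a statement about which checkpoint was released.

```python
from typing import NamedTuple


class EvalPoint(NamedTuple):
    train_loss: float
    epoch: float
    step: int
    val_loss: float


# Rows copied from the training results above.
log = [
    EvalPoint(0.5705, 0.0022, 1, 0.5065),
    EvalPoint(0.505, 0.4993, 229, 0.3910),
    EvalPoint(0.5258, 0.9986, 458, 0.3654),
    EvalPoint(0.2964, 1.4835, 687, 0.3786),
    EvalPoint(0.2923, 1.9828, 916, 0.3669),
    EvalPoint(0.1462, 2.4682, 1145, 0.4429),
    EvalPoint(0.1156, 2.9676, 1374, 0.4443),
]

# Evaluation point with the lowest validation loss.
best = min(log, key=lambda p: p.val_loss)
final = log[-1]

print(f"best:  step {best.step} (epoch {best.epoch}), val_loss {best.val_loss}")
print(f"final: step {final.step} (epoch {final.epoch}), val_loss {final.val_loss}")
print(f"val_loss increase from best to final: {final.val_loss - best.val_loss:.4f}")
```

In this log the validation loss is lowest near the end of the first epoch and rises during the third while the training loss keeps falling, a pattern that is often associated with overfitting. Validation loss does not necessarily track the benchmark scores reported earlier.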

Framework versions

  • Transformers 4.40.2
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1

Model size: 46.7B params (Safetensors, tensor type BF16)