---
base_model: BeaverAI/Theia-21B-v2a
tags:
  - axolotl
  - generated_from_trainer
model-index:
  - name: Theia-21B-v2b-WS
    results: []
---

Built with Axolotl

See axolotl config

axolotl version: 0.4.1

```yaml
base_model: BeaverAI/Theia-21B-v2a
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false
sequence_len: 16384
bf16: auto
fp16:
tf32: false
flash_attention: true
special_tokens:
    pad_token: "<pad>"
tokens:
    - "<|im_start|>"
    - "<|im_end|>"

# PLAN
# instruct = mistral
# siayn = alpaca
# RP = chatml

# Data
datasets:
  - path: BeaverAI/saosauce-v2-creative-only
    type: sharegpt
    conversation: chatml
warmup_steps: 150

save_safetensors: true

mlflow_tracking_uri: http://127.0.0.1:7860
mlflow_experiment_name: Default
# WandB
#wandb_project: theia
#wandb_entity:

# Iterations
num_epochs: 2

# Output
output_dir: ./Theia-21B-v2b-Workspace
hub_model_id: BeaverAI/Theia-21B-v2b-WS
hub_strategy: "all_checkpoints"

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 1
micro_batch_size: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
   use_reentrant: true

# Evaluation
val_set_size: 0.025
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 256
eval_sample_packing: false
eval_batch_size: 1

# Optimizer
optimizer: paged_adamw_8bit # adamw_8bit
learning_rate: 0.0000025
lr_scheduler: cosine_with_min_lr
lr_scheduler_kwargs:
    min_lr: 0.00000025
weight_decay: 0.01
max_grad_norm: 10.0

# Misc
train_on_inputs: false
group_by_length: false
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
debug:
deepspeed: deepspeed_configs/zero3_bf16.json # previously blank
fsdp:
fsdp_config:

# Checkpoints
resume_from_checkpoint:
saves_per_epoch: 1
```
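
The schedule in the config combines a 150-step warmup with cosine decay from the 2.5e-6 peak down to a 2.5e-7 floor. A minimal sketch of that shape, assuming the usual warmup-then-cosine-with-floor form rather than the exact `cosine_with_min_lr` implementation, and a total step count inferred from the results table (step 332 ≈ epoch 1.0):

```python
import math

# Values taken from the config above.
PEAK_LR = 2.5e-6
MIN_LR = 2.5e-7
WARMUP_STEPS = 150
TOTAL_STEPS = 664  # assumption: ~332 optimizer steps per epoch x 2 epochs, per the results table

def lr_at(step: int) -> float:
    """Approximate LR at a given optimizer step: linear warmup, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

for s in (0, 75, 150, 332, 664):
    print(f"step {s:>4}: lr = {lr_at(s):.2e}")
```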

Theia-21B-v2b-WS

This model is a fine-tuned version of BeaverAI/Theia-21B-v2a on the BeaverAI/saosauce-v2-creative-only dataset (ShareGPT data rendered with the ChatML conversation template, per the axolotl config above). It achieves the following results on the evaluation set:

  • Loss: 1.0491
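
A minimal inference sketch for loading the checkpoint with transformers and prompting in the ChatML format used during training (the `<|im_start|>`/`<|im_end|>` tokens registered in the config). The prompt is assembled by hand because the card does not state whether a chat template is bundled with the tokenizer; the generation settings and example prompt are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BeaverAI/Theia-21B-v2b-WS"  # hub_model_id from the config above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the config trains in bf16
    device_map="auto",
)

# ChatML-style prompt, matching the <|im_start|>/<|im_end|> tokens added in training.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a short scene set in a lighthouse.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```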

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2.5e-06
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 8
  • total_eval_batch_size: 8
  • optimizer: paged AdamW (8-bit) with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine_with_min_lr
  • lr_scheduler_warmup_steps: 150
  • num_epochs: 2
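
The total batch sizes above follow directly from the per-device settings in the config; a one-line check of that arithmetic:

```python
micro_batch_size = 1             # per-device batch size from the config
gradient_accumulation_steps = 1  # from the config
num_devices = 8                  # multi-GPU run reported above

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)    # 8, matching the value reported above
```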

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.5956        | 0.0030 | 1    | 1.5014          |
| 1.2309        | 0.2508 | 83   | 1.2064          |
| 1.1063        | 0.5015 | 166  | 1.1376          |
| 1.1885        | 0.7523 | 249  | 1.1021          |
| 1.1518        | 1.0030 | 332  | 1.0772          |
| 0.9328        | 1.2356 | 415  | 1.0693          |
| 0.9149        | 1.4864 | 498  | 1.0569          |
| 0.9377        | 1.7372 | 581  | 1.0491          |

Framework versions

  • Transformers 4.44.0.dev0
  • PyTorch 2.2.2
  • Datasets 2.19.1
  • Tokenizers 0.19.1