---
base_model: BeaverAI/Theia-21B-v2a
tags:
  - axolotl
  - generated_from_trainer
model-index:
  - name: Theia-21B-v2b-WS
    results: []
---

Built with Axolotl

See axolotl config

axolotl version: 0.4.1

```yaml
base_model: BeaverAI/Theia-21B-v2a
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false
sequence_len: 16384
bf16: auto
fp16:
tf32: false
flash_attention: true
special_tokens:
    pad_token: "<pad>"
tokens:
    - "<|im_start|>"
    - "<|im_end|>"

# PLAN
# instruct = mistral
# siayn = alpaca
# RP = chatml

# Data
datasets:
  - path: BeaverAI/saosauce-v2-creative-only
    type: sharegpt
    conversation: chatml
warmup_steps: 150

save_safetensors: true

mlflow_tracking_uri: http://127.0.0.1:7860
mlflow_experiment_name: Default
# WandB
#wandb_project: theia
#wandb_entity:

# Iterations
num_epochs: 2

# Output
output_dir: ./Theia-21B-v2b-Workspace
hub_model_id: BeaverAI/Theia-21B-v2b-WS
hub_strategy: "all_checkpoints"

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 1
micro_batch_size: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
   use_reentrant: true

# Evaluation
val_set_size: 0.025
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 256
eval_sample_packing: false
eval_batch_size: 1

# Optimizer
optimizer: paged_adamw_8bit # adamw_8bit
learning_rate: 0.0000025
lr_scheduler: cosine_with_min_lr
lr_scheduler_kwargs:
    min_lr: 0.00000025
weight_decay: 0.01
max_grad_norm: 10.0

# Misc
train_on_inputs: false
group_by_length: false
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
debug:
deepspeed: deepspeed_configs/zero3_bf16.json # previously blank
fsdp:
fsdp_config:

# Checkpoints
resume_from_checkpoint:
saves_per_epoch: 1
```
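
The schedule in the config combines a 150-step warmup with cosine decay from the 2.5e-6 peak down to a 2.5e-7 floor. A minimal sketch of that shape, assuming the usual warmup-then-cosine-with-floor form rather than the exact `cosine_with_min_lr` implementation, and a total step count inferred from the results table (step 332 ≈ epoch 1.0):

```python
import math

# Values taken from the config above.
PEAK_LR = 2.5e-6
MIN_LR = 2.5e-7
WARMUP_STEPS = 150
TOTAL_STEPS = 664  # assumption: ~332 optimizer steps per epoch x 2 epochs, per the results table

def lr_at(step: int) -> float:
    """Approximate LR at a given optimizer step: linear warmup, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

for s in (0, 75, 150, 332, 664):
    print(f"step {s:>4}: lr = {lr_at(s):.2e}")
```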

Theia-21B-v2b-WS

This model is a fine-tuned version of BeaverAI/Theia-21B-v2a on the BeaverAI/saosauce-v2-creative-only dataset (ShareGPT data rendered with the ChatML conversation template, per the axolotl config above). It achieves the following results on the evaluation set:

  • Loss: 1.0491
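
A minimal inference sketch for loading the checkpoint with transformers and prompting in the ChatML format used during training (the `<|im_start|>`/`<|im_end|>` tokens registered in the config). The prompt is assembled by hand because the card does not state whether a chat template is bundled with the tokenizer; the generation settings and example prompt are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BeaverAI/Theia-21B-v2b-WS"  # hub_model_id from the config above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the config trains in bf16
    device_map="auto",
)

# ChatML-style prompt, matching the <|im_start|>/<|im_end|> tokens added in training.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a short scene set in a lighthouse.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```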

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2.5e-06
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 8
  • total_eval_batch_size: 8
  • optimizer: paged AdamW (8-bit) with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine_with_min_lr
  • lr_scheduler_warmup_steps: 150
  • num_epochs: 2
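
The total batch sizes above follow directly from the per-device settings in the config; a one-line check of that arithmetic:

```python
micro_batch_size = 1             # per-device batch size from the config
gradient_accumulation_steps = 1  # from the config
num_devices = 8                  # multi-GPU run reported above

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)    # 8, matching the value reported above
```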

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.5956        | 0.0030 | 1    | 1.5014          |
| 1.2309        | 0.2508 | 83   | 1.2064          |
| 1.1063        | 0.5015 | 166  | 1.1376          |
| 1.1885        | 0.7523 | 249  | 1.1021          |
| 1.1518        | 1.0030 | 332  | 1.0772          |
| 0.9328        | 1.2356 | 415  | 1.0693          |
| 0.9149        | 1.4864 | 498  | 1.0569          |
| 0.9377        | 1.7372 | 581  | 1.0491          |

Framework versions

  • Transformers 4.44.0.dev0
  • PyTorch 2.2.2
  • Datasets 2.19.1
  • Tokenizers 0.19.1