---
base_model: BeaverAI/Theia-21B-v2a
tags:
- axolotl
- generated_from_trainer
model-index:
- name: Theia-21B-v2b-WS
  results: []
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)

See axolotl config

axolotl version: `0.4.1`

```yaml
base_model: BeaverAI/Theia-21B-v2a
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

sequence_len: 16384
bf16: auto
fp16:
tf32: false
flash_attention: true

special_tokens:
  pad_token: ""
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

# PLAN
# instruct = mistral
# siayn = alpaca
# RP = chatml

# Data
datasets:
  - path: BeaverAI/saosauce-v2-creative-only
    type: sharegpt
    conversation: chatml

warmup_steps: 150
save_safetensors: true

mlflow_tracking_uri: http://127.0.0.1:7860
mlflow_experiment_name: Default

# WandB
#wandb_project: theia
#wandb_entity:

# Iterations
num_epochs: 2

# Output
output_dir: ./Theia-21B-v2b-Workspace
hub_model_id: BeaverAI/Theia-21B-v2b-WS
hub_strategy: "all_checkpoints"

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 1
micro_batch_size: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true

# Evaluation
val_set_size: 0.025
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 256
eval_sample_packing: false
eval_batch_size: 1

# Optimizer
optimizer: paged_adamw_8bit # adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0000025
lr_scheduler: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr: 0.00000025
weight_decay: 0.01
max_grad_norm: 10.0

# Misc
train_on_inputs: false
group_by_length: false
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
debug:
deepspeed: deepspeed_configs/zero3_bf16.json # previously blank
fsdp:
fsdp_config:

# Checkpoints
resume_from_checkpoint:
saves_per_epoch: 1
```
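The optimizer settings in the config combine 150 warmup steps with a `cosine_with_min_lr` schedule that decays from `learning_rate` to `min_lr`. The sketch below illustrates the general shape of such a schedule in plain Python. It is not the library's implementation, and `total_steps=662` is an assumption inferred from the results table (step 332 falls at roughly epoch 1.0), not a value stated in the config.

```python
import math


def lr_at(step, peak_lr=2.5e-6, min_lr=2.5e-7, warmup_steps=150, total_steps=662):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to min_lr.

    total_steps is an assumed value (about 2 epochs of ~331 steps each).
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Interpolate between min_lr (cosine = 0) and peak_lr (cosine = 1).
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for s in (0, 75, 150, 400, 662):
        print(s, f"{lr_at(s):.3e}")
```

Under these assumptions the rate peaks at step 150 and never falls below one tenth of the peak, so late steps still make small updates instead of stalling near zero.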

# Theia-21B-v2b-WS

This model is a fine-tuned version of [BeaverAI/Theia-21B-v2a](https://huggingface.co/BeaverAI/Theia-21B-v2a) on the `BeaverAI/saosauce-v2-creative-only` dataset listed in the axolotl config.
It achieves the following results on the evaluation set:
- Loss: 1.0491

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2.5e-06
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 8
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine_with_min_lr
- lr_scheduler_warmup_steps: 150
- num_epochs: 2

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.5956        | 0.0030 | 1    | 1.5014          |
| 1.2309        | 0.2508 | 83   | 1.2064          |
| 1.1063        | 0.5015 | 166  | 1.1376          |
| 1.1885        | 0.7523 | 249  | 1.1021          |
| 1.1518        | 1.0030 | 332  | 1.0772          |
| 0.9328        | 1.2356 | 415  | 1.0693          |
| 0.9149        | 1.4864 | 498  | 1.0569          |
| 0.9377        | 1.7372 | 581  | 1.0491          |

### Framework versions

- Transformers 4.44.0.dev0
- Pytorch 2.2.2
- Datasets 2.19.1
- Tokenizers 0.19.1
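The reported `total_train_batch_size` of 8 follows from the per-device batch size, the gradient accumulation steps, and the device count. The short sketch below reproduces that arithmetic and, as an illustration only, derives an upper bound on tokens per optimizer step from `sequence_len`. The variable `tokens_per_step` is a name introduced here, and the bound assumes every packed sequence is filled or padded to the full 16384 tokens.

```python
# Values taken from the axolotl config and the hyperparameter list.
micro_batch_size = 1
gradient_accumulation_steps = 1
num_devices = 8
sequence_len = 16384

# Sequences consumed per optimizer step across all devices.
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices

# Upper bound on tokens per optimizer step (padding included); with
# train_on_inputs set to false, only a subset contributes to the loss.
tokens_per_step = total_train_batch_size * sequence_len

print(total_train_batch_size)  # 8
print(tokens_per_step)         # 131072
```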