See axolotl config

axolotl version: `0.5.2`

```yaml
base_model: meta-llama/Llama-3.1-8B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
tokenizer_use_fast: false
resize_token_embeddings_to_32x: false
flash_attention: true
xformers_attention:
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
  - path: skymizer/Llama3.1-8B-base-tokenized-fineweb-edu-45B-4096
    train_on_split: train
    type: completion

test_datasets:
  - path: skymizer/Llama3.1-8B-base-tokenized-fineweb-edu-test-4K
    split: test
    type: completion
is_preprocess: true
skip_prepare_dataset: true
dataset_prepared_path: /mnt/home/model-team/datasets/pretokenized/Llama3.1-8B-base-tokenized-fineweb-edu-45B-4096
hf_use_auth_token: true
output_dir: /mnt/home/model-team/models/Llama3.1-8B-v0.1-relu-stage-1-fineweb-edu-45B-4096
resume_from_checkpoint:
auto_resume_from_checkpoints: true
sequence_len: 4096
sample_packing: true
sample_packing_group_size: 100000
sample_packing_bin_size: 200
pad_to_sequence_len: true
eval_sample_packing: false
# eval_causal_lm_metrics: ["perplexity"]
wandb_project: "sparse-tuning-cpt"
wandb_entity:
wandb_watch:
wandb_name: "Llama3.1-8B-relu-stage-1-fineweb-edu-45B-4096"
wandb_log_model:
# global batch size = 2 * 8 * 8 GPUs * 8 Nodes * 4096 = 4M
gradient_accumulation_steps: 8
micro_batch_size: 2
# eval_batch_size: 2
max_steps: 10000
optimizer: adamw_torch
learning_rate: 0.000015
lr_scheduler: cosine
cosine_min_lr_ratio: 1.0
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 0.000001
max_grad_norm: 1.0
train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false
hub_model_id: "skymizer/Llama3.1-8B-relu-stage-1-fineweb-edu-45B-4096"
save_strategy: "steps"
save_steps: 500
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
warmup_steps: 1
eval_steps: 500
eval_table_size:
debug:
deepspeed: /root/train/axolotl/deepspeed_configs/zero3_bf16.json
fsdp:
fsdp_config:
seed: 42
special_tokens:
  pad_token: "<|end_of_text|>"
```
# Llama3.1-8B-relu-stage-1-fineweb-edu-45B-4096

This model is a fine-tuned version of [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on the skymizer/Llama3.1-8B-base-tokenized-fineweb-edu-45B-4096 dataset. It achieves the following results on the evaluation set:

- Loss: 1.9682
## Model description

More information needed
## Intended uses & limitations

More information needed
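As a starting point for inference, here is a minimal loading sketch using plain `transformers`. The repo id comes from `hub_model_id` in the config above, `use_fast=False` and the pad token mirror the tokenizer settings there, and `device_map="auto"` assumes the `accelerate` package is installed; the prompt string is only an illustration.

```python
# Minimal loading sketch (assumes access to the Hub repo and enough GPU memory
# for an 8B model in bf16; `device_map="auto"` requires the accelerate package).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "skymizer/Llama3.1-8B-relu-stage-1-fineweb-edu-45B-4096"

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
tokenizer.pad_token = "<|end_of_text|>"  # config reuses this token for padding

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # training ran in bf16
    device_map="auto",
)

prompt = "The FineWeb-Edu corpus is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```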
## Training and evaluation data

Per the axolotl config above, training used the `train` split of `skymizer/Llama3.1-8B-base-tokenized-fineweb-edu-45B-4096` (pretokenized FineWeb-Edu packed into 4096-token sequences) and evaluation used the `test` split of `skymizer/Llama3.1-8B-base-tokenized-fineweb-edu-test-4K`.
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1.5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 64
- gradient_accumulation_steps: 8
- total_train_batch_size: 1024
- total_eval_batch_size: 128
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.95), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 2
- training_steps: 10000
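The total batch sizes reported above follow directly from the per-device settings, and match the "≈ 4M token" comment in the axolotl config once the sequence length is factored in. A quick sanity check, with numbers taken straight from the config:

```python
# Effective batch size per optimizer step, from the config values above.
micro_batch_size = 2
gradient_accumulation_steps = 8
num_devices = 64   # 8 GPUs per node x 8 nodes
sequence_len = 4096

sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_devices
tokens_per_step = sequences_per_step * sequence_len

print(sequences_per_step)                 # 1024 -> total_train_batch_size
print(f"{tokens_per_step / 2**20:.1f}M")  # 4.0M tokens per step (4,194,304)
```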
### Training results

| Training Loss | Epoch  | Step  | Validation Loss |
|:-------------:|:------:|:-----:|:---------------:|
| 12.2232       | 0.0001 | 1     | 12.1487         |
| 2.2025        | 0.0424 | 500   | 2.2272          |
| 2.1454        | 0.0848 | 1000  | 2.1515          |
| 2.0991        | 0.1273 | 1500  | 2.1142          |
| 2.0604        | 0.1697 | 2000  | 2.0894          |
| 2.058         | 0.2121 | 2500  | 2.0711          |
| 2.0582        | 0.2545 | 3000  | 2.0561          |
| 2.0474        | 0.2969 | 3500  | 2.0442          |
| 2.0268        | 0.3394 | 4000  | 2.0347          |
| 2.0173        | 0.3818 | 4500  | 2.0256          |
| 1.9941        | 0.4242 | 5000  | 2.0178          |
| 2.0113        | 0.4666 | 5500  | 2.0106          |
| 1.9949        | 0.5091 | 6000  | 2.0040          |
| 2.0077        | 0.5515 | 6500  | 1.9984          |
| 1.986         | 0.5939 | 7000  | 1.9935          |
| 1.9902        | 0.6363 | 7500  | 1.9888          |
| 1.9899        | 0.6787 | 8000  | 1.9841          |
| 1.9729        | 0.7212 | 8500  | 1.9800          |
| 1.971         | 0.7636 | 9000  | 1.9759          |
| 1.9784        | 0.8060 | 9500  | 1.9718          |
| 1.9553        | 0.8484 | 10000 | 1.9682          |
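Since the validation loss is the mean token-level cross-entropy in nats (the standard causal-LM objective), the perplexity metric hinted at by the commented-out `eval_causal_lm_metrics` line in the config is simply its exponential. For example, for the final checkpoint:

```python
# Convert the reported validation loss (mean cross-entropy, in nats) to perplexity.
import math

final_val_loss = 1.9682  # step 10000, from the table above
print(f"perplexity ≈ {math.exp(final_val_loss):.2f}")  # ≈ 7.16
```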
### Framework versions
- Transformers 4.46.3
- Pytorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.20.3
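To reproduce the run it helps to match these versions; a small sketch for comparing the local environment against them (all four packages expose a `__version__` attribute):

```python
# Compare locally installed versions against the ones used for this run.
import transformers, torch, datasets, tokenizers

expected = {
    "transformers": "4.46.3",
    "torch": "2.5.1+cu124",
    "datasets": "3.1.0",
    "tokenizers": "0.20.3",
}
installed = {
    "transformers": transformers.__version__,
    "torch": torch.__version__,
    "datasets": datasets.__version__,
    "tokenizers": tokenizers.__version__,
}
for name, want in expected.items():
    print(f"{name}: installed {installed[name]}, card lists {want}")
```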