smol_llama-220M-GQA-fineweb-edu-10BT

This model is a continuously pretrained version of BEE-spoke-data/smol_llama-220M-GQA, trained on the 10BT-sample subset of HuggingFaceFW/fineweb-edu.

It achieves the following results on the evaluation set:

  • Loss: 2.7416
  • Accuracy: 0.4560
  • Num Input Tokens Seen: 10810818560
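
For readers who prefer perplexity, the evaluation loss can be converted with a one-line calculation. This is a minimal sketch that assumes the reported loss is the mean per-token cross-entropy in nats; the card does not state this explicitly.

```python
import math

# Evaluation loss reported above. Assumed (not stated in the card) to be
# the mean per-token cross-entropy measured in nats.
eval_loss = 2.7416

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)
print(f"perplexity ~ {perplexity:.2f}")
```

Under that assumption the evaluation perplexity comes out at roughly 15.5.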

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 80085
  • gradient_accumulation_steps: 32
  • total_train_batch_size: 256
  • optimizer: Adam with betas=(0.9,0.95) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.05
  • num_epochs: 1.0
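
The batch-size and scheduler settings above can be illustrated with a small sketch. It is not the training code: `lr_at_step` is a hypothetical helper, the single-device count is inferred from 8 × 32 = 256, and `TOTAL_STEPS` is an inferred value (final input tokens seen divided by tokens per optimizer step), not a number stated in the card. The shape shown (linear warmup, then cosine decay to zero) is the common form of a cosine schedule with warmup.

```python
import math

# Values from the hyperparameter list above.
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 32
PEAK_LR = 5e-05
WARMUP_RATIO = 0.05

# Assumption: one device, since 8 * 32 * 1 matches total_train_batch_size.
NUM_DEVICES = 1
effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES

# Assumption: total optimizer steps inferred from the token counts in this
# card, not stated directly.
TOTAL_STEPS = 20620


def lr_at_step(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR,
               warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch)
for s in (0, 500, 2000, 10000, 20620):
    print(s, lr_at_step(s))
```

The learning rate starts at zero, peaks at 5e-05 after about 5% of the steps, and decays back to zero by the final step.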

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen |
|---|---|---|---|---|---|
| 2.8567 | 0.0145 | 300 | 2.8291 | 0.4450 | 157286400 |
| 2.8517 | 0.0291 | 600 | 2.8153 | 0.4465 | 314572800 |
| 2.8224 | 0.0436 | 900 | 2.8025 | 0.4481 | 471859200 |
| 2.8178 | 0.0582 | 1200 | 2.7912 | 0.4495 | 629145600 |
| 2.8001 | 0.0727 | 1500 | 2.7832 | 0.4505 | 786432000 |
| 2.8045 | 0.0873 | 1800 | 2.7772 | 0.4512 | 943718400 |
| 2.8019 | 0.1018 | 2100 | 2.7729 | 0.4516 | 1101004800 |
| 2.7995 | 0.1164 | 2400 | 2.7691 | 0.4522 | 1258291200 |
| 2.8006 | 0.1309 | 2700 | 2.7657 | 0.4526 | 1415577600 |
| 2.7886 | 0.1455 | 3000 | 2.7631 | 0.4528 | 1572864000 |
| 2.7907 | 0.1600 | 3300 | 2.7606 | 0.4532 | 1730150400 |
| 2.7907 | 0.1746 | 3600 | 2.7588 | 0.4536 | 1887436800 |
| 2.7788 | 0.1891 | 3900 | 2.7569 | 0.4537 | 2044723200 |
| 2.7942 | 0.2037 | 4200 | 2.7552 | 0.4540 | 2202009600 |
| 2.793 | 0.2182 | 4500 | 2.7538 | 0.4543 | 2359296000 |
| 2.7958 | 0.2328 | 4800 | 2.7526 | 0.4544 | 2516582400 |
| 2.78 | 0.2473 | 5100 | 2.7515 | 0.4547 | 2673868800 |
| 2.7937 | 0.2619 | 5400 | 2.7506 | 0.4548 | 2831155200 |
| 2.7717 | 0.2764 | 5700 | 2.7498 | 0.4548 | 2988441600 |
| 2.7832 | 0.2910 | 6000 | 2.7490 | 0.4548 | 3145728000 |
| 2.768 | 0.3055 | 6300 | 2.7482 | 0.4550 | 3303014400 |
| 2.7653 | 0.3201 | 6600 | 2.7476 | 0.4551 | 3460300800 |
| 2.7843 | 0.3346 | 6900 | 2.7470 | 0.4551 | 3617587200 |
| 2.7765 | 0.3492 | 7200 | 2.7464 | 0.4550 | 3774873600 |
| 2.7778 | 0.3637 | 7500 | 2.7460 | 0.4552 | 3932160000 |
| 2.7655 | 0.3783 | 7800 | 2.7455 | 0.4553 | 4089446400 |
| 2.7943 | 0.3928 | 8100 | 2.7449 | 0.4554 | 4246732800 |
| 2.7715 | 0.4074 | 8400 | 2.7447 | 0.4552 | 4404019200 |
| 2.7828 | 0.4219 | 8700 | 2.7443 | 0.4554 | 4561305600 |
| 2.7883 | 0.4365 | 9000 | 2.7440 | 0.4556 | 4718592000 |
| 2.7627 | 0.4510 | 9300 | 2.7437 | 0.4556 | 4875878400 |
| 2.7841 | 0.4656 | 9600 | 2.7435 | 0.4557 | 5033164800 |
| 2.7734 | 0.4801 | 9900 | 2.7433 | 0.4557 | 5190451200 |
| 2.7829 | 0.4947 | 10200 | 2.7430 | 0.4557 | 5347737600 |
| 2.781 | 0.5092 | 10500 | 2.7429 | 0.4557 | 5505024000 |
| 2.7757 | 0.5238 | 10800 | 2.7428 | 0.4557 | 5662310400 |
| 2.779 | 0.5383 | 11100 | 2.7426 | 0.4559 | 5819596800 |
| 2.7771 | 0.5529 | 11400 | 2.7425 | 0.4559 | 5976883200 |
| 2.7828 | 0.5674 | 11700 | 2.7424 | 0.4560 | 6134169600 |
| 2.7814 | 0.5820 | 12000 | 2.7423 | 0.4558 | 6291456000 |
| 2.7735 | 0.5965 | 12300 | 2.7422 | 0.4559 | 6448742400 |
| 2.7848 | 0.6111 | 12600 | 2.7420 | 0.4559 | 6606028800 |
| 2.7748 | 0.6256 | 12900 | 2.7420 | 0.4559 | 6763315200 |
| 2.7697 | 0.6402 | 13200 | 2.7419 | 0.4560 | 6920601600 |
| 2.7689 | 0.6547 | 13500 | 2.7419 | 0.4560 | 7077888000 |
| 2.7747 | 0.6692 | 13800 | 2.7419 | 0.4559 | 7235174400 |
| 2.786 | 0.6838 | 14100 | 2.7418 | 0.4561 | 7392460800 |
| 2.7801 | 0.6983 | 14400 | 2.7417 | 0.4560 | 7549747200 |
| 2.7658 | 0.7129 | 14700 | 2.7417 | 0.4561 | 7707033600 |
| 2.7717 | 0.7274 | 15000 | 2.7417 | 0.4560 | 7864320000 |
| 2.7717 | 0.7420 | 15300 | 2.7417 | 0.4560 | 8021606400 |
| 2.777 | 0.7565 | 15600 | 2.7417 | 0.4559 | 8178892800 |
| 2.7793 | 0.7711 | 15900 | 2.7416 | 0.4560 | 8336179200 |
| 2.7718 | 0.7856 | 16200 | 2.7416 | 0.4559 | 8493465600 |
| 2.7757 | 0.8002 | 16500 | 2.7416 | 0.4560 | 8650752000 |
| 2.7763 | 0.8147 | 16800 | 2.7416 | 0.4559 | 8808038400 |
| 2.7581 | 0.8293 | 17100 | 2.7416 | 0.4559 | 8965324800 |
| 2.7719 | 0.8438 | 17400 | 2.7416 | 0.4560 | 9122611200 |
| 2.7609 | 0.8584 | 17700 | 2.7416 | 0.4560 | 9279897600 |
| 2.7753 | 0.8729 | 18000 | 2.7416 | 0.4559 | 9437184000 |
| 2.7674 | 0.8875 | 18300 | 2.7415 | 0.4560 | 9594470400 |
| 2.7601 | 0.9020 | 18600 | 2.7416 | 0.4560 | 9751756800 |
| 2.7823 | 0.9166 | 18900 | 2.7416 | 0.4560 | 9909043200 |
| 2.7767 | 0.9311 | 19200 | 2.7416 | 0.4560 | 10066329600 |
| 2.7759 | 0.9457 | 19500 | 2.7416 | 0.4560 | 10223616000 |
| 2.7722 | 0.9602 | 19800 | 2.7415 | 0.4560 | 10380902400 |
| 2.7764 | 0.9748 | 20100 | 2.7416 | 0.4560 | 10538188800 |
| 2.7724 | 0.9893 | 20400 | 2.7416 | 0.4559 | 10695475200 |
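
The step and token columns in the training results can be cross-checked with a little arithmetic. The sketch below uses three rows copied from the results, plus the total batch size and final token count reported earlier. The sequence length and total step count it derives are inferences: the card does not state either, and the calculation assumes fixed-length sequences with every token counted.

```python
# (step, input tokens seen) pairs copied from the training results.
rows = [(300, 157286400), (600, 314572800), (20400, 10695475200)]

TOTAL_TRAIN_BATCH_SIZE = 256        # from the hyperparameter list
FINAL_TOKENS_SEEN = 10810818560     # "Num Input Tokens Seen" at evaluation

# Tokens consumed per optimizer step; should be identical for every row.
per_step_values = {tokens // step for step, tokens in rows}
assert len(per_step_values) == 1, "token counts are not linear in step"
tokens_per_step = per_step_values.pop()

# Inferred values (assumptions, not stated in the card).
implied_seq_len = tokens_per_step // TOTAL_TRAIN_BATCH_SIZE
implied_total_steps = FINAL_TOKENS_SEEN // tokens_per_step

print("tokens per step:", tokens_per_step)
print("implied sequence length:", implied_seq_len)
print("implied total steps:", implied_total_steps)
print("epoch fraction at step 300:", round(300 / implied_total_steps, 4))
```

The rows agree on 524288 tokens per step, which is consistent with 256 sequences of 2048 tokens each and about 20620 optimizer steps for the full epoch.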

Framework versions

  • Transformers 4.41.1
  • PyTorch 2.3.1+cu118
  • Datasets 2.19.1
  • Tokenizers 0.19.1

Model size: 218M params (Safetensors, tensor type BF16)