---
license: apache-2.0
base_model: pszemraj/griffin-1024-llama3t-8layer-simplewiki-silu
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: griffin-1024-llama3t-8layer-simplewiki-silu-fineweb-1M_en-med-vN
  results: []
datasets:
- BEE-spoke-data/fineweb-1M_en-med
language:
- en
---

# griffin-llama3t-8L-v0.02-fineweb

A pretraining experiment with the Griffin (`recurrent_gemma`) architecture. This one uses the Llama-3 tokenizer.

## Model description

Further training of [pszemraj/griffin-1024-llama3t-8layer-simplewiki-silu](https://huggingface.co/pszemraj/griffin-1024-llama3t-8layer-simplewiki-silu) on the BEE-spoke-data/fineweb-1M_en-med dataset.
It achieves the following results on the evaluation set:
- Loss: 5.6538
- Accuracy: 0.1881
- Num Input Tokens Seen: 766509056
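
For a rough sense of scale, the evaluation loss can be converted to perplexity. This is a minimal sketch that assumes the reported loss is the mean per-token cross-entropy in nats; the variable names are illustrative only.

```python
import math

# Final evaluation loss reported above (assumed: mean cross-entropy per token, in nats).
eval_loss = 5.6538

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = math.exp(eval_loss)

print(f"eval perplexity: {perplexity:.1f}")
```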

## evals

tl;dr: it's bad and would need more training:


hf (pretrained=pszemraj/griffin-llama3t-8L-v0.02-fineweb,trust_remote_code=True,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4

|    Tasks     |Version|Filter|n-shot|  Metric  |   Value   |   |  Stderr  |
|--------------|------:|------|-----:|----------|----------:|---|---------:|
|winogrande    |      1|none  |     0|acc       |     0.4964|±  |    0.0141|
|piqa          |      1|none  |     0|acc       |     0.5332|±  |    0.0116|
|              |       |none  |     0|acc_norm  |     0.5299|±  |    0.0116|
|openbookqa    |      1|none  |     0|acc       |     0.1280|±  |    0.0150|
|              |       |none  |     0|acc_norm  |     0.2320|±  |    0.0189|
|lambada_openai|      1|none  |     0|perplexity|638060.0702|±  |43608.0044|
|              |       |none  |     0|acc       |     0.0000|±  |    0.0000|
|boolq         |      2|none  |     0|acc       |     0.3783|±  |    0.0085|
|arc_easy      |      1|none  |     0|acc       |     0.2614|±  |    0.0090|
|              |       |none  |     0|acc_norm  |     0.2744|±  |    0.0092|
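
One way to read the table is to compare each `acc` value with random guessing. The sketch below copies the accuracies and standard errors from the table; the chance levels are an assumption (1 / number of answer options, with `arc_easy` treated as four-way even though a few items have a different number of options), and `gap_in_stderr` is a hypothetical helper name.

```python
# (accuracy, stderr) copied from the `acc` rows of the table above.
results = {
    "winogrande": (0.4964, 0.0141),
    "piqa": (0.5332, 0.0116),
    "openbookqa": (0.1280, 0.0150),
    "boolq": (0.3783, 0.0085),
    "arc_easy": (0.2614, 0.0090),
}

# Assumed chance levels: 1 / number of answer options per task.
chance = {
    "winogrande": 0.5,
    "piqa": 0.5,
    "openbookqa": 0.25,
    "boolq": 0.5,
    "arc_easy": 0.25,
}


def gap_in_stderr(task: str) -> float:
    """Distance of the reported accuracy from chance, in units of stderr."""
    acc, stderr = results[task]
    return (acc - chance[task]) / stderr


for task in results:
    print(f"{task:<12} {gap_in_stderr(task):+.2f} stderr from chance")
```

Under these assumptions, values near zero mean the score is statistically hard to tell apart from guessing, and large negative values mean the model does worse than guessing on that task.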

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 2
- eval_batch_size: 2
- seed: 80085
- gradient_accumulation_steps: 32
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.99) and epsilon=1e-07
- lr_scheduler_type: inverse_sqrt
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1.0
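
The `inverse_sqrt` schedule with a warmup ratio can be sketched in plain Python as below. This is an approximation of the common form (linear warmup, then decay proportional to `1/sqrt(step)`), not a copy of the trainer's implementation. `TOTAL_STEPS` is an estimate derived from step 5600 falling at epoch 0.9576 in the training results, and the function names are hypothetical.

```python
import math

BASE_LR = 3e-4        # learning_rate from the list above
WARMUP_RATIO = 0.05   # lr_scheduler_warmup_ratio from the list above
TOTAL_STEPS = 5848    # assumption: estimated as 5600 / 0.9576, rounded


def warmup_steps(total_steps: int, warmup_ratio: float = WARMUP_RATIO) -> int:
    """Number of warmup steps implied by the warmup ratio (at least 1)."""
    return max(1, math.ceil(total_steps * warmup_ratio))


def inverse_sqrt_lr(step: int, total_steps: int = TOTAL_STEPS,
                    base_lr: float = BASE_LR) -> float:
    """Linear warmup to base_lr, then decay as sqrt(warmup_steps / step)."""
    w = warmup_steps(total_steps)
    if step < w:
        return base_lr * step / w
    return base_lr * math.sqrt(w / step)


for step in (0, 100, warmup_steps(TOTAL_STEPS), 2000, 5600):
    print(f"step {step:>5}: lr = {inverse_sqrt_lr(step):.6f}")
```

With this form the learning rate peaks at the end of warmup and halves each time the step count quadruples after that.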

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Accuracy | Input Tokens Seen |
|:-------------:|:------:|:----:|:---------------:|:--------:|:-----------------:|
| 6.4019        | 0.0684 | 400  | 6.7690          | 0.1278   | 52428800          |
| 6.0547        | 0.1368 | 800  | 6.4214          | 0.1460   | 104857600         |
| 5.8133        | 0.2052 | 1200 | 6.2566          | 0.1550   | 157286400         |
| 5.7212        | 0.2736 | 1600 | 6.1411          | 0.1620   | 209715200         |
| 5.6175        | 0.3420 | 2000 | 6.0502          | 0.1669   | 262144000         |
| 5.5014        | 0.4104 | 2400 | 5.9827          | 0.1687   | 314572800         |
| 5.4882        | 0.4788 | 2800 | 5.9203          | 0.1731   | 367001600         |
| 5.3972        | 0.5472 | 3200 | 5.8614          | 0.1782   | 419430400         |
| 5.3983        | 0.6156 | 3600 | 5.8340          | 0.1773   | 471859200         |
| 5.3175        | 0.6840 | 4000 | 5.7916          | 0.1814   | 524288000         |
| 5.3014        | 0.7524 | 4400 | 5.7565          | 0.1814   | 576716800         |
| 5.2749        | 0.8208 | 4800 | 5.7303          | 0.1849   | 629145600         |
| 5.2264        | 0.8892 | 5200 | 5.6993          | 0.1850   | 681574400         |
| 5.2107        | 0.9576 | 5600 | 5.6745          | 0.1884   | 734003200         |
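
The token counts in the table can be cross-checked against the batch settings. The sketch below derives tokens per optimizer step from the first and last rows, then divides by the total batch size. The resulting tokens-per-sequence figure is an inference that assumes a single device and fixed-length sequences; the sequence length is not stated elsewhere in this card.

```python
# Batch settings from the training hyperparameters.
train_batch_size = 2
gradient_accumulation_steps = 32

# Sequences per optimizer step (assumes one device, matching total_train_batch_size: 64).
sequences_per_step = train_batch_size * gradient_accumulation_steps

# (step, input tokens seen) taken from the first and last rows of the table above.
first_row = (400, 52_428_800)
last_row = (5600, 734_003_200)

# Tokens consumed per optimizer step, from the difference between the two rows.
tokens_per_step = (last_row[1] - first_row[1]) // (last_row[0] - first_row[0])

# Implied tokens per sequence (an inference, assuming fixed-length sequences).
tokens_per_sequence = tokens_per_step // sequences_per_step

print(f"sequences per step:  {sequences_per_step}")
print(f"tokens per step:     {tokens_per_step}")
print(f"tokens per sequence: {tokens_per_sequence}")
```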


### Framework versions

- Transformers 4.40.1
- Pytorch 2.3.0+cu121
- Datasets 2.19.0
- Tokenizers 0.19.1