|
--- |
|
license: apache-2.0 |
|
base_model: pszemraj/griffin-1024-llama3t-8layer-simplewiki-silu |
|
tags: |
|
- generated_from_trainer |
|
metrics: |
|
- accuracy |
|
model-index: |
|
- name: griffin-1024-llama3t-8layer-simplewiki-silu-fineweb-1M_en-med-vN |
|
results: [] |
|
datasets: |
|
- BEE-spoke-data/fineweb-1M_en-med |
|
language: |
|
- en |
|
--- |
|
|
|
# griffin-llama3t-8L-v0.02-fineweb |
|
|
|
A pretraining experiment with the griffin/recurrent_gemma architecture; this variant uses the Llama-3 tokenizer.
|
|
|
## Model description |
|
|
|
Further training of [pszemraj/griffin-1024-llama3t-8layer-simplewiki-silu](https://huggingface.co/pszemraj/griffin-1024-llama3t-8layer-simplewiki-silu) on the [BEE-spoke-data/fineweb-1M_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-1M_en-med) dataset.
|
It achieves the following results on the evaluation set: |
|
- Loss: 5.6538 |
|
- Accuracy: 0.1881 |
|
- Num Input Tokens Seen: 766509056 |
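
For quick experimentation, here is a minimal generation sketch (not from the original training code; it assumes the checkpoint loads through `transformers` with `trust_remote_code=True`, as in the eval command below):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/griffin-llama3t-8L-v0.02-fineweb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # custom griffin/recurrent_gemma code
    torch_dtype=torch.float32,  # evals below were run with dtype=float
)

prompt = "Simple English Wikipedia is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```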
|
|
|
## Evals
|
|
|
tl;dr: it's bad and would need much more training. Zero-shot results from the lm-evaluation-harness:
|
|
|
|
|
```text
hf (pretrained=pszemraj/griffin-llama3t-8L-v0.02-fineweb,trust_remote_code=True,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4
```
|
|
|
| Tasks |Version|Filter|n-shot| Metric | Value | | Stderr | |
|
|--------------|------:|------|-----:|----------|----------:|---|---------:| |
|
|winogrande | 1|none | 0|acc | 0.4964|± | 0.0141| |
|
|piqa | 1|none | 0|acc | 0.5332|± | 0.0116| |
|
| | |none | 0|acc_norm | 0.5299|± | 0.0116| |
|
|openbookqa | 1|none | 0|acc | 0.1280|± | 0.0150| |
|
| | |none | 0|acc_norm | 0.2320|± | 0.0189| |
|
|lambada_openai| 1|none | 0|perplexity|638060.0702|± |43608.0044| |
|
| | |none | 0|acc | 0.0000|± | 0.0000| |
|
|boolq | 2|none | 0|acc | 0.3783|± | 0.0085| |
|
|arc_easy | 1|none | 0|acc | 0.2614|± | 0.0090| |
|
| | |none | 0|acc_norm | 0.2744|± | 0.0092| |
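
The numbers above should be reproducible with the lm-evaluation-harness Python API; a hedged sketch (task names and `model_args` taken from the command line above, `lm_eval` assumed installed):

```python
import lm_eval  # EleutherAI lm-evaluation-harness

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=pszemraj/griffin-llama3t-8L-v0.02-fineweb,"
        "trust_remote_code=True,dtype=float"
    ),
    tasks=[
        "winogrande", "piqa", "openbookqa",
        "lambada_openai", "boolq", "arc_easy",
    ],
    batch_size=4,  # matches the run recorded above
)
print(results["results"])
```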
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (an approximate `TrainingArguments` equivalent is sketched after the list):
|
- learning_rate: 0.0003 |
|
- train_batch_size: 2 |
|
- eval_batch_size: 2 |
|
- seed: 80085 |
|
- gradient_accumulation_steps: 32 |
|
- total_train_batch_size: 64 |
|
- optimizer: Adam with betas=(0.9,0.99) and epsilon=1e-07 |
|
- lr_scheduler_type: inverse_sqrt |
|
- lr_scheduler_warmup_ratio: 0.05 |
|
- num_epochs: 1.0 |
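
As a rough reconstruction, the list above maps onto the standard `transformers` `TrainingArguments` as sketched below (the `output_dir` is hypothetical; the other values mirror the list):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="griffin-llama3t-8L-v0.02-fineweb",  # hypothetical path
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    seed=80085,
    gradient_accumulation_steps=32,  # 2 per device x 32 steps = 64 total
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="inverse_sqrt",
    warmup_ratio=0.05,
    num_train_epochs=1.0,
)
```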
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen | |
|
|:-------------:|:------:|:----:|:---------------:|:--------:|:-----------------:| |
|
| 6.4019 | 0.0684 | 400 | 6.7690 | 0.1278 | 52428800 | |
|
| 6.0547 | 0.1368 | 800 | 6.4214 | 0.1460 | 104857600 | |
|
| 5.8133 | 0.2052 | 1200 | 6.2566 | 0.1550 | 157286400 | |
|
| 5.7212 | 0.2736 | 1600 | 6.1411 | 0.1620 | 209715200 | |
|
| 5.6175 | 0.3420 | 2000 | 6.0502 | 0.1669 | 262144000 | |
|
| 5.5014 | 0.4104 | 2400 | 5.9827 | 0.1687 | 314572800 | |
|
| 5.4882 | 0.4788 | 2800 | 5.9203 | 0.1731 | 367001600 | |
|
| 5.3972 | 0.5472 | 3200 | 5.8614 | 0.1782 | 419430400 | |
|
| 5.3983 | 0.6156 | 3600 | 5.8340 | 0.1773 | 471859200 | |
|
| 5.3175 | 0.6840 | 4000 | 5.7916 | 0.1814 | 524288000 | |
|
| 5.3014 | 0.7524 | 4400 | 5.7565 | 0.1814 | 576716800 | |
|
| 5.2749 | 0.8208 | 4800 | 5.7303 | 0.1849 | 629145600 | |
|
| 5.2264 | 0.8892 | 5200 | 5.6993 | 0.1850 | 681574400 | |
|
| 5.2107 | 0.9576 | 5600 | 5.6745 | 0.1884 | 734003200 | |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.40.1 |
|
- Pytorch 2.3.0+cu121 |
|
- Datasets 2.19.0 |
|
- Tokenizers 0.19.1 |