|
---
license: apache-2.0
base_model: pszemraj/griffin-v0.01-c3t-8layer-simplewiki-silu
tags:
- generated_from_trainer
metrics:
- accuracy
datasets:
- BEE-spoke-data/fineweb-1M_en-med
language:
- en
---
|
|
|
|
|
|
# griffin-c3t-8L-v0.02-fineweb |
|
|
|
A pretraining experiment with the Griffin (`recurrent_gemma`) architecture.
|
|
|
## Model description |
|
|
|
Further training of [pszemraj/griffin-v0.01-c3t-8layer-simplewiki-silu](https://hf.co/pszemraj/griffin-v0.01-c3t-8layer-simplewiki-silu) on the [BEE-spoke-data/fineweb-1M_en-med](https://hf.co/datasets/BEE-spoke-data/fineweb-1M_en-med) dataset.
|
It achieves the following results on the evaluation set: |
|
- Loss: 5.1888 |
|
- Accuracy: 0.2326 |
|
- Num Input Tokens Seen: 798621696 |
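
The checkpoint uses custom modeling code, so loading it requires `trust_remote_code=True` (the same flag used for the evaluation below). A minimal loading sketch, assuming the standard `transformers` auto-classes resolve the custom code path:

```python
# Minimal loading/generation sketch; trust_remote_code=True mirrors the
# eval settings reported below (float32 weights).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/griffin-c3t-8L-v0.02-fineweb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```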
|
|
|
|
|
## Evaluation results
|
|
|
tl;dr: it's bad and would need more training. Zero-shot results from the `lm-evaluation-harness`, with a reproduction sketch after the table:

`hf (pretrained=pszemraj/griffin-c3t-8L-v0.02-fineweb,trust_remote_code=True,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4`
|
|
|
| Tasks |Version|Filter|n-shot| Metric | Value | | Stderr |
|--------------|------:|------|-----:|----------|----------:|---|---------:|
|winogrande | 1|none | 0|acc | 0.5146|± | 0.0140|
|piqa | 1|none | 0|acc | 0.5511|± | 0.0116|
| | |none | 0|acc_norm | 0.5261|± | 0.0116|
|openbookqa | 1|none | 0|acc | 0.1140|± | 0.0142|
| | |none | 0|acc_norm | 0.2240|± | 0.0187|
|lambada_openai| 1|none | 0|perplexity|209503.2246|± |11711.4041|
| | |none | 0|acc | 0.0000|± | 0.0000|
|boolq | 2|none | 0|acc | 0.3783|± | 0.0085|
|arc_easy | 1|none | 0|acc | 0.2593|± | 0.0090|
| | |none | 0|acc_norm | 0.2774|± | 0.0092|
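
A hedged sketch of re-running these zero-shot evals via the harness's Python API (assumes `lm-eval` >= 0.4, where `simple_evaluate` is exposed at the top level); the `model_args` string mirrors the config reported above:

```python
# Reproduction sketch for the zero-shot scores in the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=pszemraj/griffin-c3t-8L-v0.02-fineweb,"
        "trust_remote_code=True,dtype=float"
    ),
    tasks=["winogrande", "piqa", "openbookqa", "lambada_openai", "boolq", "arc_easy"],
    batch_size=4,
)
print(results["results"])
```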
|
|
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
|
- learning_rate: 0.0003 |
|
- train_batch_size: 2 |
|
- eval_batch_size: 2 |
|
- seed: 80085 |
|
- gradient_accumulation_steps: 32 |
|
- total_train_batch_size: 64 |
|
- optimizer: Adam with betas=(0.9,0.99) and epsilon=1e-07 |
|
- lr_scheduler_type: inverse_sqrt |
|
- lr_scheduler_warmup_ratio: 0.05 |
|
- num_epochs: 1.0 |
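
For reference, a sketch of how these settings map onto `transformers.TrainingArguments`; the argument names are standard, but `output_dir` is a placeholder and anything not in the list above is an assumption:

```python
# Hypothetical TrainingArguments mirroring the hyperparameters above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./griffin-c3t-8L-v0.02-fineweb",  # placeholder, not from the card
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    seed=80085,
    gradient_accumulation_steps=32,  # 2 per device x 32 steps = 64 effective
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="inverse_sqrt",
    warmup_ratio=0.05,
    num_train_epochs=1.0,
)
```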
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen |
|:-------------:|:------:|:----:|:---------------:|:--------:|:-----------------:|
| 6.0703 | 0.0656 | 400 | 6.2332 | 0.1701 | 52428800 |
| 5.723 | 0.1313 | 800 | 5.9116 | 0.1893 | 104857600 |
| 5.5106 | 0.1969 | 1200 | 5.7516 | 0.1976 | 157286400 |
| 5.455 | 0.2626 | 1600 | 5.6427 | 0.2032 | 209715200 |
| 5.3236 | 0.3282 | 2000 | 5.5567 | 0.2103 | 262144000 |
| 5.2764 | 0.3938 | 2400 | 5.4919 | 0.2151 | 314572800 |
| 5.1625 | 0.4595 | 2800 | 5.4436 | 0.2176 | 367001600 |
| 5.1851 | 0.5251 | 3200 | 5.3975 | 0.2206 | 419430400 |
| 5.0618 | 0.5908 | 3600 | 5.3624 | 0.2199 | 471859200 |
| 5.0278 | 0.6564 | 4000 | 5.3242 | 0.2236 | 524288000 |
| 5.0389 | 0.7220 | 4400 | 5.2920 | 0.2264 | 576716800 |
| 4.9732 | 0.7877 | 4800 | 5.2674 | 0.2276 | 629145600 |
| 4.9375 | 0.8533 | 5200 | 5.2418 | 0.2292 | 681574400 |
| 4.9322 | 0.9190 | 5600 | 5.2166 | 0.2312 | 734003200 |
| 4.8818 | 0.9846 | 6000 | 5.1981 | 0.2315 | 786432000 |
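
The "Input Tokens Seen" column is consistent with packed sequences of 2048 tokens; the context length is not stated in this card, so treat it as inferred. A quick sanity check:

```python
# Tokens-per-step arithmetic; context_length=2048 is inferred, not stated.
total_train_batch_size = 64
context_length = 2048
tokens_per_step = total_train_batch_size * context_length  # 131_072

assert 400 * tokens_per_step == 52_428_800    # first logged eval step
assert 6000 * tokens_per_step == 786_432_000  # last logged eval step
# 798_621_696 total tokens / 131_072 tokens per step = 6093 optimizer steps
print(798_621_696 // tokens_per_step)  # -> 6093
```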
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.40.1 |
|
- Pytorch 2.3.0+cu121 |
|
- Datasets 2.19.0 |
|
- Tokenizers 0.19.1 |