---
license: cc-by-sa-4.0
base_model: stabilityai/stablelm-3b-4e1t
tags:
  - axolotl
  - generated_from_trainer
model-index:
  - name: stablelm-4e1t-2b-v0.1
    results: []
language:
  - en
---

# stablelm-4e1t-2b-v0.1

This is a layer-pruning experiment based on [stablelm-3b-4e1t](https://hf.co/stabilityai/stablelm-3b-4e1t):

- 10 layers pruned with [PruneMe](https://github.com/pszemraj/PruneMe/tree/upgrades)/MergeKit (a minimal sketch of the prune step follows this list)
- layers selected using [BEE-spoke-data/fineweb-100k_en-med](https://hf.co/datasets/BEE-spoke-data/fineweb-100k_en-med)
- followed by brief continued pretraining at ctx 4096
  - data: 10k rows of FineWeb (different from the pruning data) plus some curated data
- wandb logs [here](https://wandb.ai/pszemraj/llama3-pruning)
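
For reference, the prune step amounts to deleting a contiguous block of decoder layers whose inputs and outputs are most similar on a calibration set; PruneMe scores candidate blocks, and MergeKit writes out the pruned checkpoint. Below is a minimal, transformers-only sketch of the same surgery. It is not the actual PruneMe/MergeKit pipeline, and the layer indices are placeholders rather than the block that was actually removed:

```python
import torch
from transformers import AutoModelForCausalLM

# load the base model (older transformers versions may need trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-3b-4e1t", torch_dtype=torch.bfloat16
)

# drop a contiguous block of 10 decoder layers; indices are illustrative only,
# not the block PruneMe actually selected
drop = set(range(16, 26))
model.model.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in drop
)
model.config.num_hidden_layers = len(model.model.layers)

model.save_pretrained("./stablelm-3b-4e1t-prune10")
```

Depending on the transformers version, each remaining layer's `self_attn.layer_idx` may also need re-indexing before the pruned model is used with a KV cache; the MergeKit route sidesteps this by writing a fresh checkpoint with renumbered layers.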

## details

[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)

<details><summary>See axolotl config</summary>

### config

axolotl version: `0.4.0`

```yaml
base_model: pszemraj/stablelm-3b-4e1t-prune10
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

strict: false
seed: 80085

# dataset
datasets:
  - path: BEE-spoke-data/KI-smorgasbord_fw-small
    type: completion # format from earlier
    field: text # Optional[str] default: text, field to use for completion data
val_set_size: 0.015

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: false
train_on_inputs: false
group_by_length: false

# WANDB
wandb_project: llama3-pruning
wandb_entity: pszemraj
wandb_watch: gradients
wandb_name: stablelm-4e1t-2b-v0.1
hub_model_id: pszemraj/stablelm-4e1t-2b-v0.1
hub_strategy: every_save

gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_fused # paged_adamw_32bit
weight_decay: 0.05
lr_scheduler: cosine
learning_rate: 5e-5
warmup_ratio: 0.1

load_in_8bit: false
load_in_4bit: false
bf16: true
tf32: true

flash_attention: true
torch_compile: true # requires >= torch 2.0, may sometimes cause problems
torch_compile_backend: inductor # Optional[str]
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# hyperparams for freq of evals, saving, etc
evals_per_epoch: 5
saves_per_epoch: 3
save_safetensors: true
save_total_limit: 1
output_dir: ./output-axolotl/output-model-2b
logging_steps: 8

deepspeed:

special_tokens:
  pad_token: <|end_of_text|>
```

</details><br>
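
Nothing special is needed to run the resulting checkpoint; a standard transformers generation snippet should work (an untested sketch; adjust dtype and device to your hardware):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/stablelm-4e1t-2b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "The tradeoff in pruning layers from a language model is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```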

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| No log        | 0.0006 | 1    | 4.4344          |
| 2.6558        | 0.2004 | 332  | 2.7150          |
| 2.6548        | 0.4007 | 664  | 2.6196          |
| 2.5435        | 0.6011 | 996  | 2.5981          |
| 2.5133        | 0.8014 | 1328 | 2.5502          |
| 2.489         | 1.0018 | 1660 | 2.5106          |
| 2.2671        | 1.1898 | 1992 | 2.4944          |
| 2.2038        | 1.3902 | 2324 | 2.4843          |
| 2.2513        | 1.5905 | 2656 | 2.4761          |
| 2.1654        | 1.7909 | 2988 | 2.4769          |
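
For a rough sense of scale, the final validation loss of 2.4769 corresponds to a perplexity of exp(2.4769) ≈ 11.9 on the held-out split (`val_set_size: 0.015` above).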

---