Aura-MoE-2x4B / README.md

Update README.md

73aef05 verified 8 days ago

3.82 kB

	---
	license: apache-2.0
	datasets:
	- Mielikki/Erebus-87k
	- FourOhFour/Instruct_Phase
	- FourOhFour/RP_Phase
	- anthracite-core/full-opus-chosen-hermes-rejected-kto-v1
	language:
	- en
	base_model:
	- IntervitensInc/Llama-3.1-Minitron-4B-Width-Base-chatml
	---
	## Aura-MoE-2x4B

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/626dfb8786671a29c715f8a9/LpCTIR45g099eXDIwYmKa.png)

	## Introduction

	Aura-MoE-2x4B is a state of the art dedicated roleplaying model designed to fulfill your every desire.

	The finetunes used in this merge saw several hundreds of millions of tokens of completion, instruction and roleplaying data. A Kahneman-Tversky Optimization was applied to both heal and give this model a unique output style.

	This model can be considered inferior to [Aura-MoE-2x4B-v2](https://huggingface.co/AuraIndustries/Aura-MoE-2x4B-v2) which is a direct improvement.

	Developed by Aura Industries, with contributions from Anthracite Org

	## Model Details

	- Model Name: Aura-MoE-2x4B
	- Base Model: [IntervitensInc/Llama-3.1-Minitron-4B-Width-Base-chatml](https://huggingface.co/IntervitensInc/Llama-3.1-Minitron-4B-Width-Base-chatml)
	- Model Type: Chat Completions
	- Prompt Format: ChatML
	- License: Apache-2.0
	- Language: English
	- Max Context: 8,192+ tokens

	## License

	This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

	## Quantizations

	Due to the abnormal nature of this model, only static GGUF quantization is available.

	[Static GGUF](https://huggingface.co/mradermacher/Aura-MoE-2x4B-GGUF)

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

	Coming soon...

	\| Metric \|Value\|
	\|-------------------\|----:\|
	\|Avg. \| N/A\|
	\|IFEval (0-Shot) \| N/A\|
	\|BBH (3-Shot) \| N/A\|
	\|MATH Lvl 5 (4-Shot)\| N/A\|
	\|GPQA (0-shot) \| N/A\|
	\|MuSR (0-shot) \| N/A\|
	\|MMLU-PRO (5-shot) \| N/A\|

	## Training Configuration

	<details><summary>Click here for Mergekit and Axolotl configs</summary>

	MoE Merge

	```yaml
	base_model: FourOhFour/Crispy_Crab_4B
	gate_mode: hidden
	dtype: bfloat16
	experts_per_token: 1
	experts:
	- source_model: FourOhFour/Crispy_Crab_4B
	positive_prompts:
	- "Roleplaying partner"
	- source_model: FourOhFour/Zenith_4B
	positive_prompts:
	- "Instruction following assistant"
	```

	KTO

	```yaml
	base_model: jeiku/2x4Bmoe
	model_type: AutoModelForCausalLM
	tokenizer_type: AutoTokenizer

	load_in_8bit: false
	load_in_4bit: false
	strict: false

	hub_model_id: jeiku/moekto
	hub_strategy: "all_checkpoints"
	push_dataset_to_hub:
	hf_use_auth_token: true

	chat_template: chatml

	rl: kto
	rl_beta: 0.2
	kto_desirable_weight: 0.2

	datasets:
	- path: anthracite-core/full-opus-chosen-hermes-rejected-kto-v1
	type: chatml.argilla

	shuffle_merged_datasets: true
	val_set_size: 0.0
	output_dir: ./outputs/out

	sequence_len: 8192
	sample_packing: false
	eval_sample_packing: false
	pad_to_sequence_len: false

	wandb_project: moekto
	wandb_entity:
	wandb_watch:
	wandb_name: moekto
	wandb_log_model:

	gradient_accumulation_steps: 16
	micro_batch_size: 2
	num_epochs: 2
	max_steps: 500

	optimizer: adamw_8bit
	lr_scheduler: cosine
	learning_rate: 0.00001
	weight_decay: 0.05

	train_on_inputs: false
	group_by_length: false
	bf16: auto
	fp16:
	tf32: true

	gradient_checkpointing: true
	gradient_checkpointing_kwargs:
	use_reentrant: true
	remove_unused_columns: false
	early_stopping_patience:
	resume_from_checkpoint:
	local_rank:
	logging_steps: 1
	xformers_attention:
	flash_attention: true

	warmup_steps: 10
	evals_per_epoch: 2
	eval_table_size:
	eval_max_new_tokens:
	saves_per_epoch: 1

	debug:
	deepspeed:
	fsdp:
	fsdp_config:
	fsdp:
	fsdp_config:

	special_tokens:
	pad_token: <\|finetune_right_pad_id\|>
	```
	</details><br>