	---
	library_name: transformers
	license: cc-by-nc-4.0
	tags:
	- creative-writing
	- creative-writer
	- multiplicative-lora
	---

	An experimental model, fine-tuned using the ["multiplicative-LoRA" method](#the-multiplicative-lora-method) on [c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01).

	Other experimental models based on `creative-writer-v0.1-alfa-35b` that attempt to encourage more diverse/creative text generation:

	- [creative-writer-v0.1-bravo-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-bravo-35b) - Scaled the pre-softmax logits by `1.1` during training (and then reset after training).
	- [creative-writer-v0.1-charlie-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-charlie-35b) - Scaled the pre-softmax logits by `0.9` during training (and didn't reset after training).
	- [creative-writer-v0.1-delta-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-delta-35b) - Trained using [Focal Loss](https://arxiv.org/abs/1708.02002) with `gamma=2` (instead of stock [Cross Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)).
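
To make the three variants above concrete, here is a toy sketch (not the training code used for those models) of what scaling the pre-softmax logits and swapping in Focal Loss do to a single next-token distribution. The logit values, the target index, and the helper names are all made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

def cross_entropy(logits, target):
    return float(-np.log(softmax(logits)[target]))

def focal_loss(logits, target, gamma=2.0):
    # Focal Loss: down-weights tokens the model already predicts confidently.
    p_t = softmax(logits)[target]
    return float(-((1.0 - p_t) ** gamma) * np.log(p_t))

logits = np.array([2.0, 1.0, 0.5, -1.0])   # hypothetical logits over 4 tokens
target = 0                                  # hypothetical correct token

h_base = entropy(softmax(logits))
h_sharp = entropy(softmax(1.1 * logits))    # scale 1.1: sharper distribution
h_flat = entropy(softmax(0.9 * logits))     # scale 0.9: flatter distribution

ce = cross_entropy(logits, target)
fl = focal_loss(logits, target, gamma=2.0)

print(f"entropy: x1.1={h_sharp:.4f}  x1.0={h_base:.4f}  x0.9={h_flat:.4f}")
print(f"cross entropy={ce:.4f}  focal loss (gamma=2)={fl:.4f}")
```

Scaling by `1.1` lowers the entropy of the distribution seen by the loss and scaling by `0.9` raises it, while Focal Loss with `gamma=2` shrinks the loss on tokens that are already well predicted.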

	---

	# The "multiplicative-LoRA" method

	Uses:

	`h = (I + lora_B @ lora_A) @ tensor @ x = tensor @ x + lora_B @ lora_A @ tensor @ x`

	instead of the normal "additive-LoRA" method of:

	`h = (tensor + lora_B @ lora_A) @ x = tensor @ x + lora_B @ lora_A @ x`

	I only apply this to the `down_proj` matrices, and skip the last layer's `down_proj` matrix in the same way as [creative-writing-control-vectors-v3.0](https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0).
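
The two formulas above can be checked numerically. This is a minimal NumPy sketch with toy dimensions (the sizes and the `add_A` name are illustrative assumptions, not anything from the training code). It also shows the shape consequence: in the multiplicative form `lora_A` reads the layer's output, so it has `out_features` columns rather than `in_features`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 4, 2                    # toy sizes for illustration only

tensor = rng.standard_normal((d_out, d_in)) # stands in for a down_proj weight
lora_A = rng.standard_normal((r, d_out))    # multiplicative: reads the layer OUTPUT
lora_B = rng.standard_normal((d_out, r))
x = rng.standard_normal(d_in)

# Multiplicative-LoRA, written both ways.
h_factored = (np.eye(d_out) + lora_B @ lora_A) @ tensor @ x
h_expanded = tensor @ x + lora_B @ (lora_A @ (tensor @ x))

# Additive-LoRA for comparison: its A matrix must read the layer INPUT.
add_A = rng.standard_normal((r, d_in))
h_additive = (tensor + lora_B @ add_A) @ x

multiplicative_forms_agree = bool(np.allclose(h_factored, h_expanded))
print("factored == expanded:", multiplicative_forms_agree)
print("multiplicative lora_A shape:", lora_A.shape)
print("additive lora_A shape:", add_A.shape)
```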

	This currently requires hacking [PEFT's layer.py](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py) like so:

	```python
	#self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
	self.lora_A[adapter_name] = nn.Linear(self.out_features, r, bias=False)
	self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)
	```

	and:

	```python
	#x = x.to(lora_A.weight.dtype)
	temp = result.to(lora_A.weight.dtype)

	if not self.use_dora[active_adapter]:
	    #result = result + lora_B(lora_A(dropout(x))) * scaling
	    result = result + lora_B(lora_A(dropout(temp))) * scaling
	```

	Then to merge you need to hack [qlora-pipe's merge_lora.py](https://github.com/tdrussell/qlora-pipe/blob/main/merge_lora.py) to use:

	```python
	old_type = tensor.dtype
	tensor = tensor.to(torch.float32)
	tensor += scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32) @ tensor
	tensor = tensor.to(old_type)
	```
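
As a sanity check on the merge rule above, the following NumPy sketch (toy sizes and float64 for simplicity, whereas the real hack upcasts to `float32`) confirms that the merged weight reproduces the un-merged adapter's output. The `scale = 1.0` value assumes `lora_alpha / lora_rank` with the `64 / 64` settings from the config in this README.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 4, 2                    # toy sizes for illustration only
scale = 1.0                                 # assumed: lora_alpha / lora_rank = 64 / 64

tensor = rng.standard_normal((d_out, d_in))
lora_A = rng.standard_normal((r, d_out))
lora_B = rng.standard_normal((d_out, r))
x = rng.standard_normal(d_in)

# Un-merged: the adapter is applied to the base layer's output.
base_out = tensor @ x
adapter_out = base_out + scale * (lora_B @ (lora_A @ base_out))

# Merged: fold the adapter into the weight, then run a plain matmul.
# Computing lora_A @ tensor first keeps the intermediate at (r x d_in)
# instead of materialising a (d_out x d_out) matrix.
merged = tensor + scale * (lora_B @ (lora_A @ tensor))
merged_out = merged @ x

merge_matches = bool(np.allclose(adapter_out, merged_out))
print("merged weight reproduces adapter output:", merge_matches)
```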

	---

	# Training

	- Training took just under 4 days on two A6000 GPUs connected via NVLink, using [qlora-pipe](https://github.com/tdrussell/qlora-pipe).
	- The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same `dataset_combination_mode = 'concatenate'` as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter).

	## `config_creative_writer.toml`

	```toml
	# Paths
	model = '/mnt/data/c4ai-command-r-v01'
	output_dir = '/mnt/data/creative-writer-v0.1-alfa-35b'

	# Lora configuration
	lora_rank = 64
	lora_alpha = 64
	lora_dropout = 0.0
	target_modules = ['down_proj']
	layers_to_transform = '0:38' # skip last layer

	# Optimization configuration
	epochs = 1
	lr_scheduler = 'constant'
	warmup_steps = 100
	batch_size_tokens = 8192

	# Performance settings
	pipeline_stages = 2
	logging_steps = 1
	eval_steps = 100
	save_steps = 100
	checkpoint_every_n_minutes = 60
	eval_before_first_step = true
	model_weight_dtype = 'bfloat16'
	lora_weight_dtype = 'bfloat16'
	keep_states = 3
	group_by_length = true
	activation_checkpointing = 'unsloth'

	# Resume a prior run
	resume_from_checkpoint = false

	# Dataset configuration
	dataset_combination_mode = 'concatenate'
	eval_gradient_accumulation_steps = 1

	[optimizer]
	type = 'adamw_kahan'
	lr = 5e-6
	beta1 = 0.9
	beta2 = 0.99
	weight_decay = 0.01

	[[datasets]]
	name = 'books'
	dataset_type = 'textfile'
	dataset_path = '/mnt/data/datasets/ebooks/*.txt'
	sequence_len = 8192
	eval_size = 0.01
	```
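
For a rough sense of adapter size under this config, the sketch below counts trainable parameters. It assumes `c4ai-command-r-v01` has `hidden_size = 8192` and `intermediate_size = 22528` (check the model's `config.json`), and that `'0:38'` is an inclusive range covering 39 layers; both are assumptions, not values stated in this README.

```python
# Assumed model dimensions (not stated in this README; verify in config.json).
hidden_size = 8192          # down_proj out_features
intermediate_size = 22528   # down_proj in_features

# Values from config_creative_writer.toml.
lora_rank = 64
lora_alpha = 64
num_adapted_layers = 39     # assumes '0:38' is inclusive, i.e. layers 0..38

scaling = lora_alpha / lora_rank

# Multiplicative-LoRA: lora_A is (r x out), lora_B is (out x r).
mult_per_layer = lora_rank * hidden_size + hidden_size * lora_rank
# Additive-LoRA on the same module: lora_A is (r x in), lora_B is (out x r).
add_per_layer = lora_rank * intermediate_size + hidden_size * lora_rank

mult_total = mult_per_layer * num_adapted_layers
add_total = add_per_layer * num_adapted_layers

print(f"scaling = {scaling}")
print(f"multiplicative: {mult_per_layer:,} per layer, {mult_total:,} total")
print(f"additive:       {add_per_layer:,} per layer, {add_total:,} total")
```

Under these assumptions the multiplicative adapter is roughly half the size of an additive one at the same rank, because both of its matrices live on the smaller `out_features` side of `down_proj`.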

	## `ds_creative_writer.json`

	```json
	{
	    "train_micro_batch_size_per_gpu": 1,
	    "gradient_accumulation_steps": 16,
	    "gradient_clipping": 1.0,
	    "steps_per_print": 1
	}
	```
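
Putting the two config files together gives a back-of-the-envelope estimate of the number of optimizer steps in the single epoch. This is only a sketch: it assumes every micro-batch is one full 8192-token sequence, that two GPUs with `pipeline_stages = 2` leave a data-parallel degree of 1, and it ignores any effect of `batch_size_tokens` and `group_by_length`.

```python
# Figures taken from this README.
total_tokens = 180_000_000      # "~180M tokens"
eval_size = 0.01
sequence_len = 8192
micro_batch_size = 1            # train_micro_batch_size_per_gpu
grad_accum_steps = 16           # gradient_accumulation_steps
num_gpus = 2
pipeline_stages = 2

# Assumption: GPUs not used for pipeline stages form data-parallel replicas.
data_parallel = num_gpus // pipeline_stages

tokens_per_step = sequence_len * micro_batch_size * grad_accum_steps * data_parallel
train_tokens = total_tokens * (1 - eval_size)
approx_steps = round(train_tokens / tokens_per_step)

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"approximate optimizer steps for 1 epoch: {approx_steps:,}")
```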