creative-writer-v0.1-alfa-35b / README.md

Update README.md

5571b53 verified about 2 months ago

13.9 kB

	---
	library_name: transformers
	license: cc-by-nc-4.0
	tags:
	- creative-writing
	- creative-writer
	- multiplicative-lora
	---

	An experimental model, fine-tuned using the ["multiplicative-LoRA" method](#the-multiplicative-lora-method) on [c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01).

	Other experimental models which attempt to encourage more diverse/creative text generation:

	- [creative-writer-v0.1-bravo-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-bravo-35b) - Scaled the pre-softmax logits by `1.1` during training (and then reset after training).
	- [creative-writer-v0.1-charlie-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-charlie-35b) - Scaled the pre-softmax logits by `0.9` during training (and didn't reset after training).

	<details> <summary>Click to see some (brief) tests on the effect of these changes</summary>

	#### Using `command-r-3-2024` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/GqrDZnKk-fqRvihfxL014.png)

	#### Using `creative-writer-v0.1-alfa:35b` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/imba_ELU1lCXR309u4CGY.png)

	#### Using `creative-writer-v0.1-bravo:35b` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/7IzQQ_f5GGGI6-AFqG4mT.png)

	#### Using `creative-writer-v0.1-charlie:35b` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/b4kJeQbbA9Dor3pgNZlVA.png)

	---

	#### Using `command-r-3-2024` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/aHEODwL9oEvWOQe52SUMy.png)

	#### Using `creative-writer-v0.1-alfa:35b` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/2BDugSmPlJQCWp8QuHXDh.png)

	#### Using `creative-writer-v0.1-bravo:35b` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/a7VEVHFgC1BBzkZIAECYW.png)

	#### Using `creative-writer-v0.1-charlie:35b` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/ejs2KklluH1dhVCv9kQBl.png)

	---

	#### Using `command-r-3-2024` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/4cHQfo1PvJI8bB2tyxXSS.png)

	#### Using `creative-writer-v0.1-alfa:35b` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/Kf9x9GgGBh1ed5xINtill.png)

	#### Using `creative-writer-v0.1-alfa:35b` with `temperature = 1.1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/CpL_86aPtdxl2rUAJgGdX.png)

	#### Using `creative-writer-v0.1-bravo:35b` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/KJZj-8bPYI78m9VZ45gNr.png)

	#### Using `creative-writer-v0.1-bravo:35b` with `temperature = 0.9` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/qSeuI1MCIc__4VG9YC_pn.png)

	#### Using `creative-writer-v0.1-charlie:35b` with `temperature = 1` and `min-p = 0.01`:

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/PwnDkctZ273zMHC-Ta_YU.png)

	---

	Observations:

	- Up-scaling of the pre-softmax logits during training used by `creative-writer-v0.1-bravo:35b` looks the most promising.
	- Down-scaling of the pre-softmax logits during training used by `creative-writer-v0.1-charlie:35b` looks to be very similar to inference-time temperature adjustment.
	- It may be better to just leave the pre-softmax logits up-scaled after training and then let the user perform inference-time temperature adjustment.

	</details>

	---

	# Usage

	- Use the normal `command-r` chat template: `'<\|START_OF_TURN_TOKEN\|><\|USER_TOKEN\|>prompt<\|END_OF_TURN_TOKEN\|><\|START_OF_TURN_TOKEN\|><\|CHATBOT_TOKEN\|>reply...'`.
	- I suggest using no system prompt with this (and all other `Cohere` models!), as it writes much better without it IMO...
	- You *MUST* use some (small) value of min-p with this such as `0.01`(and with the original `c4ai-command-r-v01` model), or else the model will output gibberish!

	---

	# The "multiplicative-LoRA" method

	Uses:

	`h = (I + lora_B @ lora_A) @ tensor @ x = tensor @ x + lora_B @ lora_A @ tensor @ x`

	or equivalently:

	`h = tensor @ x`

	`h' = h + lora_B @ lora_A @ h`

	instead of the normal "additive-LoRA" method of:

	`h = (tensor + lora_B @ lora_A) @ x = tensor @ x + lora_B @ lora_A @ x`

	I only apply this to the `down_proj` matrices, and skipped the last layer's `down_proj` matrix in the same way as [creative-writing-control-vectors-v3.0](https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0).

	This currently requires hacking [PEFT's layer.py](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py) like so:

	```python
	#self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
	self.lora_A[adapter_name] = nn.Linear(self.out_features, r, bias=False)
	self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)
	```

	and:

	```python
	#x = x.to(lora_A.weight.dtype)
	temp = result.to(lora_A.weight.dtype)

	if not self.use_dora[active_adapter]:
	#result = result + lora_B(lora_A(dropout(x))) * scaling
	result = result + lora_B(lora_A(dropout(temp))) * scaling
	```

	Then to merge you need to hack [qlora-pipe's merge_lora.py](https://github.com/tdrussell/qlora-pipe/blob/main/merge_lora.py) to use:

	```python
	old_type = tensor.dtype
	tensor = tensor.to(torch.float32)
	tensor += scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32) @ tensor
	tensor = tensor.to(old_type)
	```

	---

	# The "multiplicative-LoRA" method's link to control-vectors (and "abliteration")

	There are actually 3 existing "multiplicative-LoRA" methods in [PEFT/tuners](https://github.com/huggingface/peft/tree/main/src/peft/tuners):

	- https://github.com/huggingface/peft/tree/main/src/peft/tuners/oft (https://arxiv.org/abs/2306.07280)
	- https://github.com/huggingface/peft/tree/main/src/peft/tuners/boft (https://arxiv.org/abs/2311.06243)
	- https://github.com/huggingface/peft/tree/main/src/peft/tuners/hra (https://arxiv.org/abs/2405.17484)

	but as explained in [this conceptual guide](https://github.com/huggingface/peft/blob/main/docs/source/conceptual_guides/oft.md):

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/AQ_m88vjvYXZwesZxrJDj.png)

	all 3 methods deliberately maintain [orthogonality](https://en.wikipedia.org/wiki/Orthogonal_matrix) (as a form of [regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics); likely [more suited to image generation models than LLMs](https://arxiv.org/abs/2405.17484)), and thus are more restrictive in the types of transformations they can perform (ie: [Rotations](https://en.wikipedia.org/wiki/Rotation) and/or [Improper Rotations](https://en.wikipedia.org/wiki/Improper_rotation) only; with no scaling or sheer transformations possible...).

	For example, these can't perform the orthogonal projection needed for ["abliteration"](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction):

	`h' = h - v @ v^T @ h`

	whereas the general (non-orthogonal) "multiplicative-LoRA" method can (in theory) do this by choosing to set `u = -v` like so:

	`h' = h + u @ v^T @ h`

	This general (non-orthogonal) "multiplicative-LoRA" method can also (in theory) perform [Householder Transformation(s)](https://en.wikipedia.org/wiki/Householder_transformation):

	`h' = h - 2 * v @ v^T @ h`

	by choosing to set `u = -2v` like so:

	`h' = h + u @ v^T @ h`

	In general, the way to think about these (non-orthogonal) "multiplicative-LoRAs" is as a kind of "conditional control-vector":

	- Each vector in `lora_A` looks for a certain dirrection, and via the dot-product it generates a (signed) weighting factor that measures the similarity between the *output* of the `down_proj` transformation and the specific vector in `lora_A`.
	- Each corresponding vector in `lora_B` then gets added to the hidden state / residual stream, scaled by the corresponding (signed) weighting factor.

	So instead of having just a single vector that we add (and in essence adding a `'.bias'` weight to create an [affine transformation](https://en.wikipedia.org/wiki/Affine_transformation)), we now have many different control vectors that can be added (stored in `lora_B`), based on how well they match another set of "direction detection vectors" (stored in `lora_A`).

	NOTE: The [LoRA+](https://arxiv.org/abs/2402.12354) paper uses a similar way of viewing the purpose of `lora_A` and `lora_B`:

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/vZ2Gys3huKAWIVe0wz2-q.png)

	but whereas `lora_A` looks at the *input* to the transformation for "additive-LoRAs"; these new (non-orthogonal) "multiplicative-LoRAs" instead use `lora_A` to look at the *output* of the (`down_proj`) transformation...

	---

	# Training

	- Took just over 4 days using dual-A6000 GPUs connected via NVLink, using [qlora-pipe](https://github.com/tdrussell/qlora-pipe).
	- The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same `dataset_combination_mode = 'concatenate'` and `dataset_type = 'textfile'` as tdrussell's [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
	- I used the same `sequence_len = 8192` and `batch_size_tokens = 8192` as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3), but since I only target `down_proj` in a very specific way; I doubt this will affect the useable context length of the model, and 8k tokens should be around 2-3 user-AI rounds' worth of interaction in real terms.
	- I used `pipeline_stages = 2` and `"gradient_accumulation_steps": 16` to roughly match the "tokens-per-step" as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
	- I used a much lower learning-rate of `5e-6`, as the `5e-5` value used by [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) dropped the evaluation loss far too quickly (likely due to adapting `down_proj` only being "almost convex").
	- I set `lora_dropout = 0.0` as it doesn't really make sense to use with `epochs = 1`.
	- I left `weight_decay = 0.01` but not convinced this is really doing anything useful, and may actually even be harming the adaption of the early `down_proj` matrices where the gradient signal is likely to be much weaker.
	- I found via experimentation that setting `lora_rank` and `lora_alpha` to a very low value (as a form of [Spectral Regularization](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3)), can cause the training to get stuck at [saddle-points](https://en.wikipedia.org/wiki/Saddle_point) as explained in [this](https://arxiv.org/abs/2402.11867) paper; particularly if using stock SGD instead of Adam.
	- In general, I relied mainly on early stopping for Regularization and deliberately set out to undertrain the model (we can always increase the size of the dataset at a later time...).

	## `config_creative_writer.toml`

	```toml
	# Paths
	model = '/mnt/data/c4ai-command-r-v01'
	output_dir = '/mnt/data/creative-writer-v0.1-35b'

	# Lora configuration
	lora_rank = 64
	lora_alpha = 64
	lora_dropout = 0.0
	target_modules = ['down_proj']
	layers_to_transform = '0:38' # skip last layer

	# Optimization configuration
	epochs = 1
	lr_scheduler = 'constant'
	warmup_steps = 100
	batch_size_tokens = 8192

	# Performance settings
	pipeline_stages = 2
	logging_steps = 1
	eval_steps = 100
	save_steps = 100
	checkpoint_every_n_minutes = 60
	eval_before_first_step = true
	model_weight_dtype = 'bfloat16'
	lora_weight_dtype = 'bfloat16'
	keep_states = 3
	group_by_length = true
	activation_checkpointing = 'unsloth'

	# Resume a prior run
	resume_from_checkpoint = false

	# Dataset configuration
	dataset_combination_mode = 'concatenate'
	eval_gradient_accumulation_steps = 1

	[optimizer]
	type = 'adamw_kahan'
	lr = 5e-6
	beta1 = 0.9
	beta2 = 0.99
	weight_decay = 0.01

	[[datasets]]
	name = 'books'
	dataset_type = 'textfile'
	dataset_path = '/mnt/data/books/*.txt'
	sequence_len = 8192
	eval_size = 0.01
	```

	## `ds_creative_writer.json`

	```json
	{
	"train_micro_batch_size_per_gpu": 1,
	"gradient_accumulation_steps": 16,
	"gradient_clipping": 1.0,
	"steps_per_print": 1
	}
	```

	---

	# Graphs

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/kkp6eVYkHp6qJdN3knUju.png)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/UVkVjyd9GhhpooYIRtxpQ.png)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/EK1UltB1wLYPeKJxhkXlh.png)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/67vxL6Ws8zCu360-YDAFl.png)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/_2XSvOPDkdH5NAVwEcXfO.png)