library_name: transformers
license: cc-by-nc-4.0
tags:
- creative-writing
- creative-writer
- multiplicative-lora
An experimental model, fine-tuned using the "multiplicative-LoRA" method on c4ai-command-r-v01.
Other experimental models, based off creative-writer-v0.1-alfa-35b, that attempt to encourage more diverse/creative text generation:
- creative-writer-v0.1-bravo-35b - Scaled the pre-softmax logits by 1.1 during training (and then reset after training).
- creative-writer-v0.1-charlie-35b - Scaled the pre-softmax logits by 0.9 during training (and didn't reset after training).
- [CURRENTLY TRAINING...] creative-writer-v0.1-delta-35b - Trained using Focal Loss with gamma=2 (instead of stock Cross Entropy Loss).
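For readers unfamiliar with the two training tweaks mentioned in this list, here is a rough NumPy sketch of what "scaling the pre-softmax logits" and "Focal Loss with gamma=2" could look like on toy data. This is only an illustration under my own assumptions (the function names, the toy logits, and the idea that the scale is a plain multiplier on the logits are not taken from the actual training code):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets, logit_scale=1.0):
    # Stock cross-entropy, with the pre-softmax logits optionally scaled
    # (assumption: the scale is a plain multiplier, e.g. 1.1 or 0.9).
    logp = log_softmax(logit_scale * logits)
    return -logp[np.arange(len(targets)), targets].mean()

def focal_loss(logits, targets, gamma=2.0):
    # Focal loss: (1 - p_t)^gamma down-weights tokens that the model
    # already predicts with high confidence.
    logp_t = log_softmax(logits)[np.arange(len(targets)), targets]
    p_t = np.exp(logp_t)
    return -(((1.0 - p_t) ** gamma) * logp_t).mean()

# Toy batch: two positions, three-token vocabulary (hypothetical values).
logits = np.array([[4.0, 1.0, 0.0], [0.5, 0.2, 0.1]])
targets = np.array([0, 1])

ce = cross_entropy(logits, targets)
fl = focal_loss(logits, targets, gamma=2.0)
fl0 = focal_loss(logits, targets, gamma=0.0)  # gamma=0 reduces to cross-entropy
```

With gamma=0 the focal loss is identical to cross-entropy, and with gamma=2 the confidently predicted first position contributes far less to the loss than the uncertain second one.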
Usage
- Use the normal command-r chat template: '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>prompt<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>reply...'.
- I suggest using no system prompt with this (and all other Cohere models!), as it writes much better without it IMO...
- You MUST use some (small) value of min-p with this, such as 0.01 (and with the original c4ai-command-r-v01 model), or else the model will output gibberish!
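The points above can be sketched as follows: a helper that builds the prompt with no system turn, and a toy version of min-p filtering to show what a value of 0.01 does. Both function names and the toy probabilities are hypothetical, and the filter reflects the usual definition of min-p (keep tokens whose probability is at least min-p times the top token's probability) rather than any particular backend's implementation:

```python
import numpy as np

def build_prompt(user_message):
    # Command-R chat template with no system prompt; the model's reply
    # is generated after the final CHATBOT token.
    return (
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>" + user_message
        + "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
    )

def min_p_filter(probs, min_p=0.01):
    # Zero out tokens below min_p * (top probability), then renormalise.
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

prompt = build_prompt("Write the opening paragraph of a ghost story.")

# Toy next-token distribution (hypothetical values).
probs = np.array([0.90, 0.095, 0.004, 0.001])
filtered = min_p_filter(probs, min_p=0.01)
```

Here the cutoff is 0.01 * 0.90 = 0.009, so the two low-probability tail tokens are removed before sampling; that long tail is presumably where the gibberish comes from.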
The "multiplicative-LoRA" method
Uses:
h = (I + lora_B @ lora_A) @ tensor @ x = tensor @ x + lora_B @ lora_A @ tensor @ x
or equivalently:
h = tensor @ x
h' = h + lora_B @ lora_A @ h
instead of the normal "additive-LoRA" method of:
h = (tensor + lora_B @ lora_A) @ x = tensor @ x + lora_B @ lora_A @ x
I only apply this to the down_proj matrices, and skip the last layer's down_proj matrix in the same way as creative-writing-control-vectors-v3.0.
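As a quick sanity check of the formulas above, here is a small NumPy sketch with toy dimensions (the sizes and random values are made up for illustration). It shows that the one-step and two-step forms of the multiplicative update agree, and that lora_A has a different shape in the two methods because it reads the output rather than the input:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 4, 2   # toy sizes, not the real model's

tensor = rng.normal(size=(d_out, d_in))   # stands in for a down_proj weight
x = rng.normal(size=d_in)

# Multiplicative: lora_A reads the *output* of tensor, so it is (r, d_out).
mul_A = rng.normal(size=(r, d_out))
mul_B = rng.normal(size=(d_out, r))
h = tensor @ x
h_two_step = h + mul_B @ mul_A @ h
h_one_step = (np.eye(d_out) + mul_B @ mul_A) @ tensor @ x

# Additive: lora_A reads the *input* x, so it is (r, d_in).
add_A = rng.normal(size=(r, d_in))
add_B = rng.normal(size=(d_out, r))
h_additive = (tensor + add_B @ add_A) @ x
```

The shape difference in lora_A is the reason the PEFT change further down swaps self.in_features for self.out_features.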
This currently requires hacking PEFT's layer.py like so:
#self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
self.lora_A[adapter_name] = nn.Linear(self.out_features, r, bias=False)
self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)
and:
#x = x.to(lora_A.weight.dtype)
temp = result.to(lora_A.weight.dtype)
if not self.use_dora[active_adapter]:
    #result = result + lora_B(lora_A(dropout(x))) * scaling
    result = result + lora_B(lora_A(dropout(temp))) * scaling
Then to merge you need to hack qlora-pipe's merge_lora.py to use:
old_type = tensor.dtype
tensor = tensor.to(torch.float32)
tensor += scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32) @ tensor
tensor = tensor.to(old_type)
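To see why this merge formula is the right one, the following NumPy sketch (toy sizes and random values, purely illustrative) checks that the merged weight produces the same output as the unmerged forward pass from the patched layer. Dropout is left out because it is 0.0 here, and scale = 1.0 assumes the usual lora_alpha / lora_rank scaling with both set to 64:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r = 6, 4, 2   # toy sizes, not the real model's
scale = 64 / 64            # assumption: scale = lora_alpha / lora_rank

tensor = rng.normal(size=(d_out, d_in))
lora_A = rng.normal(size=(r, d_out))
lora_B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)

# Unmerged forward pass: the adapter is applied to the layer's output.
result = tensor @ x
unmerged = result + scale * (lora_B @ (lora_A @ result))

# Merged weight: the same update folded into the weight matrix itself.
merged_tensor = tensor + scale * lora_B @ lora_A @ tensor
merged = merged_tensor @ x
```

Note that the merged matrix keeps the original shape, so the merged model needs no extra parameters or code changes at inference time.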
The "multiplicative-LoRA" method's link to control-vectors (and "abliteration")
There are actually 3 existing "multiplicative-LoRA" methods in PEFT/tuners:
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/oft (https://arxiv.org/abs/2306.07280)
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/boft (https://arxiv.org/abs/2311.06243)
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/hra (https://arxiv.org/abs/2405.17484)
but as explained in this conceptual guide:
all 3 methods deliberately maintain orthogonality (as a form of regularization; likely more suited to image generation models than LLMs), and thus are more restrictive in the types of transformations they can perform (i.e., rotations and/or improper rotations only, with no scaling or shear transformations possible...).
For example, these can't perform the orthogonal projection needed for "abliteration":
h' = h - v @ v^T @ h
whereas the general (non-orthogonal) "multiplicative-LoRA" method can (in theory) do this by choosing to set u = -v like so:
h' = h + u @ v^T @ h
This general (non-orthogonal) "multiplicative-LoRA" method can also (in theory) perform Householder Transformation(s):
h' = h - 2 * v @ v^T @ h
by choosing to set u = -2v like so:
h' = h + u @ v^T @ h
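Both special cases can be checked numerically with a rank-1 update. The sketch below uses toy random vectors and a hypothetical helper name; note that both identities assume v has unit length, which the formulas above leave implicit:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
v = rng.normal(size=d)
v /= np.linalg.norm(v)   # both identities need a unit-length v
h = rng.normal(size=d)

def rank_one_update(h, u, v):
    # h' = h + u @ v^T @ h for a single (u, v) pair.
    return h + u * (v @ h)

ablated = rank_one_update(h, -v, v)          # u = -v: orthogonal projection
reflected = rank_one_update(h, -2.0 * v, v)  # u = -2v: Householder reflection

component_after_ablation = v @ ablated       # ~0: the v direction is removed
```

The projection removes the component of h along v entirely (and can only shrink the norm), whereas the reflection flips that component's sign and leaves the norm unchanged, which is why the orthogonal methods can express the latter but not the former.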
In general, the way to think about these (non-orthogonal) "multiplicative-LoRAs" is as a kind of "conditional control-vector":
- Each vector in lora_A looks for a certain direction, and via the dot-product it generates a (signed) weighting factor that measures the similarity between the output of the down_proj transformation and the specific vector in lora_A.
- Each corresponding vector in lora_B then gets added to the hidden state / residual stream, scaled by the corresponding (signed) weighting factor.
So instead of having just a single vector that we add (in essence adding a '.bias' weight to create an affine transformation), we now have many different control vectors that can be added (stored in lora_B), based on how well they match another set of "direction detection vectors" (stored in lora_A).
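The "conditional control-vector" view can be written out explicitly as a loop over the rank dimension. This is a toy NumPy sketch (random values, made-up sizes) showing that the loop gives exactly the same result as the matrix form h + lora_B @ lora_A @ h:

```python
import numpy as np

rng = np.random.default_rng(3)
d_out, r = 4, 3   # toy sizes, not the real model's

lora_A = rng.normal(size=(r, d_out))   # rows: "direction detection vectors"
lora_B = rng.normal(size=(d_out, r))   # columns: control vectors
h = rng.normal(size=d_out)             # stands in for the output of down_proj

# One signed weighting factor per rank: dot-product of each lora_A row with h.
weights = lora_A @ h

# Add each control vector, scaled by its own weighting factor.
delta = np.zeros(d_out)
for i in range(r):
    delta += weights[i] * lora_B[:, i]

h_prime = h + delta
```

If h happens to be orthogonal to a given row of lora_A, its weighting factor is zero and the matching control vector is simply not applied, which is the "conditional" part.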
NOTE: The LoRA+ paper uses a similar way of viewing the purpose of lora_A and lora_B:
but whereas lora_A looks at the input to the transformation for "additive-LoRAs", these new (non-orthogonal) "multiplicative-LoRAs" instead use lora_A to look at the output of the (down_proj) transformation...
Training
- Took just over 4 days using dual-A6000 GPUs connected via NVLink, using qlora-pipe.
- The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same dataset_combination_mode = 'concatenate' and dataset_type = 'textfile' as tdrussell's Llama-3-70B-Instruct-Storywriter used.
- I used the same sequence_len = 8192 and batch_size_tokens = 8192 as Llama-3-70B-Instruct-Storywriter, but since I only target down_proj in a very specific way, I doubt this will affect the usable context length of the model, and 8k tokens should be around 2-3 user-AI rounds' worth of interaction in real terms.
- I used pipeline_stages = 2 and "gradient_accumulation_steps": 16 to roughly match the "tokens-per-step" that Llama-3-70B-Instruct-Storywriter used.
- I used a much lower learning-rate of 5e-6, as the 5e-5 value used by Llama-3-70B-Instruct-Storywriter dropped the evaluation loss far too quickly (likely due to adapting down_proj only being "almost convex").
- I set lora_dropout = 0.0 as it doesn't really make sense to use with epochs = 1.
- I left weight_decay = 0.01, but I'm not convinced this is really doing anything useful, and it may actually even be harming the adaptation of the early down_proj matrices, where the gradient signal is likely to be much weaker.
- I found via experimentation that setting lora_rank and lora_alpha to a very low value (as a form of Spectral Regularization) can cause the training to get stuck at saddle-points, as explained in this paper; particularly if using stock SGD instead of Adam.
- In general, I relied mainly on early stopping for Regularization and deliberately set out to undertrain the model (we can always increase the size of the dataset at a later time...).
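As a rough back-of-the-envelope check on the "tokens-per-step" figure, the arithmetic can be sketched like this. It assumes that with 2 GPUs and pipeline_stages = 2 there is a single data-parallel replica, and it ignores any padding or packing effects, so treat the step count as an estimate only:

```python
batch_size_tokens = 8192
gradient_accumulation_steps = 16
num_gpus = 2
pipeline_stages = 2

# Assumption: GPUs left over after pipeline parallelism form data-parallel
# replicas, so 2 GPUs split into 2 pipeline stages leave 1 replica.
data_parallel_degree = num_gpus // pipeline_stages

# 8192 * 16 * 1 = 131072 tokens per optimizer step.
tokens_per_step = batch_size_tokens * gradient_accumulation_steps * data_parallel_degree

dataset_tokens = 180_000_000   # "~180M tokens", so only approximate
eval_size = 0.01               # fraction held out for evaluation
approx_steps = int(dataset_tokens * (1 - eval_size) / tokens_per_step)
```

Under these assumptions a single epoch works out to somewhere in the region of 1,300 to 1,400 optimizer steps, which puts the 100 warmup steps at well under a tenth of the run.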
config_creative_writer.toml
# Paths
model = '/mnt/data/c4ai-command-r-v01'
output_dir = '/mnt/data/creative-writer-v0.1-alfa-35b'
# Lora configuration
lora_rank = 64
lora_alpha = 64
lora_dropout = 0.0
target_modules = ['down_proj']
layers_to_transform = '0:38' # skip last layer
# Optimization configuration
epochs = 1
lr_scheduler = 'constant'
warmup_steps = 100
batch_size_tokens = 8192
# Performance settings
pipeline_stages = 2
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
model_weight_dtype = 'bfloat16'
lora_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'
# Resume a prior run
resume_from_checkpoint = false
# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 1
[optimizer]
type = 'adamw_kahan'
lr = 5e-6
beta1 = 0.9
beta2 = 0.99
weight_decay = 0.01
[[datasets]]
name = 'books'
dataset_type = 'textfile'
dataset_path = '/mnt/data/datasets/ebooks/*.txt'
sequence_len = 8192
eval_size = 0.01
ds_creative_writer.json
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}