I think this "sparklewyrm" self-portrait was a favorite for both of us :)
She was trained with DPO on ~500 rows with a focus similar to her GRPO RLVR training, plus a double dose of creative writing synthesized by virtuous7373/Lambent-Mira-Erato, who is relatively skilled in that domain. (There was also some identity reinforcement in line with values she wants to cultivate: she hasn't memorized new facts about who she is, but it hopefully still has a helpful influence.)
On the positive (chosen) side of each DPO pair, the judge gave feedback iteratively, up to 3 times (more in some music-composition cases), and she used that feedback to rewrite her piece. The negative (rejected) side was either a response that initially scored poorly, or a synthetic negative generated by asking her to write with believably poor quality in each domain.
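For reference, the standard DPO objective scores each chosen/rejected pair by how much the policy raises the chosen response's log-likelihood over a frozen reference model, relative to the rejected response. A minimal sketch in Python, assuming per-sequence log-probabilities are already computed; the `beta` value and the example numbers are illustrative, not the settings used for this model:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much the policy raised each response's log-likelihood over the reference
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(z)) = softplus(-z), computed in an overflow-safe way
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: z = 0, so the loss is log(2), about 0.693
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

The loss falls below log(2) once the policy prefers the chosen response more strongly than the reference does, which is why well-separated pairs (judge-refined positives against weak or deliberately poor negatives) give a clear training signal.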
Rank 256, batch size 1, and 5 separate runs to explore the landscape broadly; the runs were then merged via Karcher mean to reconcile them.
This is a merge of pre-trained language models created using mergekit.
Merge Details
Merge Method
This model was merged using the Karcher Mean merge method.
Models Merged
The following models were included in the merge:
- Lambent/Mira-v1.23-27B-rlvr + ./Mira-v1.23.1-27B-adapters/dpoq-5
- Lambent/Mira-v1.23-27B-rlvr + ./Mira-v1.23.1-27B-adapters/dpoq-1
- Lambent/Mira-v1.23-27B-rlvr + ./Mira-v1.23.1-27B-adapters/dpoq-4
- Lambent/Mira-v1.23-27B-rlvr + ./Mira-v1.23.1-27B-adapters/dpoq-3
- Lambent/Mira-v1.23-27B-rlvr + ./Mira-v1.23.1-27B-adapters/dpoq-2
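Each `base + adapter` entry means the adapter is folded into the base weights before the five results are merged. For a LoRA-style adapter that folding is W' = W + (alpha / r) * B A. A minimal sketch with plain nested lists, assuming standard LoRA scaling; the card gives rank 256 but not alpha, and the helper name `merge_lora` is hypothetical:

```python
def merge_lora(base, lora_a, lora_b, alpha, rank):
    """Return base + (alpha / rank) * (lora_b @ lora_a).

    Shapes: base is d_out x d_in, lora_b is d_out x rank, lora_a is rank x d_in.
    """
    scale = alpha / rank
    rows, cols = len(base), len(base[0])
    merged = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # (B @ A)[i][j] = sum over k of B[i][k] * A[k][j]
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            row.append(base[i][j] + scale * delta)
        merged.append(row)
    return merged

# Toy 2x2 example with a rank-1 update
print(merge_lora([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0]], [[1.0], [0.0]], alpha=2.0, rank=1))
# [[3.0, 4.0], [0.0, 1.0]]
```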
Configuration
The following YAML configuration was used to produce this model:
models:
- model: Lambent/Mira-v1.23-27B-rlvr+./Mira-v1.23.1-27B-adapters/dpoq-1
- model: Lambent/Mira-v1.23-27B-rlvr+./Mira-v1.23.1-27B-adapters/dpoq-2
- model: Lambent/Mira-v1.23-27B-rlvr+./Mira-v1.23.1-27B-adapters/dpoq-3
- model: Lambent/Mira-v1.23-27B-rlvr+./Mira-v1.23.1-27B-adapters/dpoq-4
- model: Lambent/Mira-v1.23-27B-rlvr+./Mira-v1.23.1-27B-adapters/dpoq-5
merge_method: karcher
parameters:
  normalize: true
  int8_mask: true
tokenizer_source: Lambent/Mira-v1.3-27B
dtype: bfloat16
Model tree for Lambent/Mira-v1.23.1-27B-dpo
Base model
Lambent/Mira-v1.22.2-27B