chrlu committed
Commit c1a11fb
Parent: 2a22bc1

Update README.md

Files changed (1):
  README.md +27 -19
README.md CHANGED
@@ -7,30 +7,42 @@ tags:
 datasets:
 - argilla/dpo-mix-7k
 model-index:
-- name: zephyr-7b-gemma-log_ratio_modulated_loss
+- name: DiscoPOP-zephyr-7b-gemma
   results: []
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
-# zephyr-7b-gemma-log_ratio_modulated_loss
+# DiscoPOP-zephyr-7b-gemma
 
 This model is a fine-tuned version of [HuggingFaceH4/zephyr-7b-gemma-sft-v0.1](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-sft-v0.1) on the argilla/dpo-mix-7k dataset.
 
-## Model description
-
-More information needed
-
-## Intended uses & limitations
-
-More information needed
+See the codebase to generate it here: [https://github.com/SakanaAI/DiscoPOP](https://github.com/SakanaAI/DiscoPOP)
 
-## Training and evaluation data
+## Model description
 
-More information needed
+This model is identical in training to [HuggingFaceH4/zephyr-7b-gemma-v0.1](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1), except instead of using Direct Preference Optimization (DPO), it uses DiscoPOP.
+
+DiscoPOP is our Discovered Preference Optimization algorithm, which is defined as follows:
+
+```
+def log_ratio_modulated_loss(
+    self,
+    policy_chosen_logps: torch.FloatTensor,
+    policy_rejected_logps: torch.FloatTensor,
+    reference_chosen_logps: torch.FloatTensor,
+    reference_rejected_logps: torch.FloatTensor,
+) -> torch.FloatTensor:
+    pi_logratios = policy_chosen_logps - policy_rejected_logps
+    ref_logratios = reference_chosen_logps - reference_rejected_logps
+    logits = pi_logratios - ref_logratios
+    # Modulate the mixing coefficient based on the log ratio magnitudes
+    log_ratio_modulation = torch.sigmoid(logits)
+    logistic_component = -F.logsigmoid(self.beta * logits)
+    exp_component = torch.exp(-self.beta * logits)
+    # Blend between logistic and exponential component based on log ratio modulation
+    losses = logistic_component * (1 - log_ratio_modulation) + exp_component * log_ratio_modulation
+    return losses
+```
 
-## Training procedure
 
 ### Training hyperparameters
 
@@ -49,10 +61,6 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 2
 
-### Training results
-
-
-
 ### Framework versions
 
 - Transformers 4.40.1
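For readers who want to exercise the loss outside the trainer, here is a minimal self-contained sketch of the `log_ratio_modulated_loss` added in this commit, with `self` removed and imports added. It assumes `F` in the card's snippet is `torch.nn.functional` and that `beta` plays the usual DPO inverse-temperature role; the default value below is illustrative, not taken from this card.

```python
# Standalone sketch of the DiscoPOP (log-ratio-modulated) loss above.
# Assumption: beta is the usual DPO-style inverse temperature; 0.05 here
# is only an illustrative default, not a value stated in this card.
import torch
import torch.nn.functional as F


def log_ratio_modulated_loss(
    policy_chosen_logps: torch.FloatTensor,
    policy_rejected_logps: torch.FloatTensor,
    reference_chosen_logps: torch.FloatTensor,
    reference_rejected_logps: torch.FloatTensor,
    beta: float = 0.05,  # illustrative; use the value from your training config
) -> torch.FloatTensor:
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    logits = pi_logratios - ref_logratios
    # Mixing coefficient in (0, 1), driven by the sign/magnitude of the log ratios
    log_ratio_modulation = torch.sigmoid(logits)
    logistic_component = -F.logsigmoid(beta * logits)  # DPO-style logistic loss
    exp_component = torch.exp(-beta * logits)          # exponential loss
    # Blend: exponential term dominates where logits are large and positive,
    # logistic term dominates where they are small or negative
    return (
        logistic_component * (1 - log_ratio_modulation)
        + exp_component * log_ratio_modulation
    )


if __name__ == "__main__":
    # Toy check on random per-example log-probabilities
    lp = lambda: torch.randn(4)
    print(log_ratio_modulated_loss(lp(), lp(), lp(), lp()))
```

Note how the sigmoid gate shifts weight toward the exponential term when the policy's chosen-vs-rejected log-ratio exceeds the reference's, and toward the DPO-style logistic term otherwise.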
 
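The card itself stops at the framework versions and gives no inference example. Below is a minimal generation sketch with transformers; the repo id is a placeholder assumption (substitute this card's actual repo), and chat formatting relies on the model's bundled chat template via `apply_chat_template`.

```python
# Minimal generation sketch. Assumptions: the repo id below is a placeholder,
# and the model ships a chat template as the zephyr-7b-gemma family does.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SakanaAI/DiscoPOP-zephyr-7b-gemma"  # placeholder; use this card's repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain preference optimization in one paragraph."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```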