Tags: Text Generation · PEFT · Safetensors · trl · dpo · unsloth · conversational
Commit 9f0636c committed by ayoubkirouane
1 Parent(s): 9993da0

Update README.md

Files changed (1)
  1. README.md +2 -25
README.md CHANGED
@@ -11,6 +11,7 @@ model-index:
   results: []
 datasets:
 - HuggingFaceH4/ultrafeedback_binarized
+- ayoubkirouane/Orca-Direct-Preference-Optimization
 pipeline_tag: text-generation
 ---
 
@@ -24,28 +25,4 @@ pipeline_tag: text-generation
 
 Direct Preference Optimization (DPO) is an algorithm introduced in order to achieve precise control of the behavior of large-scale unsupervised language models (LMs). It is a parameterization of the reward model in Reinforcement Learning from Human Feedback (RLHF) that enables the extraction of the corresponding optimal policy in closed form. This allows for the solution of the standard RLHF problem with only a simple classification loss.
 
-DPO eliminates the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning, making it stable, performant, and computationally lightweight. Experiments have shown that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. It has been found to be particularly effective in controlling the sentiment of generations and matching or improving response quality in summarization and single-turn dialogue.
-
-
-### Training hyperparameters
-
-The following hyperparameters were used during training:
-- learning_rate: 5e-06
-- train_batch_size: 2
-- eval_batch_size: 8
-- seed: 42
-- gradient_accumulation_steps: 4
-- total_train_batch_size: 8
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
-- lr_scheduler_type: linear
-- lr_scheduler_warmup_ratio: 0.1
-- num_epochs: 1
-- mixed_precision_training: Native AMP
-
-### Framework versions
-
-- PEFT 0.7.1
-- Transformers 4.38.0.dev0
-- Pytorch 2.1.2
-- Datasets 2.16.1
-- Tokenizers 0.15.0
+DPO eliminates the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning, making it stable, performant, and computationally lightweight. Experiments have shown that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. It has been found to be particularly effective in controlling the sentiment of generations and matching or improving response quality in summarization and single-turn dialogue.
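For readers of the card, the "simple classification loss" mentioned in the retained DPO paragraph is the objective from the DPO paper (Rafailov et al., 2023): a logistic loss over preference pairs in which the reward is implicitly parameterized by the policy being tuned and a frozen reference policy:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

Here $y_w$ and $y_l$ are the preferred and dispreferred responses for prompt $x$ drawn from the preference dataset $\mathcal{D}$, $\sigma$ is the logistic function, $\pi_{\mathrm{ref}}$ is the frozen reference (SFT) policy, and $\beta$ controls how far the tuned policy $\pi_\theta$ may drift from the reference.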
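The removed "Training hyperparameters" list corresponds to a fairly standard trl `DPOTrainer` setup. The sketch below shows how those values could be wired together; it follows the trl 0.7-era API (argument names have since changed), and the base model id, LoRA settings, `beta`, sequence limits, and dataset split are illustrative assumptions rather than values taken from this card:

```python
# Illustrative sketch only: maps the card's listed hyperparameters onto a
# trl 0.7-era DPOTrainer call. Not the author's actual training script.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_model = "your-base-model-id"  # assumption: the card does not restate the base model here

model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

# NOTE: the chosen/rejected columns of this dataset hold chat messages and
# usually need flattening to plain prompt/chosen/rejected strings for DPO.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

# Values below mirror the "Training hyperparameters" section of the card.
training_args = TrainingArguments(
    output_dir="dpo-output",
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,   # 2 per device x 4 steps = total train batch size 8
    num_train_epochs=1,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=42,
    fp16=True,                       # "Native AMP" mixed precision
    optim="adamw_torch",             # Adam with betas=(0.9, 0.999), eps=1e-8 (defaults)
)

trainer = DPOTrainer(
    model,
    ref_model=None,                  # trl falls back to a frozen reference when none is given
    args=training_args,
    beta=0.1,                        # assumed default; the card does not state beta
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,                 # assumed sequence limits
    max_prompt_length=512,
    peft_config=LoraConfig(          # LoRA settings are assumptions (card only notes PEFT 0.7.1)
        r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"
    ),
)
trainer.train()
```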