blockblockblock committed on
Commit
b6e2911
1 Parent(s): c803a0e

Upload folder using huggingface_hub

Browse files
README.md ADDED
@@ -0,0 +1,152 @@
1
+ ---
2
+ license: apache-2.0
3
+ base_model: mistral-community/Mixtral-8x22B-v0.1
4
+ tags:
5
+ - trl
6
+ - orpo
7
+ - generated_from_trainer
8
+ datasets:
9
+ - argilla/distilabel-capybara-dpo-7k-binarized
10
+ model-index:
11
+ - name: zephyr-orpo-141b-A35b-v0.1
12
+ results: []
13
+ ---
14
+
15
+ <img src="https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1/resolve/main/logo.png" alt="Zephyr 141B Logo" width="400" style="margin-left:auto; margin-right:auto; display:block"/>
16
+
17
+
18
+ # Model Card for Zephyr 141B-A35B
19
+
20
+ Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr 141B-A35B is the latest model in the series, and is a fine-tuned version of [mistral-community/Mixtral-8x22B-v0.1](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1) that was trained using a novel alignment algorithm called [Odds Ratio Preference Optimization (ORPO)](https://huggingface.co/papers/2403.07691) with **7k instances** for **1.3 hours** on 4 nodes of 8 x H100s. ORPO does not require an SFT step to achieve high performance and is thus much more computationally efficient than methods like DPO and PPO. To train Zephyr-141B-A35B, we used the [`argilla/distilabel-capybara-dpo-7k-binarized`](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized) preference dataset, which consists of synthetic, high-quality, multi-turn preferences that have been scored via LLMs.
21
+
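As a rough illustration of the ORPO objective named above, the sketch below follows the formulation in the linked paper: a supervised (negative log-likelihood) term plus a weighted odds-ratio term that favors the chosen response over the rejected one. The function names and the `lam` weight are illustrative choices, the inputs are assumed to be length-normalized sequence log-probabilities, and this is not the training code used for this model.

```python
import math


def log_odds(logp: float) -> float:
    """Return log(p / (1 - p)) for a probability given as log p (logp < 0)."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(nll: float, logp_chosen: float, logp_rejected: float, lam: float = 0.05) -> float:
    """Sketch of L_ORPO = L_SFT + lam * L_OR for a single preference pair.

    nll           -- negative log-likelihood of the chosen response (the SFT term)
    logp_chosen   -- average per-token log-probability of the chosen response
    logp_rejected -- average per-token log-probability of the rejected response
    lam           -- weight of the odds-ratio term (assumed value, for illustration)
    """
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    l_or = math.log1p(math.exp(-log_odds_ratio))
    return nll + lam * l_or


# The penalty shrinks when the chosen response is more likely than the rejected one.
print(orpo_loss(nll=1.0, logp_chosen=-0.5, logp_rejected=-2.0))
print(orpo_loss(nll=1.0, logp_chosen=-2.0, logp_rejected=-0.5))
```

Because the odds-ratio term only needs the policy's own log-probabilities, no separate reference model has to be kept in memory.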
22
+ > [!NOTE]
23
+ > This model was trained collaboratively by Argilla, KAIST, and Hugging Face.
24
+
25
+ ## Model Details
26
+
27
+ ### Model Description
28
+
29
+ <!-- Provide a longer summary of what this model is. -->
30
+
31
+ - **Model type:** A Mixture of Experts (MoE) model with 141B total parameters and 35B active parameters. Fine-tuned on a mix of publicly available, synthetic datasets.
32
+ - **Language(s) (NLP):** Primarily English.
33
+ - **License:** Apache 2.0
34
+ - **Finetuned from model:** [mistral-community/Mixtral-8x22B-v0.1](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1)
35
+
36
+ ### Model Sources
37
+
38
+ <!-- Provide the basic links for the model. -->
39
+
40
+ - **Repository:** https://github.com/huggingface/alignment-handbook
41
+ - **Dataset:** https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized
42
+
43
+ ## Performance
44
+
45
+ Zephyr 141B-A35B was trained to test the effectiveness of ORPO at scale, and the underlying dataset covers a mix of general chat capabilities. It achieves strong performance on chat benchmarks like [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [IFEval](https://arxiv.org/abs/2311.07911). The scores reported below were obtained using the [LightEval](https://github.com/huggingface/lighteval) evaluation suite, with each prompt formatted using the model's chat template to simulate real-world usage. This is why some scores may differ from those reported in technical reports or on the Open LLM Leaderboard.
46
+
47
+ | Model | MT Bench | IFEval | BBH | AGIEval |
48
+ |-----------------------------------------------------------------------------------------------------|---------:|-------:|------:|--------:|
49
+ | [zephyr-orpo-141b-A35b-v0.1](https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1) | 8.17 | 65.06 | 58.96 | 44.16 |
50
+ | [databricks/dbrx-instruct](https://huggingface.co/databricks/dbrx-instruct) | 8.26 | 52.13 | 48.50 | 41.16 |
51
+ | [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 8.30 | 55.08 | 45.31 | 47.68 |
52
+
53
+
54
+ ## Intended uses & limitations
55
+
56
+ The model was fine-tuned on a blend of chat, code, math, and reasoning data. Here's how you can run the model using the `pipeline()` function from 🤗 Transformers:
57
+
58
+ ```python
59
+ # pip install 'transformers>=4.39.3'
60
+ # pip install accelerate
61
+
62
+ import torch
63
+ from transformers import pipeline
64
+
65
+ pipe = pipeline(
66
+ "text-generation",
67
+ model="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
68
+ device_map="auto",
69
+ torch_dtype=torch.bfloat16,
70
+ )
71
+ messages = [
72
+ {
73
+ "role": "system",
74
+ "content": "You are Zephyr, a helpful assistant.",
75
+ },
76
+ {"role": "user", "content": "Explain how Mixture of Experts work in language a child would understand."},
77
+ ]
78
+ outputs = pipe(
79
+ messages,
80
+ max_new_tokens=512,
81
+ do_sample=True,
82
+ temperature=0.7,
83
+ top_k=50,
84
+ top_p=0.95,
85
+ )
86
+ print(outputs[0]["generated_text"][-1]["content"])
87
+ ```
88
+
89
+ ## Bias, Risks, and Limitations
90
+
91
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
92
+
93
+ Zephyr 141B-A35B has not been aligned to human preferences for safety within the RLHF phase or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so).
94
+ The size and composition of the corpus used to train the base model (`mistral-community/Mixtral-8x22B-v0.1`) are also unknown; however, it is likely to have included a mix of Web data and technical sources like books and code. See the [Falcon 180B model card](https://huggingface.co/tiiuae/falcon-180B#training-data) for an example of this.
95
+
96
+
97
+ ## Training procedure
98
+
99
+ ### Training hyperparameters
100
+
101
+ The following hyperparameters were used during training:
102
+ - learning_rate: 5e-06
103
+ - train_batch_size: 1
104
+ - eval_batch_size: 8
105
+ - seed: 42
106
+ - distributed_type: multi-GPU
107
+ - num_devices: 32
108
+ - total_train_batch_size: 32
109
+ - total_eval_batch_size: 256
110
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
111
+ - lr_scheduler_type: inverse_sqrt
112
+ - lr_scheduler_warmup_steps: 100
113
+ - num_epochs: 3
114
+
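The `inverse_sqrt` scheduler with 100 warmup steps listed above can be sketched as a small function: linear warmup to the peak learning rate, then decay proportional to the inverse square root of the step. The function name is illustrative, and the formula assumes the decay timescale equals the warmup length. The `learning_rate` values in this commit's `trainer_state.json` (5e-07 at step 10, about 4.767e-06 at step 110) are consistent with it.

```python
import math


def inverse_sqrt_lr(step: int, base_lr: float = 5e-6, warmup_steps: int = 100) -> float:
    """Learning rate at a given optimizer step.

    Assumes linear warmup followed by base_lr * sqrt(warmup_steps / step).
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * math.sqrt(warmup_steps / step)


for step in (10, 100, 110, 200, 651):
    print(step, inverse_sqrt_lr(step))
```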
115
+ ### Training results
116
+
117
+
118
+
119
+ ### Framework versions
120
+
121
+ - Transformers 4.39.3
122
+ - PyTorch 2.1.2+cu121
123
+ - Datasets 2.18.0
124
+ - Tokenizers 0.15.1
125
+
126
+ ## Citation
127
+
128
+ If you find Zephyr 141B-A35B useful in your work, please cite the ORPO paper:
129
+
130
+ ```
131
+ @misc{hong2024orpo,
132
+ title={ORPO: Monolithic Preference Optimization without Reference Model},
133
+ author={Jiwoo Hong and Noah Lee and James Thorne},
134
+ year={2024},
135
+ eprint={2403.07691},
136
+ archivePrefix={arXiv},
137
+ primaryClass={cs.CL}
138
+ }
139
+ ```
140
+
141
+ You may also wish to cite the creators of this model:
142
+
143
+ ```
144
+ @misc{zephyr_141b,
145
+ author = {Alvaro Bartolome and Jiwoo Hong and Noah Lee and Kashif Rasul and Lewis Tunstall},
146
+ title = {Zephyr 141B A35B},
147
+ year = {2024},
148
+ publisher = {Hugging Face},
149
+ journal = {Hugging Face repository},
150
+ howpublished = {\url{https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1}}
151
+ }
152
+ ```
all_results.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "epoch": 3.0,
3
+ "train_loss": 0.812556631554107,
4
+ "train_runtime": 4771.9621,
5
+ "train_samples": 6932,
6
+ "train_samples_per_second": 4.358,
7
+ "train_steps_per_second": 0.136
8
+ }
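As a quick consistency check, the figures in this file can be tied back to the hyperparameters in the README (total train batch size 32, 3 epochs). The sketch below assumes the final partial batch of each epoch counts as one optimizer step; the variable names are illustrative.

```python
import math

# Values copied from all_results.json and the README hyperparameters.
train_samples = 6932
num_epochs = 3
total_train_batch_size = 32
train_runtime_s = 4771.9621

# Assumption: a trailing partial batch still counts as a step.
steps_per_epoch = math.ceil(train_samples / total_train_batch_size)
total_steps = steps_per_epoch * num_epochs

runtime_hours = train_runtime_s / 3600
samples_per_second = train_samples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(f"steps per epoch:    {steps_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"runtime (hours):    {runtime_hours:.2f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

Under that assumption the step count agrees with the `global_step` of 651 recorded in `trainer_state.json`, and the runtime rounds to the 1.3 hours quoted in the model card.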
config.json ADDED
@@ -0,0 +1,42 @@
1
+ {
2
+ "_name_or_path": "mistral-community/Mixtral-8x22B-v0.1",
3
+ "architectures": [
4
+ "MixtralForCausalLM"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 1,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "silu",
10
+ "hidden_size": 6144,
11
+ "initializer_range": 0.02,
12
+ "intermediate_size": 16384,
13
+ "max_position_embeddings": 65536,
14
+ "model_type": "mixtral",
15
+ "num_attention_heads": 48,
16
+ "num_experts_per_tok": 2,
17
+ "num_hidden_layers": 56,
18
+ "num_key_value_heads": 8,
19
+ "num_local_experts": 8,
20
+ "output_router_logits": false,
21
+ "rms_norm_eps": 1e-05,
22
+ "rope_theta": 1000000,
23
+ "router_aux_loss_coef": 0.001,
24
+ "router_jitter_noise": 0.0,
25
+ "sliding_window": null,
26
+ "tie_word_embeddings": false,
27
+ "torch_dtype": "bfloat16",
28
+ "transformers_version": "4.39.3",
29
+ "use_cache": true,
30
+ "vocab_size": 32000,
31
+ "quantization_config": {
32
+ "quant_method": "exl2",
33
+ "version": "0.0.16",
34
+ "bits": 2.5,
35
+ "head_bits": 6,
36
+ "calibration": {
37
+ "rows": 100,
38
+ "length": 2048,
39
+ "dataset": "(default)"
40
+ }
41
+ }
42
+ }
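The shape parameters in this config are enough for a back-of-the-envelope parameter count. The sketch below assumes a standard Mixtral-style layout that the config itself does not spell out: three weight matrices per expert (gated MLP), bias-free grouped-query attention with `head_dim = hidden_size / num_attention_heads`, one router matrix and two RMSNorm weight vectors per layer, and untied input and output embeddings. The function name is illustrative.

```python
cfg = {
    "hidden_size": 6144,
    "intermediate_size": 16384,
    "num_hidden_layers": 56,
    "num_attention_heads": 48,
    "num_key_value_heads": 8,
    "num_local_experts": 8,
    "num_experts_per_tok": 2,
    "vocab_size": 32000,
}


def count_params(cfg: dict, experts_counted: int) -> int:
    """Estimate parameters, counting `experts_counted` experts per layer."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]

    expert = 3 * h * cfg["intermediate_size"]    # gate, up and down projections
    attention = 2 * h * h + 2 * h * kv_dim       # q and o, plus k and v
    router = h * cfg["num_local_experts"]
    norms = 2 * h
    per_layer = experts_counted * expert + attention + router + norms

    embeddings = 2 * cfg["vocab_size"] * h       # untied input and output embeddings
    final_norm = h
    return cfg["num_hidden_layers"] * per_layer + embeddings + final_norm


total = count_params(cfg, cfg["num_local_experts"])
active = count_params(cfg, cfg["num_experts_per_tok"])
print(f"total parameters:  {total / 1e9:.1f}B")
print(f"active per token:  {active / 1e9:.1f}B")
```

The total lands close to the 141B in the model name. The two-expert count comes out near 39B under these assumptions, which is higher than the 35B in the model name; the source does not say how that figure was derived.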
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.39.3"
6
+ }
logo.png ADDED
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
output-00001-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:87792569a047f3ae41792d9ff23a1a48299fe362f66d87078e2078c4a349f956
3
+ size 8589499112
output-00002-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:03222cf50a9e547026eb5f09d5f6d64a35c5b61f326d226fed3e5c151cfbc42f
3
+ size 8590141744
output-00003-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:20daad4737f27ae2c5eff04198098ab51a62eb6a7c119574a3a1701d898f5f54
3
+ size 8590071880
output-00004-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ec04f0725b394d3c370bbc5e3979b9045a49a7e71deecf9af6ec500297f7069b
3
+ size 8590130328
output-00005-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1b1ef1d466d1aa2cbdaacba5395c5c4f6d4d62e4767a7056ee561d1ee28f4918
3
+ size 8579448312
output-00006-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c1698032a4d74492594946322a143e01656624a940513b20dd0c1803d0d199d2
3
+ size 1435487552
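The six Git LFS pointers above give the on-disk size of the quantized weights, which can be compared with the `"bits": 2.5` setting in `config.json`. The sketch below sums the shard sizes and divides by an assumed parameter count of 141B taken from the model card. The result is only approximate, since the files also hold headers and some tensors stored at higher precision (for example the 6-bit head).

```python
# Shard sizes in bytes, copied from the LFS pointer files in this commit.
shard_sizes = [
    8589499112,
    8590141744,
    8590071880,
    8590130328,
    8579448312,
    1435487552,
]

# Assumption: total parameter count as stated in the model card (141B).
assumed_params = 141e9

total_bytes = sum(shard_sizes)
bits_per_weight = total_bytes * 8 / assumed_params

print(f"total size: {total_bytes / 1e9:.2f} GB")
print(f"approx. bits per weight: {bits_per_weight:.2f}")
```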
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "</s>",
17
+ "unk_token": {
18
+ "content": "<unk>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dadfd56d766715c61d2ef780a525ab43b8e6da4de6865bda3d95fdef5e134055
3
+ size 493443
tokenizer_config.json ADDED
@@ -0,0 +1,43 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<unk>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<s>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "</s>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ }
29
+ },
30
+ "additional_special_tokens": [],
31
+ "bos_token": "<s>",
32
+ "chat_template": "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}",
33
+ "clean_up_tokenization_spaces": false,
34
+ "eos_token": "</s>",
35
+ "legacy": true,
36
+ "model_max_length": 2048,
37
+ "pad_token": "</s>",
38
+ "sp_model_kwargs": {},
39
+ "spaces_between_special_tokens": false,
40
+ "tokenizer_class": "LlamaTokenizer",
41
+ "unk_token": "<unk>",
42
+ "use_default_system_prompt": false
43
+ }
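To make the `chat_template` above easier to read, here is a plain-Python approximation of the prompt string it produces. It is a sketch, not the Jinja template itself: it assumes that newlines directly after `{% ... %}` block tags are trimmed during rendering, so each turn becomes a role tag, a newline, the content, the EOS token and a newline. The function name is illustrative.

```python
def render_zephyr_chat(messages, eos_token="</s>", add_generation_prompt=True):
    """Approximate the prompt built by the chat_template in tokenizer_config.json."""
    parts = []
    for message in messages:
        role = message["role"]
        # The template only emits text for these three roles.
        if role in ("system", "user", "assistant"):
            parts.append(f"<|{role}|>\n{message['content']}{eos_token}\n")
    # The template adds the generation prompt on the last loop iteration,
    # so nothing is added when the message list is empty.
    if messages and add_generation_prompt:
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Zephyr, a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = render_zephyr_chat(messages)
print(prompt)
```

In practice the tokenizer's own `apply_chat_template` method should be used, since it renders the stored template directly.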
train_results.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "epoch": 3.0,
3
+ "train_loss": 0.812556631554107,
4
+ "train_runtime": 4771.9621,
5
+ "train_samples": 6932,
6
+ "train_samples_per_second": 4.358,
7
+ "train_steps_per_second": 0.136
8
+ }
trainer_state.json ADDED
@@ -0,0 +1,1200 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 3.0,
5
+ "eval_steps": 500,
6
+ "global_step": 651,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.05,
13
+ "grad_norm": 117760.0,
14
+ "learning_rate": 5.000000000000001e-07,
15
+ "log_odds_chosen": 0.36438828706741333,
16
+ "log_odds_ratio": -0.6397662162780762,
17
+ "logits/chosen": 3.8861491680145264,
18
+ "logits/rejected": 5.231001853942871,
19
+ "logps/chosen": -0.9861465692520142,
20
+ "logps/rejected": -1.2529093027114868,
21
+ "loss": 1.953,
22
+ "nll_loss": 3.2415008544921875,
23
+ "rewards/accuracies": 0.6000000238418579,
24
+ "rewards/chosen": -0.04930732399225235,
25
+ "rewards/margins": 0.013338141143321991,
26
+ "rewards/rejected": -0.06264545768499374,
27
+ "step": 10
28
+ },
29
+ {
30
+ "epoch": 0.09,
31
+ "grad_norm": 29184.0,
32
+ "learning_rate": 1.0000000000000002e-06,
33
+ "log_odds_chosen": 0.17107267677783966,
34
+ "log_odds_ratio": -0.6301043033599854,
35
+ "logits/chosen": 4.779696464538574,
36
+ "logits/rejected": 5.251872539520264,
37
+ "logps/chosen": -1.1045284271240234,
38
+ "logps/rejected": -1.2445374727249146,
39
+ "loss": 1.7108,
40
+ "nll_loss": 1.8614288568496704,
41
+ "rewards/accuracies": 0.6000000238418579,
42
+ "rewards/chosen": -0.05522642284631729,
43
+ "rewards/margins": 0.007000453770160675,
44
+ "rewards/rejected": -0.06222687289118767,
45
+ "step": 20
46
+ },
47
+ {
48
+ "epoch": 0.14,
49
+ "grad_norm": 3932160.0,
50
+ "learning_rate": 1.5e-06,
51
+ "log_odds_chosen": 0.478428453207016,
52
+ "log_odds_ratio": -0.5682710409164429,
53
+ "logits/chosen": 4.58956241607666,
54
+ "logits/rejected": 5.215265274047852,
55
+ "logps/chosen": -0.9884525537490845,
56
+ "logps/rejected": -1.2604442834854126,
57
+ "loss": 2.1071,
58
+ "nll_loss": 1.525723934173584,
59
+ "rewards/accuracies": 0.699999988079071,
60
+ "rewards/chosen": -0.04942262917757034,
61
+ "rewards/margins": 0.013599586673080921,
62
+ "rewards/rejected": -0.06302221864461899,
63
+ "step": 30
64
+ },
65
+ {
66
+ "epoch": 0.18,
67
+ "grad_norm": 63232.0,
68
+ "learning_rate": 2.0000000000000003e-06,
69
+ "log_odds_chosen": 0.2763148248195648,
70
+ "log_odds_ratio": -0.6428317427635193,
71
+ "logits/chosen": 5.248695373535156,
72
+ "logits/rejected": 5.335747718811035,
73
+ "logps/chosen": -0.9019734263420105,
74
+ "logps/rejected": -1.058569312095642,
75
+ "loss": 1.6367,
76
+ "nll_loss": 1.17539644241333,
77
+ "rewards/accuracies": 0.6000000238418579,
78
+ "rewards/chosen": -0.045098677277565,
79
+ "rewards/margins": 0.007829795591533184,
80
+ "rewards/rejected": -0.052928466349840164,
81
+ "step": 40
82
+ },
83
+ {
84
+ "epoch": 0.23,
85
+ "grad_norm": 1802240.0,
86
+ "learning_rate": 2.5e-06,
87
+ "log_odds_chosen": -0.07109338045120239,
88
+ "log_odds_ratio": -0.9068069458007812,
89
+ "logits/chosen": 4.34699821472168,
90
+ "logits/rejected": 5.148941993713379,
91
+ "logps/chosen": -1.0307292938232422,
92
+ "logps/rejected": -1.0001896619796753,
93
+ "loss": 2.0499,
94
+ "nll_loss": 2.4581282138824463,
95
+ "rewards/accuracies": 0.699999988079071,
96
+ "rewards/chosen": -0.05153647065162659,
97
+ "rewards/margins": -0.0015269846189767122,
98
+ "rewards/rejected": -0.05000948905944824,
99
+ "step": 50
100
+ },
101
+ {
102
+ "epoch": 0.28,
103
+ "grad_norm": 473088.0,
104
+ "learning_rate": 3e-06,
105
+ "log_odds_chosen": 0.7022095918655396,
106
+ "log_odds_ratio": -0.47877854108810425,
107
+ "logits/chosen": 5.137725830078125,
108
+ "logits/rejected": 5.073107719421387,
109
+ "logps/chosen": -0.7480964660644531,
110
+ "logps/rejected": -1.172572374343872,
111
+ "loss": 1.9116,
112
+ "nll_loss": 1.2398216724395752,
113
+ "rewards/accuracies": 0.800000011920929,
114
+ "rewards/chosen": -0.037404827773571014,
115
+ "rewards/margins": 0.021223794668912888,
116
+ "rewards/rejected": -0.058628618717193604,
117
+ "step": 60
118
+ },
119
+ {
120
+ "epoch": 0.32,
121
+ "grad_norm": 4161536.0,
122
+ "learning_rate": 3.5e-06,
123
+ "log_odds_chosen": -0.30822521448135376,
124
+ "log_odds_ratio": -1.0616459846496582,
125
+ "logits/chosen": 4.378929615020752,
126
+ "logits/rejected": 5.239219665527344,
127
+ "logps/chosen": -1.115562081336975,
128
+ "logps/rejected": -0.8684147596359253,
129
+ "loss": 2.0166,
130
+ "nll_loss": 2.142368793487549,
131
+ "rewards/accuracies": 0.6000000238418579,
132
+ "rewards/chosen": -0.05577809736132622,
133
+ "rewards/margins": -0.012357364408671856,
134
+ "rewards/rejected": -0.043420739471912384,
135
+ "step": 70
136
+ },
137
+ {
138
+ "epoch": 0.37,
139
+ "grad_norm": 211968.0,
140
+ "learning_rate": 4.000000000000001e-06,
141
+ "log_odds_chosen": 0.38707518577575684,
142
+ "log_odds_ratio": -0.5776039361953735,
143
+ "logits/chosen": 5.019408226013184,
144
+ "logits/rejected": 5.3371453285217285,
145
+ "logps/chosen": -0.9375723004341125,
146
+ "logps/rejected": -1.1710981130599976,
147
+ "loss": 1.8918,
148
+ "nll_loss": 1.6274166107177734,
149
+ "rewards/accuracies": 0.6000000238418579,
150
+ "rewards/chosen": -0.046878617256879807,
151
+ "rewards/margins": 0.011676294729113579,
152
+ "rewards/rejected": -0.058554910123348236,
153
+ "step": 80
154
+ },
155
+ {
156
+ "epoch": 0.41,
157
+ "grad_norm": 1704.0,
158
+ "learning_rate": 4.5e-06,
159
+ "log_odds_chosen": 0.5051761865615845,
160
+ "log_odds_ratio": -0.5127500295639038,
161
+ "logits/chosen": 4.478859901428223,
162
+ "logits/rejected": 4.748915672302246,
163
+ "logps/chosen": -0.8121053576469421,
164
+ "logps/rejected": -1.1007237434387207,
165
+ "loss": 1.8015,
166
+ "nll_loss": 1.57842218875885,
167
+ "rewards/accuracies": 0.8999999761581421,
168
+ "rewards/chosen": -0.04060526192188263,
169
+ "rewards/margins": 0.014430919662117958,
170
+ "rewards/rejected": -0.055036187171936035,
171
+ "step": 90
172
+ },
173
+ {
174
+ "epoch": 0.46,
175
+ "grad_norm": 8.9375,
176
+ "learning_rate": 5e-06,
177
+ "log_odds_chosen": 1.037414789199829,
178
+ "log_odds_ratio": -0.3248421549797058,
179
+ "logits/chosen": 4.653676509857178,
180
+ "logits/rejected": 5.350204944610596,
181
+ "logps/chosen": -0.6695261001586914,
182
+ "logps/rejected": -1.300445795059204,
183
+ "loss": 0.9356,
184
+ "nll_loss": 0.7057152986526489,
185
+ "rewards/accuracies": 1.0,
186
+ "rewards/chosen": -0.03347630053758621,
187
+ "rewards/margins": 0.031545985490083694,
188
+ "rewards/rejected": -0.0650222972035408,
189
+ "step": 100
190
+ },
191
+ {
192
+ "epoch": 0.51,
193
+ "grad_norm": 2.8125,
194
+ "learning_rate": 4.767312946227961e-06,
195
+ "log_odds_chosen": 0.6677854061126709,
196
+ "log_odds_ratio": -0.5610898733139038,
197
+ "logits/chosen": 4.671368598937988,
198
+ "logits/rejected": 5.119524002075195,
199
+ "logps/chosen": -0.8684868812561035,
200
+ "logps/rejected": -1.2020485401153564,
201
+ "loss": 0.8288,
202
+ "nll_loss": 0.9947482347488403,
203
+ "rewards/accuracies": 0.699999988079071,
204
+ "rewards/chosen": -0.043424345552921295,
205
+ "rewards/margins": 0.01667807623744011,
206
+ "rewards/rejected": -0.06010241433978081,
207
+ "step": 110
208
+ },
209
+ {
210
+ "epoch": 0.55,
211
+ "grad_norm": 1.9921875,
212
+ "learning_rate": 4.564354645876385e-06,
213
+ "log_odds_chosen": 0.5597886443138123,
214
+ "log_odds_ratio": -0.5193617343902588,
215
+ "logits/chosen": 5.5248026847839355,
216
+ "logits/rejected": 6.067958354949951,
217
+ "logps/chosen": -0.9015194773674011,
218
+ "logps/rejected": -1.2406196594238281,
219
+ "loss": 0.7451,
220
+ "nll_loss": 0.8342186212539673,
221
+ "rewards/accuracies": 0.699999988079071,
222
+ "rewards/chosen": -0.045075975358486176,
223
+ "rewards/margins": 0.01695500686764717,
224
+ "rewards/rejected": -0.06203098222613335,
225
+ "step": 120
226
+ },
227
+ {
228
+ "epoch": 0.6,
229
+ "grad_norm": 2.875,
230
+ "learning_rate": 4.385290096535147e-06,
231
+ "log_odds_chosen": 0.26957136392593384,
232
+ "log_odds_ratio": -0.7732787728309631,
233
+ "logits/chosen": 4.8973588943481445,
234
+ "logits/rejected": 5.552582263946533,
235
+ "logps/chosen": -0.877202033996582,
236
+ "logps/rejected": -0.9518612623214722,
237
+ "loss": 0.7319,
238
+ "nll_loss": 0.6940464377403259,
239
+ "rewards/accuracies": 0.5,
240
+ "rewards/chosen": -0.04386010393500328,
241
+ "rewards/margins": 0.0037329583428800106,
242
+ "rewards/rejected": -0.04759306460618973,
243
+ "step": 130
244
+ },
245
+ {
246
+ "epoch": 0.65,
247
+ "grad_norm": 2.5,
248
+ "learning_rate": 4.2257712736425835e-06,
249
+ "log_odds_chosen": 0.7680839896202087,
250
+ "log_odds_ratio": -0.5321913957595825,
251
+ "logits/chosen": 5.471996307373047,
252
+ "logits/rejected": 5.644137382507324,
253
+ "logps/chosen": -0.6714180111885071,
254
+ "logps/rejected": -0.9587985277175903,
255
+ "loss": 0.732,
256
+ "nll_loss": 0.6440631151199341,
257
+ "rewards/accuracies": 0.699999988079071,
258
+ "rewards/chosen": -0.033570900559425354,
259
+ "rewards/margins": 0.014369020238518715,
260
+ "rewards/rejected": -0.047939930111169815,
261
+ "step": 140
262
+ },
263
+ {
264
+ "epoch": 0.69,
265
+ "grad_norm": 2.375,
266
+ "learning_rate": 4.082482904638631e-06,
267
+ "log_odds_chosen": 0.5068908929824829,
268
+ "log_odds_ratio": -0.604145884513855,
269
+ "logits/chosen": 5.474297523498535,
270
+ "logits/rejected": 5.376832485198975,
271
+ "logps/chosen": -0.8357957005500793,
272
+ "logps/rejected": -1.0136160850524902,
273
+ "loss": 0.706,
274
+ "nll_loss": 0.6737378835678101,
275
+ "rewards/accuracies": 0.699999988079071,
276
+ "rewards/chosen": -0.04178978502750397,
277
+ "rewards/margins": 0.00889101903885603,
278
+ "rewards/rejected": -0.05068080872297287,
279
+ "step": 150
280
+ },
281
+ {
282
+ "epoch": 0.74,
283
+ "grad_norm": 2.015625,
284
+ "learning_rate": 3.952847075210474e-06,
285
+ "log_odds_chosen": 0.615983784198761,
286
+ "log_odds_ratio": -0.4876289963722229,
287
+ "logits/chosen": 5.473410129547119,
288
+ "logits/rejected": 6.06318998336792,
289
+ "logps/chosen": -0.9676389694213867,
290
+ "logps/rejected": -1.349385142326355,
291
+ "loss": 0.6996,
292
+ "nll_loss": 0.6852242350578308,
293
+ "rewards/accuracies": 0.8999999761581421,
294
+ "rewards/chosen": -0.048381954431533813,
295
+ "rewards/margins": 0.019087309017777443,
296
+ "rewards/rejected": -0.06746925413608551,
297
+ "step": 160
298
+ },
299
+ {
300
+ "epoch": 0.78,
301
+ "grad_norm": 2.171875,
302
+ "learning_rate": 3.834824944236852e-06,
303
+ "log_odds_chosen": 0.4551977515220642,
304
+ "log_odds_ratio": -0.5428072214126587,
305
+ "logits/chosen": 4.785284042358398,
306
+ "logits/rejected": 6.005092620849609,
307
+ "logps/chosen": -0.7350739240646362,
308
+ "logps/rejected": -1.0496256351470947,
309
+ "loss": 0.6959,
310
+ "nll_loss": 0.5339438319206238,
311
+ "rewards/accuracies": 0.699999988079071,
312
+ "rewards/chosen": -0.03675369173288345,
313
+ "rewards/margins": 0.015727588906884193,
314
+ "rewards/rejected": -0.052481282502412796,
315
+ "step": 170
316
+ },
317
+ {
318
+ "epoch": 0.83,
319
+ "grad_norm": 2.015625,
320
+ "learning_rate": 3.72677996249965e-06,
321
+ "log_odds_chosen": 0.5587902665138245,
322
+ "log_odds_ratio": -0.6063727140426636,
323
+ "logits/chosen": 4.6595892906188965,
324
+ "logits/rejected": 5.4700422286987305,
325
+ "logps/chosen": -0.7482207417488098,
326
+ "logps/rejected": -0.9887701272964478,
327
+ "loss": 0.7233,
328
+ "nll_loss": 0.5874465703964233,
329
+ "rewards/accuracies": 0.6000000238418579,
330
+ "rewards/chosen": -0.03741103783249855,
331
+ "rewards/margins": 0.012027469463646412,
332
+ "rewards/rejected": -0.04943850636482239,
333
+ "step": 180
334
+ },
335
+ {
336
+ "epoch": 0.88,
337
+ "grad_norm": 2.0625,
338
+ "learning_rate": 3.6273812505500587e-06,
339
+ "log_odds_chosen": 0.9965683817863464,
340
+ "log_odds_ratio": -0.4162277281284332,
341
+ "logits/chosen": 5.304540157318115,
342
+ "logits/rejected": 5.486930847167969,
343
+ "logps/chosen": -0.7579169869422913,
344
+ "logps/rejected": -1.1843591928482056,
345
+ "loss": 0.7298,
346
+ "nll_loss": 0.6787526607513428,
347
+ "rewards/accuracies": 0.8999999761581421,
348
+ "rewards/chosen": -0.03789585083723068,
349
+ "rewards/margins": 0.021322116255760193,
350
+ "rewards/rejected": -0.059217967092990875,
351
+ "step": 190
352
+ },
353
+ {
354
+ "epoch": 0.92,
355
+ "grad_norm": 1.9609375,
356
+ "learning_rate": 3.5355339059327378e-06,
357
+ "log_odds_chosen": 0.2911016047000885,
358
+ "log_odds_ratio": -0.6208275556564331,
359
+ "logits/chosen": 5.865508556365967,
360
+ "logits/rejected": 5.9140448570251465,
361
+ "logps/chosen": -1.0318800210952759,
362
+ "logps/rejected": -1.2233208417892456,
363
+ "loss": 0.6888,
364
+ "nll_loss": 0.8277570009231567,
365
+ "rewards/accuracies": 0.699999988079071,
366
+ "rewards/chosen": -0.051594000309705734,
367
+ "rewards/margins": 0.009572046808898449,
368
+ "rewards/rejected": -0.061166055500507355,
369
+ "step": 200
370
+ },
371
+ {
372
+ "epoch": 0.97,
373
+ "grad_norm": 2.546875,
374
+ "learning_rate": 3.450327796711771e-06,
375
+ "log_odds_chosen": 0.3929597735404968,
376
+ "log_odds_ratio": -0.6252869367599487,
377
+ "logits/chosen": 5.480368137359619,
378
+ "logits/rejected": 5.818605899810791,
379
+ "logps/chosen": -0.8382253646850586,
380
+ "logps/rejected": -1.1194109916687012,
381
+ "loss": 0.703,
382
+ "nll_loss": 0.7914389967918396,
383
+ "rewards/accuracies": 0.5,
384
+ "rewards/chosen": -0.04191126674413681,
385
+ "rewards/margins": 0.014059278182685375,
386
+ "rewards/rejected": -0.05597054958343506,
387
+ "step": 210
388
+ },
389
+ {
390
+ "epoch": 1.01,
391
+ "grad_norm": 2.234375,
392
+ "learning_rate": 3.3709993123162106e-06,
393
+ "log_odds_chosen": 1.1686198711395264,
394
+ "log_odds_ratio": -0.39844751358032227,
395
+ "logits/chosen": 4.818378448486328,
396
+ "logits/rejected": 5.660789966583252,
397
+ "logps/chosen": -0.5040851831436157,
398
+ "logps/rejected": -0.9685913324356079,
399
+ "loss": 0.6554,
400
+ "nll_loss": 0.49605101346969604,
401
+ "rewards/accuracies": 0.8999999761581421,
402
+ "rewards/chosen": -0.025204259902238846,
403
+ "rewards/margins": 0.02322530373930931,
404
+ "rewards/rejected": -0.04842956364154816,
405
+ "step": 220
406
+ },
407
+ {
408
+ "epoch": 1.06,
409
+ "grad_norm": 2.046875,
410
+ "learning_rate": 3.296902366978936e-06,
411
+ "log_odds_chosen": 0.7159255743026733,
412
+ "log_odds_ratio": -0.5276229977607727,
413
+ "logits/chosen": 4.3275017738342285,
414
+ "logits/rejected": 5.1829423904418945,
415
+ "logps/chosen": -0.7593253254890442,
416
+ "logps/rejected": -1.0148638486862183,
417
+ "loss": 0.6289,
418
+ "nll_loss": 0.609928548336029,
419
+ "rewards/accuracies": 0.800000011920929,
420
+ "rewards/chosen": -0.03796626627445221,
421
+ "rewards/margins": 0.012776928022503853,
422
+ "rewards/rejected": -0.05074319988489151,
423
+ "step": 230
424
+ },
425
+ {
426
+ "epoch": 1.11,
427
+ "grad_norm": 2.375,
428
+ "learning_rate": 3.2274861218395142e-06,
429
+ "log_odds_chosen": 0.7326894998550415,
430
+ "log_odds_ratio": -0.5214331150054932,
431
+ "logits/chosen": 4.783654689788818,
432
+ "logits/rejected": 5.283537864685059,
433
+ "logps/chosen": -0.7465990781784058,
434
+ "logps/rejected": -0.9910147786140442,
435
+ "loss": 0.6382,
436
+ "nll_loss": 0.7347540855407715,
437
+ "rewards/accuracies": 0.699999988079071,
438
+ "rewards/chosen": -0.03732995316386223,
439
+ "rewards/margins": 0.012220785021781921,
440
+ "rewards/rejected": -0.04955074191093445,
441
+ "step": 240
442
+ },
443
+ {
444
+ "epoch": 1.15,
445
+ "grad_norm": 2.0625,
446
+ "learning_rate": 3.1622776601683796e-06,
447
+ "log_odds_chosen": 0.040362291038036346,
448
+ "log_odds_ratio": -0.7654204964637756,
449
+ "logits/chosen": 4.929324150085449,
450
+ "logits/rejected": 4.940483570098877,
451
+ "logps/chosen": -0.939703106880188,
452
+ "logps/rejected": -0.9395262598991394,
453
+ "loss": 0.6626,
454
+ "nll_loss": 0.7169132232666016,
455
+ "rewards/accuracies": 0.4000000059604645,
456
+ "rewards/chosen": -0.04698516055941582,
457
+ "rewards/margins": -8.843839168548584e-06,
458
+ "rewards/rejected": -0.04697632044553757,
459
+ "step": 250
460
+ },
461
+ {
462
+ "epoch": 1.2,
463
+ "grad_norm": 2.59375,
464
+ "learning_rate": 3.1008683647302113e-06,
465
+ "log_odds_chosen": 0.8304751515388489,
466
+ "log_odds_ratio": -0.4627406597137451,
467
+ "logits/chosen": 4.34907341003418,
468
+ "logits/rejected": 4.541801929473877,
469
+ "logps/chosen": -0.7797168493270874,
470
+ "logps/rejected": -1.0878037214279175,
471
+ "loss": 0.6408,
472
+ "nll_loss": 0.6424815058708191,
473
+ "rewards/accuracies": 0.800000011920929,
474
+ "rewards/chosen": -0.03898584097623825,
475
+ "rewards/margins": 0.015404346399009228,
476
+ "rewards/rejected": -0.05439019203186035,
477
+ "step": 260
478
+ },
479
+ {
480
+ "epoch": 1.24,
481
+ "grad_norm": 2.328125,
482
+ "learning_rate": 3.0429030972509227e-06,
483
+ "log_odds_chosen": 0.2547241747379303,
484
+ "log_odds_ratio": -0.7041358351707458,
485
+ "logits/chosen": 4.1212077140808105,
486
+ "logits/rejected": 5.139257431030273,
487
+ "logps/chosen": -0.5988011360168457,
488
+ "logps/rejected": -0.7647382020950317,
489
+ "loss": 0.6441,
490
+ "nll_loss": 0.4384763836860657,
491
+ "rewards/accuracies": 0.6000000238418579,
492
+ "rewards/chosen": -0.029940057545900345,
493
+ "rewards/margins": 0.008296851068735123,
494
+ "rewards/rejected": -0.03823690861463547,
495
+ "step": 270
496
+ },
497
+ {
498
+ "epoch": 1.29,
499
+ "grad_norm": 1.9609375,
500
+ "learning_rate": 2.988071523335984e-06,
501
+ "log_odds_chosen": 0.7432643175125122,
502
+ "log_odds_ratio": -0.4928904175758362,
503
+ "logits/chosen": 4.240169525146484,
504
+ "logits/rejected": 4.746310234069824,
505
+ "logps/chosen": -0.7583116292953491,
506
+ "logps/rejected": -1.0217373371124268,
507
+ "loss": 0.6349,
508
+ "nll_loss": 0.5912537574768066,
509
+ "rewards/accuracies": 0.6000000238418579,
510
+ "rewards/chosen": -0.03791557624936104,
511
+ "rewards/margins": 0.013171288184821606,
512
+ "rewards/rejected": -0.05108686536550522,
513
+ "step": 280
514
+ },
515
+ {
516
+ "epoch": 1.34,
517
+ "grad_norm": 2.5,
518
+ "learning_rate": 2.9361010975735177e-06,
519
+ "log_odds_chosen": 0.6404408812522888,
520
+ "log_odds_ratio": -0.5461726784706116,
521
+ "logits/chosen": 4.347890377044678,
522
+ "logits/rejected": 5.2955708503723145,
523
+ "logps/chosen": -0.8145158886909485,
524
+ "logps/rejected": -1.124975323677063,
525
+ "loss": 0.6204,
526
+ "nll_loss": 0.5651360154151917,
527
+ "rewards/accuracies": 0.6000000238418579,
528
+ "rewards/chosen": -0.040725789964199066,
529
+ "rewards/margins": 0.015522971749305725,
530
+ "rewards/rejected": -0.05624876171350479,
531
+ "step": 290
532
+ },
533
+ {
534
+ "epoch": 1.38,
535
+ "grad_norm": 1.96875,
536
+ "learning_rate": 2.8867513459481293e-06,
537
+ "log_odds_chosen": 0.4704459607601166,
538
+ "log_odds_ratio": -0.6623938083648682,
539
+ "logits/chosen": 4.255876064300537,
540
+ "logits/rejected": 5.063040733337402,
541
+ "logps/chosen": -0.7718355059623718,
542
+ "logps/rejected": -1.144460916519165,
543
+ "loss": 0.6404,
544
+ "nll_loss": 0.6724303364753723,
545
+ "rewards/accuracies": 0.699999988079071,
546
+ "rewards/chosen": -0.03859177231788635,
547
+ "rewards/margins": 0.01863126829266548,
548
+ "rewards/rejected": -0.05722304433584213,
549
+ "step": 300
550
+ },
551
+ {
552
+ "epoch": 1.43,
553
+ "grad_norm": 2.453125,
554
+ "learning_rate": 2.839809171235324e-06,
555
+ "log_odds_chosen": 1.5952459573745728,
556
+ "log_odds_ratio": -0.2707791328430176,
557
+ "logits/chosen": 2.7694969177246094,
558
+ "logits/rejected": 5.479510307312012,
559
+ "logps/chosen": -0.4962679445743561,
560
+ "logps/rejected": -1.2316776514053345,
561
+ "loss": 0.6428,
562
+ "nll_loss": 0.3623020648956299,
563
+ "rewards/accuracies": 1.0,
564
+ "rewards/chosen": -0.024813394993543625,
565
+ "rewards/margins": 0.03677048534154892,
566
+ "rewards/rejected": -0.06158388406038284,
567
+ "step": 310
568
+ },
569
+ {
570
+ "epoch": 1.47,
571
+ "grad_norm": 2.078125,
572
+ "learning_rate": 2.7950849718747376e-06,
573
+ "log_odds_chosen": 0.4402007460594177,
574
+ "log_odds_ratio": -0.5388344526290894,
575
+ "logits/chosen": 4.8701372146606445,
576
+ "logits/rejected": 4.049181938171387,
577
+ "logps/chosen": -0.8427563905715942,
578
+ "logps/rejected": -1.1280080080032349,
579
+ "loss": 0.6661,
580
+ "nll_loss": 0.6774541735649109,
581
+ "rewards/accuracies": 0.699999988079071,
582
+ "rewards/chosen": -0.04213782027363777,
583
+ "rewards/margins": 0.014262576587498188,
584
+ "rewards/rejected": -0.056400395929813385,
585
+ "step": 320
586
+ },
587
+ {
588
+ "epoch": 1.52,
589
+ "grad_norm": 1.9765625,
590
+ "learning_rate": 2.752409412815902e-06,
591
+ "log_odds_chosen": 1.4536019563674927,
592
+ "log_odds_ratio": -0.3178521990776062,
593
+ "logits/chosen": 4.046222686767578,
594
+ "logits/rejected": 4.855486869812012,
595
+ "logps/chosen": -0.4614998400211334,
596
+ "logps/rejected": -1.0025476217269897,
597
+ "loss": 0.6396,
598
+ "nll_loss": 0.46759381890296936,
599
+ "rewards/accuracies": 0.8999999761581421,
600
+ "rewards/chosen": -0.02307499200105667,
601
+ "rewards/margins": 0.027052391320466995,
602
+ "rewards/rejected": -0.05012737959623337,
603
+ "step": 330
604
+ },
605
+ {
606
+ "epoch": 1.57,
607
+ "grad_norm": 2.65625,
608
+ "learning_rate": 2.711630722733202e-06,
609
+ "log_odds_chosen": 0.4552677273750305,
610
+ "log_odds_ratio": -0.5441101789474487,
611
+ "logits/chosen": 4.233187198638916,
612
+ "logits/rejected": 4.776756286621094,
613
+ "logps/chosen": -0.9984881281852722,
614
+ "logps/rejected": -1.3039405345916748,
615
+ "loss": 0.6326,
616
+ "nll_loss": 0.7266319990158081,
617
+ "rewards/accuracies": 0.699999988079071,
618
+ "rewards/chosen": -0.04992440715432167,
619
+ "rewards/margins": 0.01527262944728136,
620
+ "rewards/rejected": -0.06519703567028046,
621
+ "step": 340
622
+ },
623
+ {
624
+ "epoch": 1.61,
625
+ "grad_norm": 1.9609375,
626
+ "learning_rate": 2.6726124191242444e-06,
627
+ "log_odds_chosen": 0.3951299488544464,
628
+ "log_odds_ratio": -0.6442996263504028,
629
+ "logits/chosen": 4.592418193817139,
630
+ "logits/rejected": 4.885247707366943,
631
+ "logps/chosen": -0.9690208435058594,
632
+ "logps/rejected": -1.1191128492355347,
633
+ "loss": 0.6271,
634
+ "nll_loss": 0.7028160095214844,
635
+ "rewards/accuracies": 0.6000000238418579,
636
+ "rewards/chosen": -0.04845104366540909,
637
+ "rewards/margins": 0.007504602428525686,
638
+ "rewards/rejected": -0.055955640971660614,
639
+ "step": 350
640
+ },
641
+ {
642
+ "epoch": 1.66,
643
+ "grad_norm": 2.109375,
644
+ "learning_rate": 2.6352313834736496e-06,
645
+ "log_odds_chosen": 0.6397253274917603,
646
+ "log_odds_ratio": -0.4948647916316986,
647
+ "logits/chosen": 3.1035220623016357,
648
+ "logits/rejected": 4.4074320793151855,
649
+ "logps/chosen": -0.7063679695129395,
650
+ "logps/rejected": -1.086042881011963,
651
+ "loss": 0.6133,
652
+ "nll_loss": 0.5209956765174866,
653
+ "rewards/accuracies": 0.800000011920929,
654
+ "rewards/chosen": -0.03531839698553085,
655
+ "rewards/margins": 0.0189837496727705,
656
+ "rewards/rejected": -0.054302144795656204,
657
+ "step": 360
658
+ },
659
+ {
660
+ "epoch": 1.71,
661
+ "grad_norm": 1.9296875,
662
+ "learning_rate": 2.599376224550182e-06,
663
+ "log_odds_chosen": 0.5072129368782043,
664
+ "log_odds_ratio": -0.5375211834907532,
665
+ "logits/chosen": 4.4618144035339355,
666
+ "logits/rejected": 4.897726535797119,
667
+ "logps/chosen": -0.8658114671707153,
668
+ "logps/rejected": -1.161678433418274,
669
+ "loss": 0.625,
670
+ "nll_loss": 0.7147814035415649,
671
+ "rewards/accuracies": 0.800000011920929,
672
+ "rewards/chosen": -0.043290577828884125,
673
+ "rewards/margins": 0.014793348498642445,
674
+ "rewards/rejected": -0.058083921670913696,
675
+ "step": 370
676
+ },
677
+ {
678
+ "epoch": 1.75,
679
+ "grad_norm": 2.28125,
680
+ "learning_rate": 2.564945880212886e-06,
681
+ "log_odds_chosen": 0.5736058950424194,
682
+ "log_odds_ratio": -0.4948197305202484,
683
+ "logits/chosen": 4.31764554977417,
684
+ "logits/rejected": 4.153486251831055,
685
+ "logps/chosen": -0.8540223836898804,
686
+ "logps/rejected": -1.1471771001815796,
687
+ "loss": 0.6393,
688
+ "nll_loss": 0.6763076186180115,
689
+ "rewards/accuracies": 0.699999988079071,
690
+ "rewards/chosen": -0.0427011176943779,
691
+ "rewards/margins": 0.014657735824584961,
692
+ "rewards/rejected": -0.05735884979367256,
693
+ "step": 380
694
+ },
695
+ {
696
+ "epoch": 1.8,
697
+ "grad_norm": 3.640625,
698
+ "learning_rate": 2.5318484177091667e-06,
699
+ "log_odds_chosen": 0.8381564021110535,
700
+ "log_odds_ratio": -0.5308811068534851,
701
+ "logits/chosen": 4.037534236907959,
702
+ "logits/rejected": 5.888669013977051,
703
+ "logps/chosen": -0.700161337852478,
704
+ "logps/rejected": -1.2042081356048584,
705
+ "loss": 0.6318,
706
+ "nll_loss": 0.5512461066246033,
707
+ "rewards/accuracies": 0.800000011920929,
708
+ "rewards/chosen": -0.03500806540250778,
709
+ "rewards/margins": 0.025202345103025436,
710
+ "rewards/rejected": -0.060210417956113815,
711
+ "step": 390
712
+ },
713
+ {
714
+ "epoch": 1.84,
715
+ "grad_norm": 2.171875,
716
+ "learning_rate": 2.5e-06,
717
+ "log_odds_chosen": 0.7038768529891968,
718
+ "log_odds_ratio": -0.43052348494529724,
719
+ "logits/chosen": 3.822885036468506,
720
+ "logits/rejected": 4.210227012634277,
721
+ "logps/chosen": -0.6150542497634888,
722
+ "logps/rejected": -0.9889954328536987,
723
+ "loss": 0.6218,
724
+ "nll_loss": 0.5013046264648438,
725
+ "rewards/accuracies": 0.8999999761581421,
726
+ "rewards/chosen": -0.030752714723348618,
727
+ "rewards/margins": 0.01869705691933632,
728
+ "rewards/rejected": -0.04944976791739464,
729
+ "step": 400
730
+ },
731
+ {
732
+ "epoch": 1.89,
733
+ "grad_norm": 2.03125,
734
+ "learning_rate": 2.4693239916239746e-06,
735
+ "log_odds_chosen": 0.49417972564697266,
736
+ "log_odds_ratio": -0.5454962253570557,
737
+ "logits/chosen": 3.7158710956573486,
738
+ "logits/rejected": 4.625822067260742,
739
+ "logps/chosen": -0.7136448621749878,
740
+ "logps/rejected": -0.9806584119796753,
741
+ "loss": 0.6163,
742
+ "nll_loss": 0.5766875147819519,
743
+ "rewards/accuracies": 0.800000011920929,
744
+ "rewards/chosen": -0.03568224236369133,
745
+ "rewards/margins": 0.013350683264434338,
746
+ "rewards/rejected": -0.04903292655944824,
747
+ "step": 410
748
+ },
749
+ {
750
+ "epoch": 1.94,
751
+ "grad_norm": 1.96875,
752
+ "learning_rate": 2.4397501823713327e-06,
753
+ "log_odds_chosen": 1.2905668020248413,
754
+ "log_odds_ratio": -0.3054632544517517,
755
+ "logits/chosen": 4.375031471252441,
756
+ "logits/rejected": 5.165828704833984,
757
+ "logps/chosen": -0.6634560823440552,
758
+ "logps/rejected": -1.2297804355621338,
759
+ "loss": 0.6299,
760
+ "nll_loss": 0.5654190182685852,
761
+ "rewards/accuracies": 0.8999999761581421,
762
+ "rewards/chosen": -0.03317280486226082,
763
+ "rewards/margins": 0.02831621840596199,
764
+ "rewards/rejected": -0.06148902326822281,
765
+ "step": 420
766
+ },
767
+ {
768
+ "epoch": 1.98,
769
+ "grad_norm": 2.234375,
770
+ "learning_rate": 2.411214110852061e-06,
771
+ "log_odds_chosen": 0.4614163041114807,
772
+ "log_odds_ratio": -0.5477044582366943,
773
+ "logits/chosen": 3.945091724395752,
774
+ "logits/rejected": 4.783943176269531,
775
+ "logps/chosen": -0.670985758304596,
776
+ "logps/rejected": -0.8528381586074829,
777
+ "loss": 0.6328,
778
+ "nll_loss": 0.5353778004646301,
779
+ "rewards/accuracies": 0.800000011920929,
780
+ "rewards/chosen": -0.03354928642511368,
781
+ "rewards/margins": 0.009092616848647594,
782
+ "rewards/rejected": -0.04264190047979355,
783
+ "step": 430
784
+ },
785
+ {
786
+ "epoch": 2.03,
787
+ "grad_norm": 2.140625,
788
+ "learning_rate": 2.3836564731139807e-06,
789
+ "log_odds_chosen": 0.519318699836731,
790
+ "log_odds_ratio": -0.5034213066101074,
791
+ "logits/chosen": 3.990828037261963,
792
+ "logits/rejected": 4.283727645874023,
793
+ "logps/chosen": -0.7843809723854065,
794
+ "logps/rejected": -1.1084554195404053,
795
+ "loss": 0.598,
796
+ "nll_loss": 0.6064985394477844,
797
+ "rewards/accuracies": 0.699999988079071,
798
+ "rewards/chosen": -0.03921904414892197,
799
+ "rewards/margins": 0.01620371639728546,
800
+ "rewards/rejected": -0.055422764271497726,
801
+ "step": 440
802
+ },
803
+ {
804
+ "epoch": 2.07,
805
+ "grad_norm": 2.015625,
806
+ "learning_rate": 2.357022603955159e-06,
807
+ "log_odds_chosen": 1.2161670923233032,
808
+ "log_odds_ratio": -0.5558447241783142,
809
+ "logits/chosen": 2.7631869316101074,
810
+ "logits/rejected": 4.014997959136963,
811
+ "logps/chosen": -0.4891352653503418,
812
+ "logps/rejected": -1.057556390762329,
813
+ "loss": 0.6063,
814
+ "nll_loss": 0.5005042552947998,
815
+ "rewards/accuracies": 0.800000011920929,
816
+ "rewards/chosen": -0.02445676364004612,
817
+ "rewards/margins": 0.028421055525541306,
818
+ "rewards/rejected": -0.052877821028232574,
819
+ "step": 450
820
+ },
821
+ {
822
+ "epoch": 2.12,
823
+ "grad_norm": 2.0625,
824
+ "learning_rate": 2.3312620206007847e-06,
825
+ "log_odds_chosen": 0.8278636932373047,
826
+ "log_odds_ratio": -0.43884754180908203,
827
+ "logits/chosen": 4.009448051452637,
828
+ "logits/rejected": 4.671367645263672,
829
+ "logps/chosen": -0.7134698629379272,
830
+ "logps/rejected": -1.146784782409668,
831
+ "loss": 0.5862,
832
+ "nll_loss": 0.5619599223136902,
833
+ "rewards/accuracies": 0.800000011920929,
834
+ "rewards/chosen": -0.03567349165678024,
835
+ "rewards/margins": 0.021665748208761215,
836
+ "rewards/rejected": -0.05733924359083176,
837
+ "step": 460
838
+ },
839
+ {
840
+ "epoch": 2.17,
841
+ "grad_norm": 2.609375,
842
+ "learning_rate": 2.3063280200722128e-06,
843
+ "log_odds_chosen": 1.677671194076538,
844
+ "log_odds_ratio": -0.2895694375038147,
845
+ "logits/chosen": 2.985790491104126,
846
+ "logits/rejected": 4.190914630889893,
847
+ "logps/chosen": -0.5018793344497681,
848
+ "logps/rejected": -1.0572091341018677,
849
+ "loss": 0.5765,
850
+ "nll_loss": 0.5001329183578491,
851
+ "rewards/accuracies": 0.8999999761581421,
852
+ "rewards/chosen": -0.025093963369727135,
853
+ "rewards/margins": 0.02776649035513401,
854
+ "rewards/rejected": -0.05286044999957085,
855
+ "step": 470
856
+ },
857
+ {
858
+ "epoch": 2.21,
859
+ "grad_norm": 2.0,
860
+ "learning_rate": 2.2821773229381924e-06,
861
+ "log_odds_chosen": 1.0791471004486084,
862
+ "log_odds_ratio": -0.37350553274154663,
863
+ "logits/chosen": 3.676426649093628,
864
+ "logits/rejected": 3.8374907970428467,
865
+ "logps/chosen": -0.7438164353370667,
866
+ "logps/rejected": -1.29355788230896,
867
+ "loss": 0.5652,
868
+ "nll_loss": 0.6555451154708862,
869
+ "rewards/accuracies": 0.800000011920929,
870
+ "rewards/chosen": -0.037190817296504974,
871
+ "rewards/margins": 0.027487074956297874,
872
+ "rewards/rejected": -0.0646779015660286,
873
+ "step": 480
874
+ },
875
+ {
876
+ "epoch": 2.26,
877
+ "grad_norm": 2.21875,
878
+ "learning_rate": 2.2587697572631284e-06,
879
+ "log_odds_chosen": 0.275502473115921,
880
+ "log_odds_ratio": -0.7135687470436096,
881
+ "logits/chosen": 4.321534156799316,
882
+ "logits/rejected": 4.41732120513916,
883
+ "logps/chosen": -0.9727070927619934,
884
+ "logps/rejected": -1.0810346603393555,
885
+ "loss": 0.5952,
886
+ "nll_loss": 0.7110171914100647,
887
+ "rewards/accuracies": 0.6000000238418579,
888
+ "rewards/chosen": -0.04863535612821579,
889
+ "rewards/margins": 0.005416377447545528,
890
+ "rewards/rejected": -0.05405173450708389,
891
+ "step": 490
892
+ },
893
+ {
894
+ "epoch": 2.3,
895
+ "grad_norm": 2.15625,
896
+ "learning_rate": 2.23606797749979e-06,
897
+ "log_odds_chosen": 0.34863442182540894,
898
+ "log_odds_ratio": -0.6463712453842163,
899
+ "logits/chosen": 4.6876606941223145,
900
+ "logits/rejected": 5.054124355316162,
901
+ "logps/chosen": -0.9338000416755676,
902
+ "logps/rejected": -1.1037800312042236,
903
+ "loss": 0.5953,
904
+ "nll_loss": 0.8528131246566772,
905
+ "rewards/accuracies": 0.699999988079071,
906
+ "rewards/chosen": -0.04669000208377838,
907
+ "rewards/margins": 0.008499005809426308,
908
+ "rewards/rejected": -0.05518900603055954,
909
+ "step": 500
910
+ },
911
+ {
912
+ "epoch": 2.35,
913
+ "grad_norm": 2.171875,
914
+ "learning_rate": 2.2140372138502386e-06,
915
+ "log_odds_chosen": 0.9548345804214478,
916
+ "log_odds_ratio": -0.39882007241249084,
917
+ "logits/chosen": 3.5289406776428223,
918
+ "logits/rejected": 3.8287463188171387,
919
+ "logps/chosen": -0.6570809483528137,
920
+ "logps/rejected": -1.1388274431228638,
921
+ "loss": 0.609,
922
+ "nll_loss": 0.5968061685562134,
923
+ "rewards/accuracies": 0.8999999761581421,
924
+ "rewards/chosen": -0.032854050397872925,
925
+ "rewards/margins": 0.024087321013212204,
926
+ "rewards/rejected": -0.05694136768579483,
927
+ "step": 510
928
+ },
929
+ {
930
+ "epoch": 2.4,
931
+ "grad_norm": 1.9609375,
932
+ "learning_rate": 2.1926450482675734e-06,
933
+ "log_odds_chosen": 0.4539831280708313,
934
+ "log_odds_ratio": -0.5872747302055359,
935
+ "logits/chosen": 3.2061939239501953,
936
+ "logits/rejected": 4.589787006378174,
937
+ "logps/chosen": -0.7979894280433655,
938
+ "logps/rejected": -1.0285401344299316,
939
+ "loss": 0.5827,
940
+ "nll_loss": 0.6084668636322021,
941
+ "rewards/accuracies": 0.699999988079071,
942
+ "rewards/chosen": -0.039899468421936035,
943
+ "rewards/margins": 0.011527536436915398,
944
+ "rewards/rejected": -0.051426999270915985,
945
+ "step": 520
946
+ },
947
+ {
948
+ "epoch": 2.44,
949
+ "grad_norm": 2.484375,
950
+ "learning_rate": 2.1718612138153473e-06,
951
+ "log_odds_chosen": 0.8493059277534485,
952
+ "log_odds_ratio": -0.6372500658035278,
953
+ "logits/chosen": 3.078615665435791,
954
+ "logits/rejected": 4.099945068359375,
955
+ "logps/chosen": -0.6704202890396118,
956
+ "logps/rejected": -0.7899671792984009,
957
+ "loss": 0.5788,
958
+ "nll_loss": 0.5733928084373474,
959
+ "rewards/accuracies": 0.5,
960
+ "rewards/chosen": -0.03352101519703865,
961
+ "rewards/margins": 0.005977341439574957,
962
+ "rewards/rejected": -0.039498358964920044,
963
+ "step": 530
964
+ },
965
+ {
966
+ "epoch": 2.49,
967
+ "grad_norm": 1.859375,
968
+ "learning_rate": 2.151657414559676e-06,
969
+ "log_odds_chosen": 0.6374627351760864,
970
+ "log_odds_ratio": -0.5592355728149414,
971
+ "logits/chosen": 3.680483341217041,
972
+ "logits/rejected": 3.9816291332244873,
973
+ "logps/chosen": -0.8559755086898804,
974
+ "logps/rejected": -1.1612054109573364,
975
+ "loss": 0.6003,
976
+ "nll_loss": 0.6403124928474426,
977
+ "rewards/accuracies": 0.6000000238418579,
978
+ "rewards/chosen": -0.04279877990484238,
979
+ "rewards/margins": 0.015261486172676086,
980
+ "rewards/rejected": -0.05806026607751846,
981
+ "step": 540
982
+ },
983
+ {
984
+ "epoch": 2.53,
985
+ "grad_norm": 1.8984375,
986
+ "learning_rate": 2.132007163556104e-06,
987
+ "log_odds_chosen": 1.399209976196289,
988
+ "log_odds_ratio": -0.5735031366348267,
989
+ "logits/chosen": 3.132289171218872,
990
+ "logits/rejected": 3.5427193641662598,
991
+ "logps/chosen": -0.5963010191917419,
992
+ "logps/rejected": -0.9639393091201782,
993
+ "loss": 0.5984,
994
+ "nll_loss": 0.5058175325393677,
995
+ "rewards/accuracies": 0.6000000238418579,
996
+ "rewards/chosen": -0.029815051704645157,
997
+ "rewards/margins": 0.018381912261247635,
998
+ "rewards/rejected": -0.04819696769118309,
999
+ "step": 550
1000
+ },
1001
+ {
1002
+ "epoch": 2.58,
1003
+ "grad_norm": 1.859375,
1004
+ "learning_rate": 2.1128856368212917e-06,
1005
+ "log_odds_chosen": 0.688880443572998,
1006
+ "log_odds_ratio": -0.4902462959289551,
1007
+ "logits/chosen": 2.6950721740722656,
1008
+ "logits/rejected": 3.1528286933898926,
1009
+ "logps/chosen": -0.6383022665977478,
1010
+ "logps/rejected": -0.9691828489303589,
1011
+ "loss": 0.5718,
1012
+ "nll_loss": 0.4289799630641937,
1013
+ "rewards/accuracies": 0.800000011920929,
1014
+ "rewards/chosen": -0.03191510960459709,
1015
+ "rewards/margins": 0.016544032841920853,
1016
+ "rewards/rejected": -0.048459142446517944,
1017
+ "step": 560
1018
+ },
1019
+ {
1020
+ "epoch": 2.63,
1021
+ "grad_norm": 2.421875,
1022
+ "learning_rate": 2.0942695414584777e-06,
1023
+ "log_odds_chosen": 1.3283271789550781,
1024
+ "log_odds_ratio": -0.3012233078479767,
1025
+ "logits/chosen": 3.4564871788024902,
1026
+ "logits/rejected": 4.7043867111206055,
1027
+ "logps/chosen": -0.6779360771179199,
1028
+ "logps/rejected": -1.523970365524292,
1029
+ "loss": 0.6138,
1030
+ "nll_loss": 0.5768535137176514,
1031
+ "rewards/accuracies": 1.0,
1032
+ "rewards/chosen": -0.033896803855895996,
1033
+ "rewards/margins": 0.042301714420318604,
1034
+ "rewards/rejected": -0.0761985182762146,
1035
+ "step": 570
1036
+ },
1037
+ {
1038
+ "epoch": 2.67,
1039
+ "grad_norm": 1.953125,
1040
+ "learning_rate": 2.0761369963434992e-06,
1041
+ "log_odds_chosen": 1.4566174745559692,
1042
+ "log_odds_ratio": -0.32581037282943726,
1043
+ "logits/chosen": 2.691676616668701,
1044
+ "logits/rejected": 4.661564826965332,
1045
+ "logps/chosen": -0.4493564963340759,
1046
+ "logps/rejected": -1.0139671564102173,
1047
+ "loss": 0.5782,
1048
+ "nll_loss": 0.37120580673217773,
1049
+ "rewards/accuracies": 0.800000011920929,
1050
+ "rewards/chosen": -0.022467825561761856,
1051
+ "rewards/margins": 0.028230536729097366,
1052
+ "rewards/rejected": -0.05069836229085922,
1053
+ "step": 580
1054
+ },
1055
+ {
1056
+ "epoch": 2.72,
1057
+ "grad_norm": 2.0625,
1058
+ "learning_rate": 2.058467423981546e-06,
1059
+ "log_odds_chosen": 1.0190517902374268,
1060
+ "log_odds_ratio": -0.5730624198913574,
1061
+ "logits/chosen": 3.407086133956909,
1062
+ "logits/rejected": 4.482596397399902,
1063
+ "logps/chosen": -0.7345553040504456,
1064
+ "logps/rejected": -0.9309635162353516,
1065
+ "loss": 0.5723,
1066
+ "nll_loss": 0.5519307851791382,
1067
+ "rewards/accuracies": 0.5,
1068
+ "rewards/chosen": -0.03672776371240616,
1069
+ "rewards/margins": 0.009820410050451756,
1070
+ "rewards/rejected": -0.04654817283153534,
1071
+ "step": 590
1072
+ },
1073
+ {
1074
+ "epoch": 2.76,
1075
+ "grad_norm": 2.375,
1076
+ "learning_rate": 2.0412414523193154e-06,
1077
+ "log_odds_chosen": 1.107779860496521,
1078
+ "log_odds_ratio": -0.40593117475509644,
1079
+ "logits/chosen": 3.215078830718994,
1080
+ "logits/rejected": 4.503358840942383,
1081
+ "logps/chosen": -0.663019597530365,
1082
+ "logps/rejected": -1.2786920070648193,
1083
+ "loss": 0.5815,
1084
+ "nll_loss": 0.5633824467658997,
1085
+ "rewards/accuracies": 0.800000011920929,
1086
+ "rewards/chosen": -0.03315097838640213,
1087
+ "rewards/margins": 0.030783619731664658,
1088
+ "rewards/rejected": -0.06393460184335709,
1089
+ "step": 600
1090
+ },
1091
+ {
1092
+ "epoch": 2.81,
1093
+ "grad_norm": 2.09375,
1094
+ "learning_rate": 2.0244408254472904e-06,
1095
+ "log_odds_chosen": 0.7602224349975586,
1096
+ "log_odds_ratio": -0.5018362998962402,
1097
+ "logits/chosen": 3.604353666305542,
1098
+ "logits/rejected": 4.481316089630127,
1099
+ "logps/chosen": -0.7105517387390137,
1100
+ "logps/rejected": -1.0740478038787842,
1101
+ "loss": 0.5873,
1102
+ "nll_loss": 0.5312780737876892,
1103
+ "rewards/accuracies": 0.699999988079071,
1104
+ "rewards/chosen": -0.035527586936950684,
1105
+ "rewards/margins": 0.018174810335040092,
1106
+ "rewards/rejected": -0.05370239168405533,
1107
+ "step": 610
1108
+ },
1109
+ {
1110
+ "epoch": 2.86,
1111
+ "grad_norm": 1.90625,
1112
+ "learning_rate": 2.0080483222562476e-06,
1113
+ "log_odds_chosen": 1.3286904096603394,
1114
+ "log_odds_ratio": -0.36574870347976685,
1115
+ "logits/chosen": 3.620469331741333,
1116
+ "logits/rejected": 4.373411655426025,
1117
+ "logps/chosen": -0.4990506172180176,
1118
+ "logps/rejected": -0.953050971031189,
1119
+ "loss": 0.5716,
1120
+ "nll_loss": 0.5527733564376831,
1121
+ "rewards/accuracies": 0.8999999761581421,
1122
+ "rewards/chosen": -0.024952532723546028,
1123
+ "rewards/margins": 0.022700021043419838,
1124
+ "rewards/rejected": -0.047652553766965866,
1125
+ "step": 620
1126
+ },
1127
+ {
1128
+ "epoch": 2.9,
1129
+ "grad_norm": 2.359375,
1130
+ "learning_rate": 1.9920476822239895e-06,
1131
+ "log_odds_chosen": 0.4847317636013031,
1132
+ "log_odds_ratio": -0.5640643835067749,
1133
+ "logits/chosen": 3.125113010406494,
1134
+ "logits/rejected": 3.340205669403076,
1135
+ "logps/chosen": -0.8360971212387085,
1136
+ "logps/rejected": -1.0480194091796875,
1137
+ "loss": 0.5738,
1138
+ "nll_loss": 0.6136351823806763,
1139
+ "rewards/accuracies": 0.699999988079071,
1140
+ "rewards/chosen": -0.041804857552051544,
1141
+ "rewards/margins": 0.010596117004752159,
1142
+ "rewards/rejected": -0.05240097641944885,
1143
+ "step": 630
1144
+ },
1145
+ {
1146
+ "epoch": 2.95,
1147
+ "grad_norm": 2.09375,
1148
+ "learning_rate": 1.976423537605237e-06,
1149
+ "log_odds_chosen": 0.8931509256362915,
1150
+ "log_odds_ratio": -0.40087467432022095,
1151
+ "logits/chosen": 3.574153423309326,
1152
+ "logits/rejected": 4.537802219390869,
1153
+ "logps/chosen": -0.6440940499305725,
1154
+ "logps/rejected": -1.088226556777954,
1155
+ "loss": 0.5846,
1156
+ "nll_loss": 0.5598152875900269,
1157
+ "rewards/accuracies": 0.8999999761581421,
1158
+ "rewards/chosen": -0.032204702496528625,
1159
+ "rewards/margins": 0.0222066268324852,
1160
+ "rewards/rejected": -0.054411329329013824,
1161
+ "step": 640
1162
+ },
1163
+ {
1164
+ "epoch": 3.0,
1165
+ "grad_norm": 2.5625,
1166
+ "learning_rate": 1.961161351381841e-06,
1167
+ "log_odds_chosen": 1.2053475379943848,
1168
+ "log_odds_ratio": -0.430248886346817,
1169
+ "logits/chosen": 2.245370388031006,
1170
+ "logits/rejected": 3.5309462547302246,
1171
+ "logps/chosen": -0.5642444491386414,
1172
+ "logps/rejected": -0.9910544157028198,
1173
+ "loss": 0.5605,
1174
+ "nll_loss": 0.45886915922164917,
1175
+ "rewards/accuracies": 0.699999988079071,
1176
+ "rewards/chosen": -0.028212225064635277,
1177
+ "rewards/margins": 0.021340493112802505,
1178
+ "rewards/rejected": -0.04955272004008293,
1179
+ "step": 650
1180
+ },
1181
+ {
1182
+ "epoch": 3.0,
1183
+ "step": 651,
1184
+ "total_flos": 0.0,
1185
+ "train_loss": 0.812556631554107,
1186
+ "train_runtime": 4771.9621,
1187
+ "train_samples_per_second": 4.358,
1188
+ "train_steps_per_second": 0.136
1189
+ }
1190
+ ],
1191
+ "logging_steps": 10,
1192
+ "max_steps": 651,
1193
+ "num_input_tokens_seen": 0,
1194
+ "num_train_epochs": 3,
1195
+ "save_steps": 500,
1196
+ "total_flos": 0.0,
1197
+ "train_batch_size": 1,
1198
+ "trial_name": null,
1199
+ "trial_params": null
1200
+ }