LoneStriker committed
Commit 085393e
1 Parent(s): cbd9fcb

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,98 @@
+ # Model Card for functionary-small-v2.2
+
+ [https://github.com/MeetKai/functionary](https://github.com/MeetKai/functionary)
+
+ ![Functionary Logo](https://huggingface.co/meetkai/functionary-medium-v2.2/resolve/main/functionary_logo.jpg "Functionary Logo")
+
+ Functionary is a language model that can interpret and execute functions/plugins.
+
+ The model determines when to execute functions, whether in parallel or serially, and can understand their outputs. It only triggers functions as needed. Function definitions are given as JSON Schema Objects, similar to OpenAI GPT function calls.
+
+ ## Key Features
+
+ - Intelligent **parallel tool use**
+ - Able to analyze function/tool outputs and provide relevant responses **grounded in the outputs**
+ - Able to decide **when not to use tools/call functions** and provide a normal chat response
+ - Truly one of the best open-source alternatives to GPT-4
+
+ ## Performance
+
+ Our model achieves state-of-the-art performance in Function Calling Accuracy on our in-house dataset. The accuracy metric measures the overall correctness of predicted function calls, including function name prediction and arguments extraction.
+
+ ![Eval Chart](https://huggingface.co/meetkai/functionary-medium-v2.2/resolve/main/evaluation_chart.jpeg "Eval Chart")
+
+ | Dataset | Model Name | Function Calling Accuracy (Name & Arguments) |
+ | :-------------| :-------------------| ---------------------------: |
+ | In-house data | MeetKai-functionary-small-v2.2 | 0.546 |
+ | In-house data | MeetKai-functionary-medium-v2.2 | **0.664** |
+ | In-house data | OpenAI-gpt-3.5-turbo-1106 | 0.531 |
+ | In-house data | OpenAI-gpt-4-1106-preview | **0.737** |
+
+ ## Prompt Template
+
+ We use a specially designed prompt template, which we call "v2PromptTemplate", that breaks down each turn into from, recipient, and content portions.
+
+ We convert function definitions to text similar to TypeScript definitions. Then we inject these definitions as system prompts. After that, we inject the default system prompt. Finally, we add the conversation messages.
+
+ This formatting is also available via our vLLM server, which processes the functions into TypeScript definitions encapsulated in a system message and uses a pre-defined Transformers chat template. This means that lists of messages can be formatted for you with the apply_chat_template() method within our server:
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="functionary")
+
+ client.chat.completions.create(
+     model="path/to/functionary/model/",
+     messages=[{"role": "user",
+                "content": "What is the weather for Istanbul?"}
+     ],
+     tools=[{
+         "type": "function",
+         "function": {
+             "name": "get_current_weather",
+             "description": "Get the current weather",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "location": {
+                         "type": "string",
+                         "description": "The city and state, e.g. San Francisco, CA"
+                     }
+                 },
+                 "required": ["location"]
+             }
+         }
+     }],
+     tool_choice="auto"
+ )
+ ```
+
+ will yield:
+
+ ```
+ <|from|>system
+ <|recipient|>all
+ <|content|>// Supported function definitions that should be called when necessary.
+ namespace functions {
+ // Get the current weather
+ type get_current_weather = (_: {
+ // The city and state, e.g. San Francisco, CA
+ location: string,
+ }) => any;
+ } // namespace functions
+ <|from|>system
+ <|recipient|>all
+ <|content|>A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary
+ <|from|>user
+ <|recipient|>all
+ <|content|>What is the weather for Istanbul?
+ ```
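+
+ If you decode raw completions yourself rather than going through the server, each turn can be recovered from the `<|from|>`/`<|recipient|>`/`<|content|>` markers. Below is a minimal illustrative sketch (not part of the Functionary library); a recipient other than `all` names the function being called, and the content then holds its JSON arguments:
+
+ ```python
+ def parse_v2_turns(raw: str):
+     """Split raw v2-format output into (sender, recipient, content) turns."""
+     turns = []
+     for block in raw.split("<|from|>")[1:]:
+         sender, _, rest = block.partition("\n<|recipient|>")
+         recipient, _, content = rest.partition("\n<|content|>")
+         # <|stop|> marks the end of an assistant turn; strip it from the content.
+         turns.append((sender.strip(), recipient.strip(),
+                       content.replace("<|stop|>", "").strip()))
+     return turns
+ ```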
+
+ A more detailed example is provided [here](https://github.com/MeetKai/functionary/blob/main/tests/prompt_test_v2.txt).
+
+ ## Run the model
+
+ We encourage users to run our models using our OpenAI-compatible vLLM server [here](https://github.com/MeetKai/functionary).
+
+ # The MeetKai Team
+ ![MeetKai Logo](https://huggingface.co/meetkai/functionary-medium-v2.2/resolve/main/meetkai_logo.png "MeetKai Logo")
added_tokens.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "<|content|>": 32000,
+   "<|from|>": 32002,
+   "<|recipient|>": 32001,
+   "<|stop|>": 32003
+ }
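
The four added tokens extend the base Mistral vocabulary (32,000 entries) to the `vocab_size` of 32004 declared in `config.json` below. A quick illustrative check of the mapping (the checkpoint id is assumed from the model card above; a local path with this commit's tokenizer files also works):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meetkai/functionary-small-v2.2")

# Should print [32000, 32001, 32002, 32003] per added_tokens.json.
print(tok.convert_tokens_to_ids(["<|content|>", "<|recipient|>", "<|from|>", "<|stop|>"]))
```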
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "/workspace/Mistral-7B-v0.1",
+   "architectures": [
+     "MistralForCausalLM"
+   ],
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 14336,
+   "max_position_embeddings": 32768,
+   "model_type": "mistral",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 8,
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 10000.0,
+   "sliding_window": 8192,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.35.2",
+   "use_cache": false,
+   "vocab_size": 32004
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "transformers_version": "4.35.2"
+ }
latest ADDED
@@ -0,0 +1 @@
+ global_step391
model.safetensors.index.json ADDED
@@ -0,0 +1,298 @@
+ {
+   "metadata": {
+     "total_size": 14483529728
+   },
+   "weight_map": {
+     "lm_head.weight": "model-00003-of-00003.safetensors",
+     "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.2.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.20.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.22.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.23.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.3.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.30.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.4.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.norm.weight": "model-00003-of-00003.safetensors"
+   }
+ }
output.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fcfd1413a04fa390605aad7186f02f7ff799a3fc3c822ba8025cd66061801d85
+ size 7371566888
rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0c8da2112cac9e41d45713c72447917295b3b059729ed671987d5dcc0fb0c254
+ size 17655
rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b374776577680bc377adbabdd5860eb87275acd44e79acd7aedb841f06959181
+ size 17655
rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b059c65d5a97ad439710b18ba043862056570756ac73a43a78d4a434c77e90e
+ size 17655
rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f8faf6721e879364fdad563b5bb73d12b66a5be82ac89c27333523f642e0a24f
+ size 17655
scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e85c80a80a6bbd3759b33cade7c3b5dddde6bbf6cd23883d7b59c8506e144f98
+ size 627
special_tokens_map.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "additional_special_tokens": [
+     {
+       "content": "<|content|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|recipient|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|from|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|stop|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     }
+   ],
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "</s>",
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dadfd56d766715c61d2ef780a525ab43b8e6da4de6865bda3d95fdef5e134055
+ size 493443
tokenizer_config.json ADDED
@@ -0,0 +1,79 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32000": {
+       "content": "<|content|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32001": {
+       "content": "<|recipient|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32002": {
+       "content": "<|from|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32003": {
+       "content": "<|stop|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|content|>",
+     "<|recipient|>",
+     "<|from|>",
+     "<|stop|>"
+   ],
+   "bos_token": "<s>",
+   "chat_template": "{% for message in messages %}\n{% if message['role'] == 'user' or message['role'] == 'system' %}\n{{ '<|from|>' + message['role'] + '\n<|recipient|>all\n<|content|>' + message['content'] + '\n' }}{% elif message['role'] == 'tool' %}\n{{ '<|from|>' + message['name'] + '\n<|recipient|>all\n<|content|>' + message['content'] + '\n' }}{% else %}\n{% set contain_content='no'%}\n{% if message['content'] is not none %}\n{{ '<|from|>assistant\n<|recipient|>all\n<|content|>' + message['content'] }}{% set contain_content='yes'%}\n{% endif %}\n{% if 'tool_calls' in message and message['tool_calls'] is not none %}\n{% for tool_call in message['tool_calls'] %}\n{% set prompt='<|from|>assistant\n<|recipient|>' + tool_call['function']['name'] + '\n<|content|>' + tool_call['function']['arguments'] %}\n{% if loop.index == 1 and contain_content == \"no\" %}\n{{ prompt }}{% else %}\n{{ '\n' + prompt}}{% endif %}\n{% endfor %}\n{% endif %}\n{{ '<|stop|>\n' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ '<|from|>assistant\n<|recipient|>' }}{% endif %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "model_max_length": 8192,
+   "pad_token": "</s>",
+   "padding_side": "left",
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
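
The `chat_template` above is what drives `apply_chat_template()`: user, system, and tool turns are rendered as `<|from|>...`/`<|recipient|>...`/`<|content|>...` blocks, and the generation prompt ends with `<|from|>assistant\n<|recipient|>`. A minimal sketch of rendering a conversation with it (checkpoint id assumed from the model card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meetkai/functionary-small-v2.2")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather for Istanbul?"},
]

# tokenize=False returns the rendered prompt string; add_generation_prompt
# appends "<|from|>assistant\n<|recipient|>" so the model continues from there.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```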
trainer_state.json ADDED
@@ -0,0 +1,2455 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 1.0,
5
+ "eval_steps": 75,
6
+ "global_step": 391,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.0,
13
+ "learning_rate": 7.5e-07,
14
+ "loss": 1.0384,
15
+ "step": 1
16
+ },
17
+ {
18
+ "epoch": 0.01,
19
+ "learning_rate": 1.5e-06,
20
+ "loss": 1.0344,
21
+ "step": 2
22
+ },
23
+ {
24
+ "epoch": 0.01,
25
+ "learning_rate": 2.25e-06,
26
+ "loss": 0.8877,
27
+ "step": 3
28
+ },
29
+ {
30
+ "epoch": 0.01,
31
+ "learning_rate": 3e-06,
32
+ "loss": 0.9107,
33
+ "step": 4
34
+ },
35
+ {
36
+ "epoch": 0.01,
37
+ "learning_rate": 3.75e-06,
38
+ "loss": 0.8865,
39
+ "step": 5
40
+ },
41
+ {
42
+ "epoch": 0.02,
43
+ "learning_rate": 4.5e-06,
44
+ "loss": 1.0755,
45
+ "step": 6
46
+ },
47
+ {
48
+ "epoch": 0.02,
49
+ "learning_rate": 5.2500000000000006e-06,
50
+ "loss": 0.9334,
51
+ "step": 7
52
+ },
53
+ {
54
+ "epoch": 0.02,
55
+ "learning_rate": 6e-06,
56
+ "loss": 0.8215,
57
+ "step": 8
58
+ },
59
+ {
60
+ "epoch": 0.02,
61
+ "learning_rate": 6.75e-06,
62
+ "loss": 0.9601,
63
+ "step": 9
64
+ },
65
+ {
66
+ "epoch": 0.03,
67
+ "learning_rate": 7.5e-06,
68
+ "loss": 0.7402,
69
+ "step": 10
70
+ },
71
+ {
72
+ "epoch": 0.03,
73
+ "learning_rate": 8.25e-06,
74
+ "loss": 0.8968,
75
+ "step": 11
76
+ },
77
+ {
78
+ "epoch": 0.03,
79
+ "learning_rate": 9e-06,
80
+ "loss": 0.7669,
81
+ "step": 12
82
+ },
83
+ {
84
+ "epoch": 0.03,
85
+ "learning_rate": 8.999845402895058e-06,
86
+ "loss": 0.7714,
87
+ "step": 13
88
+ },
89
+ {
90
+ "epoch": 0.04,
91
+ "learning_rate": 8.999381622202572e-06,
92
+ "loss": 0.6609,
93
+ "step": 14
94
+ },
95
+ {
96
+ "epoch": 0.04,
97
+ "learning_rate": 8.998608689788832e-06,
98
+ "loss": 0.8098,
99
+ "step": 15
100
+ },
101
+ {
102
+ "epoch": 0.04,
103
+ "learning_rate": 8.997526658761886e-06,
104
+ "loss": 0.8355,
105
+ "step": 16
106
+ },
107
+ {
108
+ "epoch": 0.04,
109
+ "learning_rate": 8.996135603467899e-06,
110
+ "loss": 0.8318,
111
+ "step": 17
112
+ },
113
+ {
114
+ "epoch": 0.05,
115
+ "learning_rate": 8.994435619486036e-06,
116
+ "loss": 0.9067,
117
+ "step": 18
118
+ },
119
+ {
120
+ "epoch": 0.05,
121
+ "learning_rate": 8.992426823621897e-06,
122
+ "loss": 0.7113,
123
+ "step": 19
124
+ },
125
+ {
126
+ "epoch": 0.05,
127
+ "learning_rate": 8.99010935389949e-06,
128
+ "loss": 0.804,
129
+ "step": 20
130
+ },
131
+ {
132
+ "epoch": 0.05,
133
+ "learning_rate": 8.987483369551757e-06,
134
+ "loss": 0.776,
135
+ "step": 21
136
+ },
137
+ {
138
+ "epoch": 0.06,
139
+ "learning_rate": 8.984549051009623e-06,
140
+ "loss": 0.8339,
141
+ "step": 22
142
+ },
143
+ {
144
+ "epoch": 0.06,
145
+ "learning_rate": 8.981306599889595e-06,
146
+ "loss": 0.833,
147
+ "step": 23
148
+ },
149
+ {
150
+ "epoch": 0.06,
151
+ "learning_rate": 8.977756238979921e-06,
152
+ "loss": 0.7614,
153
+ "step": 24
154
+ },
155
+ {
156
+ "epoch": 0.06,
157
+ "learning_rate": 8.973898212225277e-06,
158
+ "loss": 0.7553,
159
+ "step": 25
160
+ },
161
+ {
162
+ "epoch": 0.07,
163
+ "learning_rate": 8.969732784710005e-06,
164
+ "loss": 0.6437,
165
+ "step": 26
166
+ },
167
+ {
168
+ "epoch": 0.07,
169
+ "learning_rate": 8.965260242639897e-06,
170
+ "loss": 0.7507,
171
+ "step": 27
172
+ },
173
+ {
174
+ "epoch": 0.07,
175
+ "learning_rate": 8.960480893322532e-06,
176
+ "loss": 0.722,
177
+ "step": 28
178
+ },
179
+ {
180
+ "epoch": 0.07,
181
+ "learning_rate": 8.955395065146165e-06,
182
+ "loss": 0.775,
183
+ "step": 29
184
+ },
185
+ {
186
+ "epoch": 0.08,
187
+ "learning_rate": 8.950003107557156e-06,
188
+ "loss": 0.6426,
189
+ "step": 30
190
+ },
191
+ {
192
+ "epoch": 0.08,
193
+ "learning_rate": 8.944305391035962e-06,
194
+ "loss": 0.5657,
195
+ "step": 31
196
+ },
197
+ {
198
+ "epoch": 0.08,
199
+ "learning_rate": 8.938302307071689e-06,
200
+ "loss": 0.7168,
201
+ "step": 32
202
+ },
203
+ {
204
+ "epoch": 0.08,
205
+ "learning_rate": 8.93199426813518e-06,
206
+ "loss": 0.6611,
207
+ "step": 33
208
+ },
209
+ {
210
+ "epoch": 0.09,
211
+ "learning_rate": 8.92538170765068e-06,
212
+ "loss": 0.5408,
213
+ "step": 34
214
+ },
215
+ {
216
+ "epoch": 0.09,
217
+ "learning_rate": 8.918465079966063e-06,
218
+ "loss": 0.7239,
219
+ "step": 35
220
+ },
221
+ {
222
+ "epoch": 0.09,
223
+ "learning_rate": 8.9112448603216e-06,
224
+ "loss": 0.7005,
225
+ "step": 36
226
+ },
227
+ {
228
+ "epoch": 0.09,
229
+ "learning_rate": 8.90372154481732e-06,
230
+ "loss": 0.5097,
231
+ "step": 37
232
+ },
233
+ {
234
+ "epoch": 0.1,
235
+ "learning_rate": 8.895895650378904e-06,
236
+ "loss": 0.584,
237
+ "step": 38
238
+ },
239
+ {
240
+ "epoch": 0.1,
241
+ "learning_rate": 8.887767714722187e-06,
242
+ "loss": 0.7294,
243
+ "step": 39
244
+ },
245
+ {
246
+ "epoch": 0.1,
247
+ "learning_rate": 8.879338296316203e-06,
248
+ "loss": 0.7048,
249
+ "step": 40
250
+ },
251
+ {
252
+ "epoch": 0.1,
253
+ "learning_rate": 8.870607974344808e-06,
254
+ "loss": 0.7611,
255
+ "step": 41
256
+ },
257
+ {
258
+ "epoch": 0.11,
259
+ "learning_rate": 8.861577348666893e-06,
260
+ "loss": 0.5931,
261
+ "step": 42
262
+ },
263
+ {
264
+ "epoch": 0.11,
265
+ "learning_rate": 8.852247039775163e-06,
266
+ "loss": 0.7365,
267
+ "step": 43
268
+ },
269
+ {
270
+ "epoch": 0.11,
271
+ "learning_rate": 8.842617688753502e-06,
272
+ "loss": 0.6636,
273
+ "step": 44
274
+ },
275
+ {
276
+ "epoch": 0.12,
277
+ "learning_rate": 8.832689957232929e-06,
278
+ "loss": 0.6412,
279
+ "step": 45
280
+ },
281
+ {
282
+ "epoch": 0.12,
283
+ "learning_rate": 8.822464527346134e-06,
284
+ "loss": 0.6989,
285
+ "step": 46
286
+ },
287
+ {
288
+ "epoch": 0.12,
289
+ "learning_rate": 8.81194210168061e-06,
290
+ "loss": 0.7198,
291
+ "step": 47
292
+ },
293
+ {
294
+ "epoch": 0.12,
295
+ "learning_rate": 8.801123403230374e-06,
296
+ "loss": 0.6605,
297
+ "step": 48
298
+ },
299
+ {
300
+ "epoch": 0.13,
301
+ "learning_rate": 8.7900091753463e-06,
302
+ "loss": 0.6546,
303
+ "step": 49
304
+ },
305
+ {
306
+ "epoch": 0.13,
307
+ "learning_rate": 8.778600181685031e-06,
308
+ "loss": 0.7965,
309
+ "step": 50
310
+ },
311
+ {
312
+ "epoch": 0.13,
313
+ "learning_rate": 8.766897206156524e-06,
314
+ "loss": 0.6213,
315
+ "step": 51
316
+ },
317
+ {
318
+ "epoch": 0.13,
319
+ "learning_rate": 8.754901052870166e-06,
320
+ "loss": 0.6297,
321
+ "step": 52
322
+ },
323
+ {
324
+ "epoch": 0.14,
325
+ "learning_rate": 8.742612546079549e-06,
326
+ "loss": 0.78,
327
+ "step": 53
328
+ },
329
+ {
330
+ "epoch": 0.14,
331
+ "learning_rate": 8.730032530125813e-06,
332
+ "loss": 0.6817,
333
+ "step": 54
334
+ },
335
+ {
336
+ "epoch": 0.14,
337
+ "learning_rate": 8.717161869379647e-06,
338
+ "loss": 0.6632,
339
+ "step": 55
340
+ },
341
+ {
342
+ "epoch": 0.14,
343
+ "learning_rate": 8.70400144818189e-06,
344
+ "loss": 0.5498,
345
+ "step": 56
346
+ },
347
+ {
348
+ "epoch": 0.15,
349
+ "learning_rate": 8.690552170782772e-06,
350
+ "loss": 0.6081,
351
+ "step": 57
352
+ },
353
+ {
354
+ "epoch": 0.15,
355
+ "learning_rate": 8.676814961279782e-06,
356
+ "loss": 0.4716,
357
+ "step": 58
358
+ },
359
+ {
360
+ "epoch": 0.15,
361
+ "learning_rate": 8.662790763554175e-06,
362
+ "loss": 0.5989,
363
+ "step": 59
364
+ },
365
+ {
366
+ "epoch": 0.15,
367
+ "learning_rate": 8.64848054120611e-06,
368
+ "loss": 0.535,
369
+ "step": 60
370
+ },
371
+ {
372
+ "epoch": 0.16,
373
+ "learning_rate": 8.633885277488455e-06,
374
+ "loss": 0.5284,
375
+ "step": 61
376
+ },
377
+ {
378
+ "epoch": 0.16,
379
+ "learning_rate": 8.619005975239218e-06,
380
+ "loss": 0.7458,
381
+ "step": 62
382
+ },
383
+ {
384
+ "epoch": 0.16,
385
+ "learning_rate": 8.603843656812642e-06,
386
+ "loss": 0.6758,
387
+ "step": 63
388
+ },
389
+ {
390
+ "epoch": 0.16,
391
+ "learning_rate": 8.588399364008963e-06,
392
+ "loss": 0.6736,
393
+ "step": 64
394
+ },
395
+ {
396
+ "epoch": 0.17,
397
+ "learning_rate": 8.57267415800283e-06,
398
+ "loss": 0.4975,
399
+ "step": 65
400
+ },
401
+ {
402
+ "epoch": 0.17,
403
+ "learning_rate": 8.556669119270387e-06,
404
+ "loss": 0.5586,
405
+ "step": 66
406
+ },
407
+ {
408
+ "epoch": 0.17,
409
+ "learning_rate": 8.540385347515032e-06,
410
+ "loss": 0.6974,
411
+ "step": 67
412
+ },
413
+ {
414
+ "epoch": 0.17,
415
+ "learning_rate": 8.523823961591867e-06,
416
+ "loss": 0.6778,
417
+ "step": 68
418
+ },
419
+ {
420
+ "epoch": 0.18,
421
+ "learning_rate": 8.506986099430807e-06,
422
+ "loss": 0.6791,
423
+ "step": 69
424
+ },
425
+ {
426
+ "epoch": 0.18,
427
+ "learning_rate": 8.48987291795841e-06,
428
+ "loss": 0.5207,
429
+ "step": 70
430
+ },
431
+ {
432
+ "epoch": 0.18,
433
+ "learning_rate": 8.472485593018366e-06,
434
+ "loss": 0.5651,
435
+ "step": 71
436
+ },
437
+ {
438
+ "epoch": 0.18,
439
+ "learning_rate": 8.45482531929072e-06,
440
+ "loss": 0.6487,
441
+ "step": 72
442
+ },
443
+ {
444
+ "epoch": 0.19,
445
+ "learning_rate": 8.436893310209779e-06,
446
+ "loss": 0.6202,
447
+ "step": 73
448
+ },
449
+ {
450
+ "epoch": 0.19,
451
+ "learning_rate": 8.418690797880737e-06,
452
+ "loss": 0.5263,
453
+ "step": 74
454
+ },
455
+ {
456
+ "epoch": 0.19,
457
+ "learning_rate": 8.400219032995022e-06,
458
+ "loss": 0.7257,
459
+ "step": 75
460
+ },
461
+ {
462
+ "epoch": 0.19,
463
+ "eval_accuracy": 0.814565802103578,
464
+ "eval_accuracy_<|content|>": 0.9681089145841104,
465
+ "eval_accuracy_<|from|>": 0.9962073324905183,
466
+ "eval_accuracy_<|recipient|>": 0.9987357774968394,
467
+ "eval_accuracy_<|stop|>": 0.8882521489971347,
468
+ "eval_accuracy_total_num_<|content|>": 5362,
469
+ "eval_accuracy_total_num_<|from|>": 791,
470
+ "eval_accuracy_total_num_<|recipient|>": 791,
471
+ "eval_accuracy_total_num_<|stop|>": 4537,
472
+ "eval_loss": 0.618374228477478,
473
+ "eval_perplexity": 1.0682635492781798,
474
+ "eval_runtime": 290.3169,
475
+ "eval_samples_per_second": 4.736,
476
+ "eval_steps_per_second": 0.148,
477
+ "step": 75
478
+ },
479
+ {
480
+ "epoch": 0.19,
481
+ "learning_rate": 8.381479284744354e-06,
482
+ "loss": 0.6557,
483
+ "step": 76
484
+ },
485
+ {
486
+ "epoch": 0.2,
487
+ "learning_rate": 8.362472840733548e-06,
488
+ "loss": 0.6288,
489
+ "step": 77
490
+ },
491
+ {
492
+ "epoch": 0.2,
493
+ "learning_rate": 8.343201006892032e-06,
494
+ "loss": 0.5379,
495
+ "step": 78
496
+ },
497
+ {
498
+ "epoch": 0.2,
499
+ "learning_rate": 8.323665107384127e-06,
500
+ "loss": 0.6197,
501
+ "step": 79
502
+ },
503
+ {
504
+ "epoch": 0.2,
505
+ "learning_rate": 8.303866484518059e-06,
506
+ "loss": 0.5107,
507
+ "step": 80
508
+ },
509
+ {
510
+ "epoch": 0.21,
511
+ "learning_rate": 8.283806498653725e-06,
512
+ "loss": 0.67,
513
+ "step": 81
514
+ },
515
+ {
516
+ "epoch": 0.21,
517
+ "learning_rate": 8.263486528109237e-06,
518
+ "loss": 0.6076,
519
+ "step": 82
520
+ },
521
+ {
522
+ "epoch": 0.21,
523
+ "learning_rate": 8.242907969066198e-06,
524
+ "loss": 0.6506,
525
+ "step": 83
526
+ },
527
+ {
528
+ "epoch": 0.21,
529
+ "learning_rate": 8.22207223547379e-06,
530
+ "loss": 0.6114,
531
+ "step": 84
532
+ },
533
+ {
534
+ "epoch": 0.22,
535
+ "learning_rate": 8.200980758951609e-06,
536
+ "loss": 0.4775,
537
+ "step": 85
538
+ },
539
+ {
540
+ "epoch": 0.22,
541
+ "learning_rate": 8.179634988691303e-06,
542
+ "loss": 0.6625,
543
+ "step": 86
544
+ },
545
+ {
546
+ "epoch": 0.22,
547
+ "learning_rate": 8.158036391357e-06,
548
+ "loss": 0.5161,
549
+ "step": 87
550
+ },
551
+ {
552
+ "epoch": 0.23,
553
+ "learning_rate": 8.136186450984527e-06,
554
+ "loss": 0.4956,
555
+ "step": 88
556
+ },
557
+ {
558
+ "epoch": 0.23,
559
+ "learning_rate": 8.114086668879454e-06,
560
+ "loss": 0.6379,
561
+ "step": 89
562
+ },
563
+ {
564
+ "epoch": 0.23,
565
+ "learning_rate": 8.09173856351393e-06,
566
+ "loss": 0.6708,
567
+ "step": 90
568
+ },
569
+ { "epoch": 0.23, "learning_rate": 8.069143670422347e-06, "loss": 0.5111, "step": 91 },
+ { "epoch": 0.24, "learning_rate": 8.046303542095846e-06, "loss": 0.5079, "step": 92 },
+ { "epoch": 0.24, "learning_rate": 8.02321974787563e-06, "loss": 0.5029, "step": 93 },
+ { "epoch": 0.24, "learning_rate": 7.999893873845152e-06, "loss": 0.4098, "step": 94 },
+ { "epoch": 0.24, "learning_rate": 7.976327522721114e-06, "loss": 0.4683, "step": 95 },
+ { "epoch": 0.25, "learning_rate": 7.952522313743371e-06, "loss": 0.5922, "step": 96 },
+ { "epoch": 0.25, "learning_rate": 7.928479882563648e-06, "loss": 0.4946, "step": 97 },
+ { "epoch": 0.25, "learning_rate": 7.904201881133171e-06, "loss": 0.4977, "step": 98 },
+ { "epoch": 0.25, "learning_rate": 7.879689977589154e-06, "loss": 0.6153, "step": 99 },
+ { "epoch": 0.26, "learning_rate": 7.85494585614019e-06, "loss": 0.5741, "step": 100 },
+ { "epoch": 0.26, "learning_rate": 7.829971216950513e-06, "loss": 0.4949, "step": 101 },
+ { "epoch": 0.26, "learning_rate": 7.804767776023202e-06, "loss": 0.5977, "step": 102 },
+ { "epoch": 0.26, "learning_rate": 7.779337265082256e-06, "loss": 0.6465, "step": 103 },
+ { "epoch": 0.27, "learning_rate": 7.753681431453614e-06, "loss": 0.6386, "step": 104 },
+ { "epoch": 0.27, "learning_rate": 7.727802037945105e-06, "loss": 0.6208, "step": 105 },
+ { "epoch": 0.27, "learning_rate": 7.70170086272531e-06, "loss": 0.6551, "step": 106 },
+ { "epoch": 0.27, "learning_rate": 7.675379699201395e-06, "loss": 0.662, "step": 107 },
+ { "epoch": 0.28, "learning_rate": 7.648840355895885e-06, "loss": 0.5964, "step": 108 },
+ { "epoch": 0.28, "learning_rate": 7.6220846563224e-06, "loss": 0.6025, "step": 109 },
+ { "epoch": 0.28, "learning_rate": 7.595114438860358e-06, "loss": 0.5055, "step": 110 },
+ { "epoch": 0.28, "learning_rate": 7.567931556628665e-06, "loss": 0.5483, "step": 111 },
+ { "epoch": 0.29, "learning_rate": 7.540537877358389e-06, "loss": 0.6338, "step": 112 },
+ { "epoch": 0.29, "learning_rate": 7.51293528326442e-06, "loss": 0.6748, "step": 113 },
+ { "epoch": 0.29, "learning_rate": 7.485125670916155e-06, "loss": 0.5733, "step": 114 },
+ { "epoch": 0.29, "learning_rate": 7.4571109511071714e-06, "loss": 0.6164, "step": 115 },
+ { "epoch": 0.3, "learning_rate": 7.428893048723952e-06, "loss": 0.6496, "step": 116 },
+ { "epoch": 0.3, "learning_rate": 7.400473902613611e-06, "loss": 0.6056, "step": 117 },
+ { "epoch": 0.3, "learning_rate": 7.371855465450694e-06, "loss": 0.4862, "step": 118 },
+ { "epoch": 0.3, "learning_rate": 7.343039703602988e-06, "loss": 0.5984, "step": 119 },
+ { "epoch": 0.31, "learning_rate": 7.314028596996431e-06, "loss": 0.5891, "step": 120 },
+ { "epoch": 0.31, "learning_rate": 7.284824138979066e-06, "loss": 0.5345, "step": 121 },
+ { "epoch": 0.31, "learning_rate": 7.255428336184075e-06, "loss": 0.7941, "step": 122 },
+ { "epoch": 0.31, "learning_rate": 7.2258432083919064e-06, "loss": 0.552, "step": 123 },
+ { "epoch": 0.32, "learning_rate": 7.196070788391497e-06, "loss": 0.599, "step": 124 },
+ { "epoch": 0.32, "learning_rate": 7.166113121840595e-06, "loss": 0.4727, "step": 125 },
+ { "epoch": 0.32, "learning_rate": 7.135972267125212e-06, "loss": 0.545, "step": 126 },
+ { "epoch": 0.32, "learning_rate": 7.1056502952181815e-06, "loss": 0.6106, "step": 127 },
+ { "epoch": 0.33, "learning_rate": 7.075149289536871e-06, "loss": 0.4726, "step": 128 },
+ { "epoch": 0.33, "learning_rate": 7.044471345800024e-06, "loss": 0.5416, "step": 129 },
+ { "epoch": 0.33, "learning_rate": 7.0136185718837685e-06, "loss": 0.6889, "step": 130 },
+ { "epoch": 0.34, "learning_rate": 6.9825930876767834e-06, "loss": 0.594, "step": 131 },
+ { "epoch": 0.34, "learning_rate": 6.951397024934641e-06, "loss": 0.5456, "step": 132 },
+ { "epoch": 0.34, "learning_rate": 6.920032527133334e-06, "loss": 0.4282, "step": 133 },
+ { "epoch": 0.34, "learning_rate": 6.888501749322002e-06, "loss": 0.6089, "step": 134 },
+ { "epoch": 0.35, "learning_rate": 6.856806857974848e-06, "loss": 0.6507, "step": 135 },
+ { "epoch": 0.35, "learning_rate": 6.824950030842293e-06, "loss": 0.6411, "step": 136 },
+ { "epoch": 0.35, "learning_rate": 6.792933456801339e-06, "loss": 0.5493, "step": 137 },
+ { "epoch": 0.35, "learning_rate": 6.760759335705163e-06, "loss": 0.4705, "step": 138 },
+ { "epoch": 0.36, "learning_rate": 6.728429878231978e-06, "loss": 0.4708, "step": 139 },
+ { "epoch": 0.36, "learning_rate": 6.695947305733131e-06, "loss": 0.5714, "step": 140 },
+ { "epoch": 0.36, "learning_rate": 6.6633138500804735e-06, "loss": 0.5466, "step": 141 },
+ { "epoch": 0.36, "learning_rate": 6.630531753513014e-06, "loss": 0.4242, "step": 142 },
+ { "epoch": 0.37, "learning_rate": 6.597603268482853e-06, "loss": 0.6997, "step": 143 },
+ { "epoch": 0.37, "learning_rate": 6.5645306575004145e-06, "loss": 0.6938, "step": 144 },
+ { "epoch": 0.37, "learning_rate": 6.531316192978991e-06, "loss": 0.7704, "step": 145 },
+ { "epoch": 0.37, "learning_rate": 6.497962157078611e-06, "loss": 0.5127, "step": 146 },
+ { "epoch": 0.38, "learning_rate": 6.4644708415492205e-06, "loss": 0.6113, "step": 147 },
+ { "epoch": 0.38, "learning_rate": 6.4308445475732315e-06, "loss": 0.563, "step": 148 },
+ { "epoch": 0.38, "learning_rate": 6.397085585607401e-06, "loss": 0.6154, "step": 149 },
+ { "epoch": 0.38, "learning_rate": 6.3631962752240746e-06, "loss": 0.5561, "step": 150 },
+ {
+ "epoch": 0.38,
+ "eval_accuracy": 0.819768782794211,
+ "eval_accuracy_<|content|>": 1.0,
+ "eval_accuracy_<|from|>": 0.9393173198482933,
+ "eval_accuracy_<|recipient|>": 1.0,
+ "eval_accuracy_<|stop|>": 0.8756887811329072,
+ "eval_accuracy_total_num_<|content|>": 5362,
+ "eval_accuracy_total_num_<|from|>": 791,
+ "eval_accuracy_total_num_<|recipient|>": 791,
+ "eval_accuracy_total_num_<|stop|>": 4537,
+ "eval_loss": 0.586061418056488,
+ "eval_perplexity": 1.0650794931769392,
+ "eval_runtime": 223.6772,
+ "eval_samples_per_second": 6.147,
+ "eval_steps_per_second": 0.192,
+ "step": 150
+ },
+ { "epoch": 0.39, "learning_rate": 6.32917894495182e-06, "loss": 0.4405, "step": 151 },
+ { "epoch": 0.39, "learning_rate": 6.295035932115428e-06, "loss": 0.3631, "step": 152 },
+ { "epoch": 0.39, "learning_rate": 6.260769582675315e-06, "loss": 0.6282, "step": 153 },
+ { "epoch": 0.39, "learning_rate": 6.226382251066333e-06, "loss": 0.7472, "step": 154 },
+ { "epoch": 0.4, "learning_rate": 6.191876300036003e-06, "loss": 0.5803, "step": 155 },
+ { "epoch": 0.4, "learning_rate": 6.1572541004821585e-06, "loss": 0.5861, "step": 156 },
+ { "epoch": 0.4, "learning_rate": 6.122518031290052e-06, "loss": 0.6237, "step": 157 },
+ { "epoch": 0.4, "learning_rate": 6.0876704791689e-06, "loss": 0.5705, "step": 158 },
+ { "epoch": 0.41, "learning_rate": 6.0527138384878885e-06, "loss": 0.6131, "step": 159 },
+ { "epoch": 0.41, "learning_rate": 6.017650511111662e-06, "loss": 0.6446, "step": 160 },
+ { "epoch": 0.41, "learning_rate": 5.98248290623529e-06, "loss": 0.5882, "step": 161 },
+ { "epoch": 0.41, "learning_rate": 5.947213440218726e-06, "loss": 0.5134, "step": 162 },
+ { "epoch": 0.42, "learning_rate": 5.911844536420788e-06, "loss": 0.6261, "step": 163 },
+ { "epoch": 0.42, "learning_rate": 5.8763786250326455e-06, "loss": 0.5609, "step": 164 },
+ { "epoch": 0.42, "learning_rate": 5.8408181429108425e-06, "loss": 0.5729, "step": 165 },
+ { "epoch": 0.42, "learning_rate": 5.805165533409863e-06, "loss": 0.5378, "step": 166 },
+ { "epoch": 0.43, "learning_rate": 5.769423246214248e-06, "loss": 0.5914, "step": 167 },
+ { "epoch": 0.43, "learning_rate": 5.733593737170271e-06, "loss": 0.5126, "step": 168 },
+ { "epoch": 0.43, "learning_rate": 5.6976794681172075e-06, "loss": 0.5309, "step": 169 },
+ { "epoch": 0.43, "learning_rate": 5.661682906718181e-06, "loss": 0.5544, "step": 170 },
+ { "epoch": 0.44, "learning_rate": 5.625606526290604e-06, "loss": 0.552, "step": 171 },
+ { "epoch": 0.44, "learning_rate": 5.5894528056362406e-06, "loss": 0.5787, "step": 172 },
+ { "epoch": 0.44, "learning_rate": 5.55322422887089e-06, "loss": 0.5153, "step": 173 },
+ { "epoch": 0.45, "learning_rate": 5.516923285253701e-06, "loss": 0.4472, "step": 174 },
+ { "epoch": 0.45, "learning_rate": 5.4805524690161325e-06, "loss": 0.6333, "step": 175 },
+ { "epoch": 0.45, "learning_rate": 5.444114279190586e-06, "loss": 0.4386, "step": 176 },
+ { "epoch": 0.45, "learning_rate": 5.407611219438685e-06, "loss": 0.613, "step": 177 },
+ { "epoch": 0.46, "learning_rate": 5.371045797879255e-06, "loss": 0.5707, "step": 178 },
+ { "epoch": 0.46, "learning_rate": 5.334420526915993e-06, "loss": 0.5866, "step": 179 },
+ { "epoch": 0.46, "learning_rate": 5.297737923064836e-06, "loss": 0.4922, "step": 180 },
+ { "epoch": 0.46, "learning_rate": 5.261000506781051e-06, "loss": 0.5826, "step": 181 },
+ { "epoch": 0.47, "learning_rate": 5.224210802286064e-06, "loss": 0.6741, "step": 182 },
+ { "epoch": 0.47, "learning_rate": 5.187371337394009e-06, "loss": 0.3813, "step": 183 },
+ { "epoch": 0.47, "learning_rate": 5.150484643338051e-06, "loss": 0.6125, "step": 184 },
+ { "epoch": 0.47, "learning_rate": 5.113553254596463e-06, "loss": 0.66, "step": 185 },
+ { "epoch": 0.48, "learning_rate": 5.076579708718481e-06, "loss": 0.5914, "step": 186 },
+ { "epoch": 0.48, "learning_rate": 5.039566546149946e-06, "loss": 0.4527, "step": 187 },
+ { "epoch": 0.48, "learning_rate": 5.002516310058766e-06, "loss": 0.5896, "step": 188 },
+ { "epoch": 0.48, "learning_rate": 4.965431546160153e-06, "loss": 0.4518, "step": 189 },
+ { "epoch": 0.49, "learning_rate": 4.928314802541726e-06, "loss": 0.5798, "step": 190 },
+ { "epoch": 0.49, "learning_rate": 4.891168629488419e-06, "loss": 0.5669, "step": 191 },
+ { "epoch": 0.49, "learning_rate": 4.853995579307262e-06, "loss": 0.6075, "step": 192 },
+ { "epoch": 0.49, "learning_rate": 4.816798206152006e-06, "loss": 0.6057, "step": 193 },
+ { "epoch": 0.5, "learning_rate": 4.77957906584763e-06, "loss": 0.5076, "step": 194 },
+ { "epoch": 0.5, "learning_rate": 4.742340715714727e-06, "loss": 0.5997, "step": 195 },
+ { "epoch": 0.5, "learning_rate": 4.705085714393797e-06, "loss": 0.5709, "step": 196 },
+ { "epoch": 0.5, "learning_rate": 4.667816621669442e-06, "loss": 0.5029, "step": 197 },
+ { "epoch": 0.51, "learning_rate": 4.630535998294477e-06, "loss": 0.5885, "step": 198 },
+ { "epoch": 0.51, "learning_rate": 4.5932464058139885e-06, "loss": 0.5776, "step": 199 },
+ { "epoch": 0.51, "learning_rate": 4.55595040638933e-06, "loss": 0.4949, "step": 200 },
+ { "epoch": 0.51, "learning_rate": 4.5186505626220725e-06, "loss": 0.4202, "step": 201 },
+ { "epoch": 0.52, "learning_rate": 4.48134943737793e-06, "loss": 0.5012, "step": 202 },
+ { "epoch": 0.52, "learning_rate": 4.444049593610671e-06, "loss": 0.613, "step": 203 },
+ { "epoch": 0.52, "learning_rate": 4.406753594186011e-06, "loss": 0.4806, "step": 204 },
+ { "epoch": 0.52, "learning_rate": 4.369464001705524e-06, "loss": 0.6567, "step": 205 },
+ { "epoch": 0.53, "learning_rate": 4.332183378330558e-06, "loss": 0.3533, "step": 206 },
+ { "epoch": 0.53, "learning_rate": 4.294914285606203e-06, "loss": 0.7252, "step": 207 },
+ { "epoch": 0.53, "learning_rate": 4.257659284285274e-06, "loss": 0.4848, "step": 208 },
+ { "epoch": 0.53, "learning_rate": 4.220420934152371e-06, "loss": 0.3598, "step": 209 },
+ { "epoch": 0.54, "learning_rate": 4.1832017938479936e-06, "loss": 0.5295, "step": 210 },
+ { "epoch": 0.54, "learning_rate": 4.146004420692739e-06, "loss": 0.5091, "step": 211 },
+ { "epoch": 0.54, "learning_rate": 4.108831370511581e-06, "loss": 0.5788, "step": 212 },
+ { "epoch": 0.54, "learning_rate": 4.071685197458274e-06, "loss": 0.462, "step": 213 },
+ { "epoch": 0.55, "learning_rate": 4.034568453839847e-06, "loss": 0.5832, "step": 214 },
+ { "epoch": 0.55, "learning_rate": 3.997483689941234e-06, "loss": 0.4365, "step": 215 },
+ { "epoch": 0.55, "learning_rate": 3.960433453850053e-06, "loss": 0.6247, "step": 216 },
+ { "epoch": 0.55, "learning_rate": 3.923420291281522e-06, "loss": 0.538, "step": 217 },
+ { "epoch": 0.56, "learning_rate": 3.886446745403538e-06, "loss": 0.5798, "step": 218 },
+ { "epoch": 0.56, "learning_rate": 3.849515356661949e-06, "loss": 0.568, "step": 219 },
+ { "epoch": 0.56, "learning_rate": 3.8126286626059916e-06, "loss": 0.4764, "step": 220 },
+ { "epoch": 0.57, "learning_rate": 3.7757891977139374e-06, "loss": 0.5134, "step": 221 },
+ { "epoch": 0.57, "learning_rate": 3.738999493218949e-06, "loss": 0.5493, "step": 222 },
+ { "epoch": 0.57, "learning_rate": 3.7022620769351665e-06, "loss": 0.5529, "step": 223 },
+ { "epoch": 0.57, "learning_rate": 3.665579473084008e-06, "loss": 0.4234, "step": 224 },
+ { "epoch": 0.58, "learning_rate": 3.6289542021207454e-06, "loss": 0.6216, "step": 225 },
+ {
+ "epoch": 0.58,
+ "eval_accuracy": 0.8257179536751563,
+ "eval_accuracy_<|content|>": 1.0,
+ "eval_accuracy_<|from|>": 0.9924146649810367,
+ "eval_accuracy_<|recipient|>": 1.0,
+ "eval_accuracy_<|stop|>": 0.9440158695173022,
+ "eval_accuracy_total_num_<|content|>": 5362,
+ "eval_accuracy_total_num_<|from|>": 791,
+ "eval_accuracy_total_num_<|recipient|>": 791,
+ "eval_accuracy_total_num_<|stop|>": 4537,
+ "eval_loss": 0.5591428279876709,
+ "eval_perplexity": 1.0623002333585039,
+ "eval_runtime": 223.7635,
+ "eval_samples_per_second": 6.145,
+ "eval_steps_per_second": 0.192,
+ "step": 225
+ },
+ { "epoch": 0.58, "learning_rate": 3.5923887805613165e-06, "loss": 0.6102, "step": 226 },
+ { "epoch": 0.58, "learning_rate": 3.5558857208094145e-06, "loss": 0.5378, "step": 227 },
+ { "epoch": 0.58, "learning_rate": 3.5194475309838677e-06, "loss": 0.6289, "step": 228 },
+ { "epoch": 0.59, "learning_rate": 3.4830767147463014e-06, "loss": 0.3814, "step": 229 },
+ { "epoch": 0.59, "learning_rate": 3.4467757711291108e-06, "loss": 0.5506, "step": 230 },
+ { "epoch": 0.59, "learning_rate": 3.4105471943637592e-06, "loss": 0.5942, "step": 231 },
+ { "epoch": 0.59, "learning_rate": 3.374393473709396e-06, "loss": 0.5092, "step": 232 },
+ { "epoch": 0.6, "learning_rate": 3.3383170932818184e-06, "loss": 0.5533, "step": 233 },
+ { "epoch": 0.6, "learning_rate": 3.302320531882792e-06, "loss": 0.454, "step": 234 },
+ { "epoch": 0.6, "learning_rate": 3.266406262829731e-06, "loss": 0.5039, "step": 235 },
+ { "epoch": 0.6, "learning_rate": 3.2305767537857534e-06, "loss": 0.5447, "step": 236 },
+ { "epoch": 0.61, "learning_rate": 3.194834466590136e-06, "loss": 0.4985, "step": 237 },
+ { "epoch": 0.61, "learning_rate": 3.1591818570891573e-06, "loss": 0.6181, "step": 238 },
+ { "epoch": 0.61, "learning_rate": 3.1236213749673555e-06, "loss": 0.5648, "step": 239 },
+ { "epoch": 0.61, "learning_rate": 3.0881554635792124e-06, "loss": 0.6169, "step": 240 },
+ { "epoch": 0.62, "learning_rate": 3.0527865597812755e-06, "loss": 0.6264, "step": 241 },
+ { "epoch": 0.62, "learning_rate": 3.017517093764711e-06, "loss": 0.5292, "step": 242 },
+ { "epoch": 0.62, "learning_rate": 2.982349488888337e-06, "loss": 0.5578, "step": 243 },
+ { "epoch": 0.62, "learning_rate": 2.9472861615121117e-06, "loss": 0.4992, "step": 244 },
+ { "epoch": 0.63, "learning_rate": 2.9123295208311007e-06, "loss": 0.5403, "step": 245 },
+ { "epoch": 0.63, "learning_rate": 2.877481968709948e-06, "loss": 0.5385, "step": 246 },
+ { "epoch": 0.63, "learning_rate": 2.842745899517843e-06, "loss": 0.4517, "step": 247 },
+ { "epoch": 0.63, "learning_rate": 2.8081236999639976e-06, "loss": 0.5326, "step": 248 },
+ { "epoch": 0.64, "learning_rate": 2.7736177489336662e-06, "loss": 0.4631, "step": 249 },
+ { "epoch": 0.64, "learning_rate": 2.739230417324686e-06, "loss": 0.5639, "step": 250 },
+ { "epoch": 0.64, "learning_rate": 2.7049640678845723e-06, "loss": 0.4956, "step": 251 },
+ { "epoch": 0.64, "learning_rate": 2.6708210550481794e-06, "loss": 0.5281, "step": 252 },
+ { "epoch": 0.65, "learning_rate": 2.636803724775927e-06, "loss": 0.2976, "step": 253 },
+ { "epoch": 0.65, "learning_rate": 2.6029144143926003e-06, "loss": 0.5906, "step": 254 },
+ { "epoch": 0.65, "learning_rate": 2.569155452426767e-06, "loss": 0.5625, "step": 255 },
+ { "epoch": 0.65, "learning_rate": 2.5355291584507814e-06, "loss": 0.5878, "step": 256 },
+ { "epoch": 0.66, "learning_rate": 2.5020378429213904e-06, "loss": 0.595, "step": 257 },
+ { "epoch": 0.66, "learning_rate": 2.4686838070210094e-06, "loss": 0.3659, "step": 258 },
+ { "epoch": 0.66, "learning_rate": 2.435469342499587e-06, "loss": 0.5602, "step": 259 },
+ { "epoch": 0.66, "learning_rate": 2.402396731517147e-06, "loss": 0.4733, "step": 260 },
+ { "epoch": 0.67, "learning_rate": 2.3694682464869856e-06, "loss": 0.5969, "step": 261 },
+ { "epoch": 0.67, "learning_rate": 2.336686149919527e-06, "loss": 0.4799, "step": 262 },
+ { "epoch": 0.67, "learning_rate": 2.3040526942668707e-06, "loss": 0.3591, "step": 263 },
+ { "epoch": 0.68, "learning_rate": 2.271570121768021e-06, "loss": 0.4472, "step": 264 },
+ { "epoch": 0.68, "learning_rate": 2.239240664294837e-06, "loss": 0.4683, "step": 265 },
+ { "epoch": 0.68, "learning_rate": 2.2070665431986617e-06, "loss": 0.4283, "step": 266 },
+ { "epoch": 0.68, "learning_rate": 2.1750499691577055e-06, "loss": 0.6666, "step": 267 },
+ { "epoch": 0.69, "learning_rate": 2.1431931420251543e-06, "loss": 0.5802, "step": 268 },
+ { "epoch": 0.69, "learning_rate": 2.1114982506779997e-06, "loss": 0.5939, "step": 269 },
+ { "epoch": 0.69, "learning_rate": 2.0799674728666665e-06, "loss": 0.5326, "step": 270 },
+ { "epoch": 0.69, "learning_rate": 2.0486029750653605e-06, "loss": 0.6165, "step": 271 },
+ { "epoch": 0.7, "learning_rate": 2.017406912323217e-06, "loss": 0.557, "step": 272 },
+ { "epoch": 0.7, "learning_rate": 1.986381428116232e-06, "loss": 0.5167, "step": 273 },
+ { "epoch": 0.7, "learning_rate": 1.9555286541999766e-06, "loss": 0.6048, "step": 274 },
+ { "epoch": 0.7, "learning_rate": 1.92485071046313e-06, "loss": 0.4407, "step": 275 },
+ { "epoch": 0.71, "learning_rate": 1.8943497047818179e-06, "loss": 0.6113, "step": 276 },
+ { "epoch": 0.71, "learning_rate": 1.864027732874788e-06, "loss": 0.7606, "step": 277 },
+ { "epoch": 0.71, "learning_rate": 1.8338868781594052e-06, "loss": 0.6256, "step": 278 },
+ { "epoch": 0.71, "learning_rate": 1.8039292116085027e-06, "loss": 0.6035, "step": 279 },
+ { "epoch": 0.72, "learning_rate": 1.7741567916080946e-06, "loss": 0.5717, "step": 280 },
+ { "epoch": 0.72, "learning_rate": 1.7445716638159248e-06, "loss": 0.475, "step": 281 },
+ { "epoch": 0.72, "learning_rate": 1.715175861020934e-06, "loss": 0.4888, "step": 282 },
+ { "epoch": 0.72, "learning_rate": 1.6859714030035694e-06, "loss": 0.5913, "step": 283 },
+ { "epoch": 0.73, "learning_rate": 1.6569602963970123e-06, "loss": 0.3934, "step": 284 },
+ { "epoch": 0.73, "learning_rate": 1.628144534549307e-06, "loss": 0.5487, "step": 285 },
+ { "epoch": 0.73, "learning_rate": 1.5995260973863885e-06, "loss": 0.494, "step": 286 },
+ { "epoch": 0.73, "learning_rate": 1.5711069512760497e-06, "loss": 0.4315, "step": 287 },
+ { "epoch": 0.74, "learning_rate": 1.5428890488928284e-06, "loss": 0.4848, "step": 288 },
+ { "epoch": 0.74, "learning_rate": 1.5148743290838455e-06, "loss": 0.4264, "step": 289 },
+ { "epoch": 0.74, "learning_rate": 1.4870647167355795e-06, "loss": 0.6395, "step": 290 },
+ { "epoch": 0.74, "learning_rate": 1.4594621226416103e-06, "loss": 0.5726, "step": 291 },
+ { "epoch": 0.75, "learning_rate": 1.4320684433713354e-06, "loss": 0.5543, "step": 292 },
+ { "epoch": 0.75, "learning_rate": 1.4048855611396424e-06, "loss": 0.6457, "step": 293 },
+ { "epoch": 0.75, "learning_rate": 1.3779153436776006e-06, "loss": 0.3915, "step": 294 },
+ { "epoch": 0.75, "learning_rate": 1.351159644104115e-06, "loss": 0.5187, "step": 295 },
+ { "epoch": 0.76, "learning_rate": 1.3246203007986048e-06, "loss": 0.6015, "step": 296 },
+ { "epoch": 0.76, "learning_rate": 1.298299137274692e-06, "loss": 0.6067, "step": 297 },
+ { "epoch": 0.76, "learning_rate": 1.2721979620548955e-06, "loss": 0.4431, "step": 298 },
+ { "epoch": 0.76, "learning_rate": 1.2463185685463859e-06, "loss": 0.3615, "step": 299 },
+ { "epoch": 0.77, "learning_rate": 1.220662734917746e-06, "loss": 0.5003, "step": 300 },
+ {
+ "epoch": 0.77,
+ "eval_accuracy": 0.8336343532396525,
+ "eval_accuracy_<|content|>": 1.0,
+ "eval_accuracy_<|from|>": 0.9873577749683944,
+ "eval_accuracy_<|recipient|>": 1.0,
+ "eval_accuracy_<|stop|>": 0.91359929468812,
+ "eval_accuracy_total_num_<|content|>": 5362,
+ "eval_accuracy_total_num_<|from|>": 791,
+ "eval_accuracy_total_num_<|recipient|>": 791,
+ "eval_accuracy_total_num_<|stop|>": 4537,
+ "eval_loss": 0.5285482406616211,
+ "eval_perplexity": 1.0587839466417142,
+ "eval_runtime": 223.59,
+ "eval_samples_per_second": 6.15,
+ "eval_steps_per_second": 0.192,
+ "step": 300
+ },
+ { "epoch": 0.77, "learning_rate": 1.1952322239767983e-06, "loss": 0.3402, "step": 301 },
+ { "epoch": 0.77, "learning_rate": 1.170028783049487e-06, "loss": 0.6256, "step": 302 },
+ { "epoch": 0.77, "learning_rate": 1.1450541438598118e-06, "loss": 0.5378, "step": 303 },
+ { "epoch": 0.78, "learning_rate": 1.1203100224108464e-06, "loss": 0.5334, "step": 304 },
+ { "epoch": 0.78, "learning_rate": 1.09579811886683e-06, "loss": 0.4257, "step": 305 },
+ { "epoch": 0.78, "learning_rate": 1.0715201174363525e-06, "loss": 0.4915, "step": 306 },
+ { "epoch": 0.79, "learning_rate": 1.0474776862566299e-06, "loss": 0.4102, "step": 307 },
+ { "epoch": 0.79, "learning_rate": 1.0236724772788846e-06, "loss": 0.3463, "step": 308 },
+ { "epoch": 0.79, "learning_rate": 1.00010612615485e-06, "loss": 0.5686, "step": 309 },
+ { "epoch": 0.79, "learning_rate": 9.76780252124369e-07, "loss": 0.5914, "step": 310 },
+ { "epoch": 0.8, "learning_rate": 9.536964579041548e-07, "loss": 0.6401, "step": 311 },
+ { "epoch": 0.8, "learning_rate": 9.308563295776531e-07, "loss": 0.5474, "step": 312 },
+ { "epoch": 0.8, "learning_rate": 9.082614364860701e-07, "loss": 0.4519, "step": 313 },
+ { "epoch": 0.8, "learning_rate": 8.859133311205453e-07, "loss": 0.5629, "step": 314 },
+ { "epoch": 0.81, "learning_rate": 8.638135490154735e-07, "loss": 0.4641, "step": 315 },
+ { "epoch": 0.81, "learning_rate": 8.419636086430022e-07, "loss": 0.4696, "step": 316 },
+ { "epoch": 0.81, "learning_rate": 8.203650113086972e-07, "loss": 0.538, "step": 317 },
+ { "epoch": 0.81, "learning_rate": 7.990192410483916e-07, "loss": 0.6092, "step": 318 },
+ { "epoch": 0.82, "learning_rate": 7.779277645262103e-07, "loss": 0.407, "step": 319 },
+ { "epoch": 0.82, "learning_rate": 7.570920309338014e-07, "loss": 0.5779, "step": 320 },
+ { "epoch": 0.82, "learning_rate": 7.365134718907647e-07, "loss": 0.5572, "step": 321 },
+ { "epoch": 0.82, "learning_rate": 7.161935013462746e-07, "loss": 0.6165, "step": 322 },
+ { "epoch": 0.83, "learning_rate": 6.961335154819422e-07, "loss": 0.5908, "step": 323 },
+ { "epoch": 0.83, "learning_rate": 6.763348926158732e-07, "loss": 0.5739, "step": 324 },
+ { "epoch": 0.83, "learning_rate": 6.567989931079675e-07, "loss": 0.5375, "step": 325 },
+ { "epoch": 0.83, "learning_rate": 6.375271592664525e-07, "loss": 0.5726, "step": 326 },
+ { "epoch": 0.84, "learning_rate": 6.18520715255646e-07, "loss": 0.5559, "step": 327 },
+ { "epoch": 0.84, "learning_rate": 5.997809670049795e-07, "loss": 0.5309, "step": 328 },
+ { "epoch": 0.84, "learning_rate": 5.813092021192639e-07, "loss": 0.5068, "step": 329 },
+ { "epoch": 0.84, "learning_rate": 5.631066897902227e-07, "loss": 0.4953, "step": 330 },
+ { "epoch": 0.85, "learning_rate": 5.451746807092811e-07, "loss": 0.5307, "step": 331 },
+ { "epoch": 0.85, "learning_rate": 5.275144069816338e-07, "loss": 0.5182, "step": 332 },
+ { "epoch": 0.85, "learning_rate": 5.101270820415908e-07, "loss": 0.5599, "step": 333 },
+ { "epoch": 0.85, "learning_rate": 4.930139005691915e-07, "loss": 0.4452, "step": 334 },
+ { "epoch": 0.86, "learning_rate": 4.761760384081335e-07, "loss": 0.5605, "step": 335 },
+ { "epoch": 0.86, "learning_rate": 4.596146524849682e-07, "loss": 0.3468, "step": 336 },
+ { "epoch": 0.86, "learning_rate": 4.433308807296139e-07, "loss": 0.5684, "step": 337 },
+ { "epoch": 0.86, "learning_rate": 4.2732584199717023e-07, "loss": 0.3672, "step": 338 },
+ { "epoch": 0.87, "learning_rate": 4.116006359910375e-07, "loss": 0.5228, "step": 339 },
+ { "epoch": 0.87, "learning_rate": 3.961563431873597e-07, "loss": 0.4777, "step": 340 },
+ { "epoch": 0.87, "learning_rate": 3.809940247607826e-07, "loss": 0.4455, "step": 341 },
+ { "epoch": 0.87, "learning_rate": 3.661147225115447e-07, "loss": 0.5601, "step": 342 },
+ { "epoch": 0.88, "learning_rate": 3.515194587938898e-07, "loss": 0.5627, "step": 343 },
+ { "epoch": 0.88, "learning_rate": 3.3720923644582554e-07, "loss": 0.4795, "step": 344 },
+ { "epoch": 0.88, "learning_rate": 3.23185038720218e-07, "loss": 0.481, "step": 345 },
+ { "epoch": 0.88, "learning_rate": 3.0944782921722787e-07, "loss": 0.4925, "step": 346 },
+ { "epoch": 0.89, "learning_rate": 2.959985518181108e-07, "loss": 0.481, "step": 347 },
+ { "epoch": 0.89, "learning_rate": 2.8283813062035433e-07, "loss": 0.462, "step": 348 },
+ { "epoch": 0.89, "learning_rate": 2.699674698741872e-07, "loss": 0.5985, "step": 349 },
+ { "epoch": 0.9, "learning_rate": 2.5738745392045136e-07, "loss": 0.4534, "step": 350 },
+ { "epoch": 0.9, "learning_rate": 2.4509894712983346e-07, "loss": 0.4966, "step": 351 },
+ { "epoch": 0.9, "learning_rate": 2.3310279384347738e-07, "loss": 0.4829, "step": 352 },
+ { "epoch": 0.9, "learning_rate": 2.2139981831496808e-07, "loss": 0.5734, "step": 353 },
+ { "epoch": 0.91, "learning_rate": 2.099908246537011e-07, "loss": 0.3882, "step": 354 },
+ { "epoch": 0.91, "learning_rate": 1.9887659676962695e-07, "loss": 0.508, "step": 355 },
+ { "epoch": 0.91, "learning_rate": 1.8805789831939146e-07, "loss": 0.3879, "step": 356 },
+ { "epoch": 0.91, "learning_rate": 1.7753547265386706e-07, "loss": 0.4153, "step": 357 },
+ { "epoch": 0.92, "learning_rate": 1.6731004276707185e-07, "loss": 0.5335, "step": 358 },
+ { "epoch": 0.92, "learning_rate": 1.5738231124649955e-07, "loss": 0.5547, "step": 359 },
+ { "epoch": 0.92, "learning_rate": 1.4775296022483897e-07, "loss": 0.4444, "step": 360 },
+ { "epoch": 0.92, "learning_rate": 1.3842265133310762e-07, "loss": 0.536, "step": 361 },
+ { "epoch": 0.93, "learning_rate": 1.2939202565519243e-07, "loss": 0.4738, "step": 362 },
+ { "epoch": 0.93, "learning_rate": 1.2066170368379763e-07, "loss": 0.4359, "step": 363 },
+ { "epoch": 0.93, "learning_rate": 1.1223228527781281e-07, "loss": 0.4132, "step": 364 },
+ { "epoch": 0.93, "learning_rate": 1.0410434962109633e-07, "loss": 0.514, "step": 365 },
+ { "epoch": 0.94, "learning_rate": 9.627845518268069e-08, "loss": 0.4843, "step": 366 },
+ { "epoch": 0.94, "learning_rate": 8.875513967839826e-08, "loss": 0.4589, "step": 367 },
+ { "epoch": 0.94, "learning_rate": 8.15349200339363e-08, "loss": 0.5233, "step": 368 },
+ { "epoch": 0.94, "learning_rate": 7.461829234931988e-08, "loss": 0.3706, "step": 369 },
+ { "epoch": 0.95, "learning_rate": 6.800573186482111e-08, "loss": 0.4884, "step": 370 },
+ { "epoch": 0.95, "learning_rate": 6.169769292831134e-08, "loss": 0.4554, "step": 371 },
+ { "epoch": 0.95, "learning_rate": 5.569460896403755e-08, "loss": 0.4819, "step": 372 },
+ { "epoch": 0.95, "learning_rate": 4.999689244284472e-08, "loss": 0.49, "step": 373 },
+ { "epoch": 0.96, "learning_rate": 4.4604934853834986e-08, "loss": 0.6322, "step": 374 },
+ { "epoch": 0.96, "learning_rate": 3.9519106677467664e-08, "loss": 0.4662, "step": 375 },
+ {
+ "epoch": 0.96,
+ "eval_accuracy": 0.835875467640784,
+ "eval_accuracy_<|content|>": 1.0,
+ "eval_accuracy_<|from|>": 0.9949431099873578,
+ "eval_accuracy_<|recipient|>": 1.0,
+ "eval_accuracy_<|stop|>": 0.9279259422525898,
+ "eval_accuracy_total_num_<|content|>": 5362,
+ "eval_accuracy_total_num_<|from|>": 791,
+ "eval_accuracy_total_num_<|recipient|>": 791,
+ "eval_accuracy_total_num_<|stop|>": 4537,
+ "eval_loss": 0.5191700458526611,
+ "eval_perplexity": 1.057695465829475,
+ "eval_runtime": 223.7003,
+ "eval_samples_per_second": 6.147,
+ "eval_steps_per_second": 0.192,
+ "step": 375
+ },
+ { "epoch": 0.96, "learning_rate": 3.4739757360103594e-08, "loss": 0.4331, "step": 376 },
+ { "epoch": 0.96, "learning_rate": 3.026721528999487e-08, "loss": 0.4685, "step": 377 },
+ { "epoch": 0.97, "learning_rate": 2.6101787774722885e-08, "loss": 0.479, "step": 378 },
+ { "epoch": 0.97, "learning_rate": 2.224376102007919e-08, "loss": 0.5054, "step": 379 },
+ { "epoch": 0.97, "learning_rate": 1.8693400110406267e-08, "loss": 0.5825, "step": 380 },
+ { "epoch": 0.97, "learning_rate": 1.5450948990378077e-08, "loss": 0.4259, "step": 381 },
+ { "epoch": 0.98, "learning_rate": 1.2516630448241973e-08, "loss": 0.4667, "step": 382 },
+ { "epoch": 0.98, "learning_rate": 9.890646100509937e-09, "loss": 0.5553, "step": 383 },
+ { "epoch": 0.98, "learning_rate": 7.573176378104674e-09, "loss": 0.3993, "step": 384 },
+ { "epoch": 0.98, "learning_rate": 5.564380513964518e-09, "loss": 0.4619, "step": 385 },
+ { "epoch": 0.99, "learning_rate": 3.864396532100689e-09, "loss": 0.4039, "step": 386 },
+ { "epoch": 0.99, "learning_rate": 2.4733412381143796e-09, "loss": 0.4518, "step": 387 },
+ { "epoch": 0.99, "learning_rate": 1.3913102111691722e-09, "loss": 0.4452, "step": 388 },
+ { "epoch": 0.99, "learning_rate": 6.183777974287929e-10, "loss": 0.3681, "step": 389 },
+ { "epoch": 1.0, "learning_rate": 1.5459710494220015e-10, "loss": 0.3691, "step": 390 },
+ { "epoch": 1.0, "learning_rate": 0.0, "loss": 0.3283, "step": 391 }
+ ],
+ "logging_steps": 1.0,
+ "max_steps": 391,
+ "num_train_epochs": 1,
+ "save_steps": 200.0,
+ "total_flos": 163734890741760.0,
+ "trial_name": null,
+ "trial_params": null
+ }
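
The eval records above show the run improving steadily: `eval_loss` falls from 0.586 at step 150 to 0.559, 0.529 and finally 0.519 at step 375, with `eval_perplexity` tracking it closely. A minimal sketch for pulling those numbers out programmatically, assuming this is a standard Hugging Face Trainer `trainer_state.json` whose records sit under the top-level `log_history` key:

```python
import json

# Hypothetical local path; point it at the trainer state file shown above.
with open("trainer_state.json") as f:
    state = json.load(f)

# Training records carry "loss"; evaluation records carry "eval_loss".
train_log = [r for r in state["log_history"] if "loss" in r]
eval_log = [r for r in state["log_history"] if "eval_loss" in r]

print(train_log[-1])  # {'epoch': 1.0, 'learning_rate': 0.0, 'loss': 0.3283, 'step': 391}
print([(r["step"], r["eval_loss"]) for r in eval_log])
# [(150, 0.586...), (225, 0.559...), (300, 0.528...), (375, 0.519...)]
```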
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9511740c3ca0b424f52eb94ea49ef552d068b43f5269339d6e9938c0aaef5c3a
+ size 5179
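
The three lines above are a Git LFS pointer, not the file itself; the actual `training_args.bin` is only 5,179 bytes because it is, by Hugging Face Trainer convention, a pickled `TrainingArguments` object saved with `torch.save`. A hedged sketch for inspecting it, assuming `transformers` is installed so the pickle can resolve its classes:

```python
import torch

# This is a full pickle, not a plain tensor checkpoint; on recent PyTorch
# versions you may need to pass weights_only=False to torch.load.
args = torch.load("training_args.bin")
print(args.learning_rate, args.num_train_epochs, args.per_device_train_batch_size)
```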
zero_to_fp32.py ADDED
@@ -0,0 +1,587 @@
+ #!/usr/bin/env python
+
+ # Copyright (c) Microsoft Corporation.
+ # SPDX-License-Identifier: Apache-2.0
+
+ # DeepSpeed Team
+
+ # This script extracts fp32 consolidated weights from ZeRO 1, 2 and 3 DeepSpeed checkpoints. It gets
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
+ # application.
+ #
+ # example: python zero_to_fp32.py . pytorch_model.bin
+
+ import argparse
+ import torch
+ import glob
+ import math
+ import os
+ import re
+ from collections import OrderedDict
+ from dataclasses import dataclass
+
+ # While this script doesn't use deepspeed to recover data, the checkpoints are pickled with
+ # DeepSpeed data structures, so deepspeed has to be available in the current python environment.
+ from deepspeed.utils import logger
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
+                                             FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
+                                             FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
+
+
+ @dataclass
+ class zero_model_state:
+     buffers: dict
+     param_shapes: dict
+     shared_params: list
+     ds_version: int
+     frozen_param_shapes: dict
+     frozen_param_fragments: dict
+
+
+ debug = 0
+
+ # load to cpu
+ device = torch.device('cpu')
+
+
+ def atoi(text):
+     return int(text) if text.isdigit() else text
+
+
+ def natural_keys(text):
+     '''
+     alist.sort(key=natural_keys) sorts in human order
+     http://nedbatchelder.com/blog/200712/human_sorting.html
+     (See Toothy's implementation in the comments)
+     '''
+     return [atoi(c) for c in re.split(r'(\d+)', text)]
+
+
+ def get_model_state_file(checkpoint_dir, zero_stage):
+     if not os.path.isdir(checkpoint_dir):
+         raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
+
+     # there should be only one file
+     if zero_stage <= 2:
+         file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
+     elif zero_stage == 3:
+         file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
+
+     if not os.path.exists(file):
+         raise FileNotFoundError(f"can't find model states file at '{file}'")
+
+     return file
+
+
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
+     # XXX: need to test that this simple glob rule works for multi-node setup too
+     ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
+
+     if len(ckpt_files) == 0:
+         raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
+
+     return ckpt_files
+
+
+ def get_optim_files(checkpoint_dir):
+     return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
+
+
+ def get_model_state_files(checkpoint_dir):
+     return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
+
+
+ def parse_model_states(files):
+     zero_model_states = []
+     for file in files:
+         state_dict = torch.load(file, map_location=device)
+
+         if BUFFER_NAMES not in state_dict:
+             raise ValueError(f"{file} is not a model state checkpoint")
+         buffer_names = state_dict[BUFFER_NAMES]
+         if debug:
+             print("Found buffers:", buffer_names)
+
+         # recover just the buffers while restoring them to fp32 if they were saved in fp16
+         buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
+         param_shapes = state_dict[PARAM_SHAPES]
+
+         # collect parameters that are included in param_shapes
+         param_names = []
+         for s in param_shapes:
+             for name in s.keys():
+                 param_names.append(name)
+
+         # update with frozen parameters
+         frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
+         if frozen_param_shapes is not None:
+             if debug:
+                 print(f"Found frozen_param_shapes: {frozen_param_shapes}")
+             param_names += list(frozen_param_shapes.keys())
+
+         # handle shared params
+         shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
+
+         ds_version = state_dict.get(DS_VERSION, None)
+
+         frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
+
+         z_model_state = zero_model_state(buffers=buffers,
+                                          param_shapes=param_shapes,
+                                          shared_params=shared_params,
+                                          ds_version=ds_version,
+                                          frozen_param_shapes=frozen_param_shapes,
+                                          frozen_param_fragments=frozen_param_fragments)
+         zero_model_states.append(z_model_state)
+
+     return zero_model_states
+
+
+ def parse_optim_states(files, ds_checkpoint_dir):
+
+     total_files = len(files)
+     state_dicts = []
+     for f in files:
+         state_dict = torch.load(f, map_location=device)
+         # immediately discard the two potentially huge optimizer states as we only care for fp32 master weights,
+         # and also handle the case where it was already removed by another helper script
+         state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
+         state_dicts.append(state_dict)
+
+     if ZERO_STAGE not in state_dicts[0][OPTIMIZER_STATE_DICT]:
+         raise ValueError(f"{files[0]} is not a zero checkpoint")
+     zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
+     world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
+
+     # For ZeRO-2 each param group can have a different partition_count, as data parallelism for expert
+     # parameters can be different from data parallelism for non-expert parameters. So we can just
+     # use the max of the partition_count to get the dp world_size.
+
+     if type(world_size) is list:
+         world_size = max(world_size)
+
+     if world_size != total_files:
+         raise ValueError(
+             f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
+             "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
+         )
+
+     # the groups are named differently in each stage
+     if zero_stage <= 2:
+         fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
+     elif zero_stage == 3:
+         fp32_groups_key = FP32_FLAT_GROUPS
+     else:
+         raise ValueError(f"unknown zero stage {zero_stage}")
+
+     if zero_stage <= 2:
+         fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
+     elif zero_stage == 3:
+         # if there is more than one param group, there will be multiple flattened tensors - one
+         # flattened tensor per group - for simplicity merge them into a single tensor
+         #
+         # XXX: could make the script more memory efficient for when there are multiple groups - it
+         # will require matching the sub-lists of param_shapes for each param group flattened tensor
+
+         fp32_flat_groups = [
+             torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
+         ]
+
+     return zero_stage, world_size, fp32_flat_groups
+
+
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir):
+     """
+     Returns fp32 state_dict reconstructed from ds checkpoint
+
+     Args:
+         - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
+
+     """
+     print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
+
+     optim_files = get_optim_files(ds_checkpoint_dir)
+     zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
+     print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
+
+     model_files = get_model_state_files(ds_checkpoint_dir)
+
+     zero_model_states = parse_model_states(model_files)
+     print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
+
+     if zero_stage <= 2:
+         return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states)
+     elif zero_stage == 3:
+         return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states)
+
+
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
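+     # Frozen (requires_grad=False) params are not tracked by the optimizer, so they
+     # have no fp32 partitions in the *_optim_states.pt files; their values are taken
+     # directly from the fragments saved in the model states file.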
+     if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+         return
+
+     frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+     frozen_param_fragments = zero_model_states[0].frozen_param_fragments
+
+     if debug:
+         num_elem = sum(s.numel() for s in frozen_param_shapes.values())
+         print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+
+     wanted_params = len(frozen_param_shapes)
+     wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+     avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
+     print(f'Frozen params: Have {avail_numel} numels to process.')
+     print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+
+     total_params = 0
+     total_numel = 0
+     for name, shape in frozen_param_shapes.items():
+         total_params += 1
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+
+         state_dict[name] = frozen_param_fragments[name]
+
+         if debug:
+             print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+
+     print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+     param_shapes = zero_model_states[0].param_shapes
+
+     # Reconstruction protocol:
+     #
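+     # Each rank saved one flat fp32 tensor per optimizer param group. Concatenating
+     # the per-rank partitions of a group in rank order restores that group's full
+     # flattened tensor; each param is then sliced back out of it by offset/numel
+     # and reshaped to its original shape below.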
+
+     if debug:
+         for i in range(world_size):
+             for j in range(len(fp32_flat_groups[0])):
+                 print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
+
+     # XXX: memory usage doubles here (zero2)
+     num_param_groups = len(fp32_flat_groups[0])
+     merged_single_partition_of_fp32_groups = []
+     for i in range(num_param_groups):
+         merged_partitions = [sd[i] for sd in fp32_flat_groups]
+         full_single_fp32_vector = torch.cat(merged_partitions, 0)
+         merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
+     avail_numel = sum(
+         [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
+
+     if debug:
+         wanted_params = sum([len(shapes) for shapes in param_shapes])
+         wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
+         # not asserting if there is a mismatch due to possible padding
+         print(f"Have {avail_numel} numels to process.")
+         print(f"Need {wanted_numel} numels in {wanted_params} params.")
+
+     # params
+     # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+     # out-of-core computing solution
+     total_numel = 0
+     total_params = 0
+     for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
+         offset = 0
+         avail_numel = full_single_fp32_vector.numel()
+         for name, shape in shapes.items():
+
+             unpartitioned_numel = shape.numel()
+             total_numel += unpartitioned_numel
+             total_params += 1
+
+             if debug:
+                 print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+             state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
+             offset += unpartitioned_numel
+
+         # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
+         # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
+         # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
+         # live optimizer object, so we are checking that the numbers are within the right range
+         align_to = 2 * world_size
+
+         def zero2_align(x):
+             return align_to * math.ceil(x / align_to)
+
+         if debug:
+             print(f"original offset={offset}, avail_numel={avail_numel}")
+
+         offset = zero2_align(offset)
+         avail_numel = zero2_align(avail_numel)
+
+         if debug:
+             print(f"aligned offset={offset}, avail_numel={avail_numel}")
+
+         # Sanity check
+         if offset != avail_numel:
+             raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+
+     print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states):
+     state_dict = OrderedDict()
+
+     # buffers
+     buffers = zero_model_states[0].buffers
+     state_dict.update(buffers)
+     if debug:
+         print(f"added {len(buffers)} buffers")
+
+     _zero2_merge_frozen_params(state_dict, zero_model_states)
+
+     _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+
+     # recover shared parameters
+     for pair in zero_model_states[0].shared_params:
+         if pair[1] in state_dict:
+             state_dict[pair[0]] = state_dict[pair[1]]
+
+     return state_dict
+
+
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
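+     # e.g. unpartitioned_numel=10 with world_size=4: every rank stores
+     # ceil(10 / 4) = 3 numels, and the last partition carries 4 - 10 % 4 = 2
+     # numels of padding (3 * 4 = 10 + 2)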
+     remainder = unpartitioned_numel % world_size
+     padding_numel = (world_size - remainder) if remainder else 0
+     partitioned_numel = math.ceil(unpartitioned_numel / world_size)
+     return partitioned_numel, padding_numel
+
+
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
+     if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+         return
+
+     if debug:
+         for i in range(world_size):
+             num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
+             print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+
+     frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+     wanted_params = len(frozen_param_shapes)
+     wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+     avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
+     print(f'Frozen params: Have {avail_numel} numels to process.')
+     print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+
+     total_params = 0
+     total_numel = 0
+     for name, shape in zero_model_states[0].frozen_param_shapes.items():
+         total_params += 1
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+
+         param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
+         state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
+
+         partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+
+         if debug:
+             print(
+                 f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+             )
+
+     print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+     param_shapes = zero_model_states[0].param_shapes
+     avail_numel = fp32_flat_groups[0].numel() * world_size
+     # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
+     # param, re-consolidating each param, while dealing with padding if any
+
+     # merge list of dicts, preserving order
+     param_shapes = {k: v for d in param_shapes for k, v in d.items()}
+
+     if debug:
+         for i in range(world_size):
+             print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
+
+     wanted_params = len(param_shapes)
+     wanted_numel = sum(shape.numel() for shape in param_shapes.values())
+     # not asserting if there is a mismatch due to possible padding
+     avail_numel = fp32_flat_groups[0].numel() * world_size
+     print(f"Trainable params: Have {avail_numel} numels to process.")
+     print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
+
+     # params
+     # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+     # out-of-core computing solution
+     offset = 0
+     total_numel = 0
+     total_params = 0
+     for name, shape in param_shapes.items():
+
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+         total_params += 1
+
+         partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+
+         if debug:
+             print(
+                 f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+             )
+
+         # XXX: memory usage doubles here
+         state_dict[name] = torch.cat(
+             tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
+             0).narrow(0, 0, unpartitioned_numel).view(shape)
+         offset += partitioned_numel
+
433
+ offset *= world_size
434
+
435
+ # Sanity check
436
+ if offset != avail_numel:
437
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
438
+
439
+ print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
440
+
441
+
442
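+ # A minimal sketch of the offset walk above, with made-up flat groups (added
+ # illustration, never called by the script). Each rank's flat group stores its
+ # slice of every param back to back, so a param is rebuilt by taking the same
+ # [offset, offset + partitioned_numel) window from every rank:
+ def _example_trainable_reassembly():
+     world_size = 2
+     # Per rank: 2 elements of param 'a' (shape (2, 2)), then 2 slots of param
+     # 'b' (3 real elements overall + 1 padding slot, here -1., on rank 1).
+     flat_groups = [torch.tensor([0., 1., 100., 101.]), torch.tensor([2., 3., 102., -1.])]
+     state_dict, offset = {}, 0
+     for name, shape in [("a", (2, 2)), ("b", (3,))]:
+         unpartitioned_numel = math.prod(shape)
+         partitioned_numel, _ = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+         state_dict[name] = torch.cat(
+             tuple(flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
+             0).narrow(0, 0, unpartitioned_numel).view(shape)
+         offset += partitioned_numel
+     return state_dict  # {'a': [[0., 1.], [2., 3.]], 'b': [100., 101., 102.]}
+
+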
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states):
+     state_dict = OrderedDict()
+
+     # buffers
+     buffers = zero_model_states[0].buffers
+     state_dict.update(buffers)
+     if debug:
+         print(f"added {len(buffers)} buffers")
+
+     _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
+
+     _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+
+     # recover shared parameters
+     for pair in zero_model_states[0].shared_params:
+         if pair[1] in state_dict:
+             state_dict[pair[0]] = state_dict[pair[1]]
+
+     return state_dict
+
+
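+ # Illustration of the shared-parameter recovery above, with hypothetical
+ # parameter names: if ``shared_params`` contains the pair
+ # ['lm_head.weight', 'model.embed_tokens.weight'], the reconstructed
+ # embedding tensor is simply re-used for the tied output head:
+ #     state_dict['lm_head.weight'] = state_dict['model.embed_tokens.weight']
+
+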
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None):
+     """
+     Convert a ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
+     ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
+     via a model hub.
+
+     Args:
+         - ``checkpoint_dir``: path to the desired checkpoint folder
+         - ``tag``: checkpoint tag used as a unique identifier for the checkpoint. If not provided, will attempt to read the tag from the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+
+     Returns:
+         - pytorch ``state_dict``
+
+     Note: this approach may not work if your application doesn't have sufficient free CPU memory; in
+     that case you may need to do the conversion offline using the ``zero_to_fp32.py`` script that is
+     saved with the checkpoint.
+
+     A typical usage might be ::
+
+         from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+         # do the training and checkpoint saving
+         state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
+         model = model.cpu()  # move to cpu
+         model.load_state_dict(state_dict)
+         # submit to model hub or save the model to share with others
+
+     In this example the ``model`` will no longer be usable in the deepspeed context of the same
+     application, i.e. you will need to re-initialize the deepspeed engine, since
+     ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+
+     If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
+
+     """
+     if tag is None:
+         latest_path = os.path.join(checkpoint_dir, 'latest')
+         if os.path.isfile(latest_path):
+             with open(latest_path, 'r') as fd:
+                 tag = fd.read().strip()
+         else:
+             raise ValueError(f"Unable to find 'latest' file at {latest_path}")
+
+     ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
+
+     if not os.path.isdir(ds_checkpoint_dir):
+         raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
+
+     return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir)
+
+
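+ # Illustration of the tag resolution above, with a hypothetical folder layout:
+ #
+ #     checkpoint_dir/
+ #         latest            <- text file whose content is the tag, e.g. "global_step14"
+ #         global_step14/    <- the per-rank ZeRO checkpoint shards
+ #
+ # get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) reads the tag from
+ # ``latest`` and consolidates the shards found under ``global_step14``.
+
+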
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None):
+     """
+     Convert a ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
+     loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
+
+     Args:
+         - ``checkpoint_dir``: path to the desired checkpoint folder (one that contains the tag-folder, like ``global_step14``)
+         - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
+         - ``tag``: checkpoint tag used as a unique identifier for the checkpoint. If not provided, will attempt to read the tag from the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+     """
+
+     state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
+     print(f"Saving fp32 state dict to {output_file}")
+     torch.save(state_dict, output_file)
+
+
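+ # A minimal sketch of the offline round-trip (hypothetical paths taken from
+ # the CLI help below; added illustration, never called by the script):
+ def _example_offline_conversion():
+     convert_zero_checkpoint_to_fp32_state_dict("path/checkpoint-12", "path/checkpoint-12/pytorch_model.bin")
+     # The output file is a plain PyTorch state_dict, loadable without DeepSpeed.
+     return torch.load("path/checkpoint-12/pytorch_model.bin")
+
+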
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
+     """
+     1. Put the provided model on the CPU
+     2. Convert a ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
+     3. Load it into the provided model
+
+     Args:
+         - ``model``: the model object to update
+         - ``checkpoint_dir``: path to the desired checkpoint folder (one that contains the tag-folder, like ``global_step14``)
+         - ``tag``: checkpoint tag used as a unique identifier for the checkpoint. If not provided, will attempt to read the tag from the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+
+     Returns:
+         - ``model``: the modified model
+
+     Make sure you have plenty of CPU memory available before you call this function. If you don't
+     have enough, use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
+     conveniently placed for you in the checkpoint folder.
+
+     A typical usage might be ::
+
+         from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+         model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+         # submit to model hub or save the model to share with others
+
+     Note that once this has been run, the ``model`` will no longer be usable in the deepspeed context
+     of the same application, i.e. you will need to re-initialize the deepspeed engine, since
+     ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+
+     """
+     logger.info("Extracting fp32 weights")
+     state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
+
+     logger.info("Overwriting model with fp32 weights")
+     model = model.cpu()
+     model.load_state_dict(state_dict, strict=False)
+
+     return model
+
+
+ if __name__ == "__main__":
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("checkpoint_dir",
+                         type=str,
+                         help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
+     parser.add_argument(
+         "output_file",
+         type=str,
+         help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
+     parser.add_argument("-t",
+                         "--tag",
+                         type=str,
+                         default=None,
+                         help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
+     parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
+     args = parser.parse_args()
+
+     debug = args.debug
+
+     convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir, args.output_file, tag=args.tag)
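+
+ # Typical CLI usage of this script (paths as in the argparse help above):
+ #
+ #     python zero_to_fp32.py path/checkpoint-12 path/checkpoint-12/pytorch_model.bin
+ #
+ # Optionally pass ``-t global_step1`` to select a specific tag and ``-d`` to
+ # enable debug output.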