Upload README.md with huggingface_hub
README.md CHANGED
@@ -194,10 +194,10 @@ We used the open-source tool OpenCompass to evaluate the model and compared it w
 
 4. **Model Merging**
 
-   Automatic evaluations on 360's white-box evaluation set v4 revealed that different models excel in different skills. A model merging scheme was considered
+   Automatic evaluations on 360's white-box evaluation set v4 revealed that different models excel in different skills. A model merging scheme was therefore adopted to produce the final chat model.
 
 ### Model Performance
-We evaluated the
+We evaluated the 360Zhinao2-7B-Chat-4k model on IFEval, MT-bench, and CFBench, all popular benchmarks for evaluating chat-model ability. The results show that 360Zhinao2-7B is highly competitive: its IFEval (strict prompt) score is the highest among open-source 7B models, second only to GLM4-9B. Detailed results are shown in the table below:
 
 | Model                 | MT-bench | IFEval (strict prompt) | CFBench CSR | CFBench ISR | CFBench PSR |
 |-----------------------|----------|------------------------|-------------|-------------|-------------|
@@ -205,7 +205,7 @@ We evaluated the 360Zhinao2-7B-Chat-4k model on several classic tasks. The IFEva
 | Yi-9B-16k-Chat        | 7.44     | 0.455                  | 0.75        | 0.4         | 0.52        |
 | GLM4-9B-Chat          | **8.08** | **0.634**              | **0.82**    | 0.48        | 0.61        |
 | InternLM2.5-7B-Chat   | 7.39     | 0.540                  | 0.78        | 0.4         | 0.54        |
-
+| 360Zhinao2-7B-Chat-4k | 7.86     | **0.577**              | 0.8         | 0.44        | 0.57        |
 
 ### Long Context Fine-Tuning
 Similar to the method used when open-sourcing 360Zhinao1, we expanded the RoPE base to 1,000,000 and 50,000,000 and sequentially concatenated SFT data of mixed long and short texts to 32k and 360k. By combining techniques like gradient checkpointing, ZeRO3 offload, and ring attention, we fine-tuned models to achieve 32k and 360k long-context capabilities. These models ranked in the top tier across various 32k benchmarks.
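For readers unfamiliar with the model-merging step above: the README does not specify 360's actual scheme, but a common baseline is linear weight averaging of fine-tuned checkpoints that excel at different skills. Below is a minimal sketch under that assumption; the checkpoint paths and the 0.5/0.5 weighting are illustrative, not 360's recipe.

```python
# Minimal sketch of linear weight averaging, one common model-merging scheme.
# The checkpoint paths and equal 0.5/0.5 weights are hypothetical.
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoints that excel at different skills; both must share
# exactly the same architecture for tensor-wise averaging to be valid.
model_a = AutoModelForCausalLM.from_pretrained("ckpt-skill-a", torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained("ckpt-skill-b", torch_dtype=torch.bfloat16)

state_a, state_b = model_a.state_dict(), model_b.state_dict()

# Linearly average every parameter tensor.
merged = {name: 0.5 * state_a[name] + 0.5 * state_b[name] for name in state_a}

model_a.load_state_dict(merged)
model_a.save_pretrained("ckpt-merged")  # hypothetical output path
```

In practice the mixing weights would be tuned per skill on an evaluation set such as the white-box set v4 mentioned above.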
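The RoPE base expansion described under Long Context Fine-Tuning corresponds to raising the rotary base in the model config before long-context SFT. A minimal sketch with Hugging Face transformers, assuming a LLaMA-style config that exposes `rope_theta` and `max_position_embeddings` (the repo id is illustrative; the released 360Zhinao2 config fields may differ):

```python
# Minimal sketch: raise the RoPE base before long-context SFT, per the values
# in the paragraph above (1,000,000 for the 32k stage, 50,000,000 for 360k).
# The repo id and config field names are assumptions, not confirmed by the README.
from transformers import AutoConfig, AutoModelForCausalLM

REPO = "qihoo360/360Zhinao2-7B-Base"  # illustrative repo id

config = AutoConfig.from_pretrained(REPO, trust_remote_code=True)
config.rope_theta = 1_000_000            # RoPE base for the 32k stage
config.max_position_embeddings = 32768   # extend the context window to 32k

model = AutoModelForCausalLM.from_pretrained(REPO, config=config, trust_remote_code=True)
# Mixed long/short SFT samples would then be concatenated to 32k tokens and the
# model fine-tuned with gradient checkpointing, ZeRO3 offload, and ring attention.
```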