Upload README.md with huggingface_hub
README.md CHANGED
@@ -194,10 +194,10 @@ We used the open-source tool OpenCompass to evaluate the model and compared it w
 
 4. **Model Merging**
 
-   Automatic evaluations on 360's white-box evaluation set v4 revealed that different models excel in different skills. A model merging scheme was considered
+   Automatic evaluations on 360's white-box evaluation set v4 revealed that different models excel in different skills. A model merging scheme was therefore adopted to produce the final chat model.
 
 ### Model Performance
-We evaluated the
+We evaluated the 360Zhinao2-7B-Chat-4k model on IFEval, MT-bench, and CFBench, all popular benchmarks for evaluating chat-model ability. The results show that 360Zhinao2-7B is highly competitive: its IFEval (strict prompt) score is the highest among open-source 7B models, second only to GLM4-9B. Detailed results are shown in the table below:
 
 | Model                 | MT-bench | IFEval (strict prompt) | CFBench CSR | CFBench ISR | CFBench PSR |
 |-----------------------|----------|------------------------|-------------|-------------|-------------|
@@ -205,7 +205,7 @@ We evaluated the 360Zhinao2-7B-Chat-4k model on several classic tasks. The IFEva
 | Yi-9B-16k-Chat        | 7.44     | 0.455                  | 0.75        | 0.4         | 0.52        |
 | GLM4-9B-Chat          | **8.08** | **0.634**              | **0.82**    | 0.48        | 0.61        |
 | InternLM2.5-7B-Chat   | 7.39     | 0.540                  | 0.78        | 0.4         | 0.54        |
-
+| 360Zhinao2-7B-Chat-4k | 7.86     | **0.577**              | 0.8         | 0.44        | 0.57        |
 
 ### Long Context Fine-Tuning
 Similar to the method used when open-sourcing 360Zhinao1, we expanded the RoPE base to 1,000,000 and 50,000,000 and sequentially concatenated SFT data of mixed long and short texts to 32k and 360k. By combining techniques like gradient checkpointing, ZeRO3 offload, and ring attention, we fine-tuned models to achieve 32k and 360k long-context capabilities. These models ranked in the top tier across various 32k benchmarks.
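For readers unfamiliar with the model-merging step above: the README does not specify 360's actual scheme, but a common baseline is linear weight averaging of fine-tuned checkpoints that excel at different skills. Below is a minimal sketch under that assumption; the checkpoint paths and the 0.5/0.5 weighting are illustrative, not 360's recipe.

```python
# Minimal sketch of linear weight averaging, one common model-merging scheme.
# The checkpoint paths and equal 0.5/0.5 weights are hypothetical.
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoints that excel at different skills; both must share
# exactly the same architecture for tensor-wise averaging to be valid.
model_a = AutoModelForCausalLM.from_pretrained("ckpt-skill-a", torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained("ckpt-skill-b", torch_dtype=torch.bfloat16)

state_a, state_b = model_a.state_dict(), model_b.state_dict()

# Linearly average every parameter tensor.
merged = {name: 0.5 * state_a[name] + 0.5 * state_b[name] for name in state_a}

model_a.load_state_dict(merged)
model_a.save_pretrained("ckpt-merged")  # hypothetical output path
```

In practice the mixing weights would be tuned per skill on an evaluation set such as the white-box set v4 mentioned above.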
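The RoPE base expansion described under Long Context Fine-Tuning corresponds to raising the rotary base in the model config before long-context SFT. A minimal sketch with Hugging Face transformers, assuming a LLaMA-style config that exposes `rope_theta` and `max_position_embeddings` (the repo id is illustrative; the released 360Zhinao2 config fields may differ):

```python
# Minimal sketch: raise the RoPE base before long-context SFT, per the values
# in the paragraph above (1,000,000 for the 32k stage, 50,000,000 for 360k).
# The repo id and config field names are assumptions, not confirmed by the README.
from transformers import AutoConfig, AutoModelForCausalLM

REPO = "qihoo360/360Zhinao2-7B-Base"  # illustrative repo id

config = AutoConfig.from_pretrained(REPO, trust_remote_code=True)
config.rope_theta = 1_000_000            # RoPE base for the 32k stage
config.max_position_embeddings = 32768   # extend the context window to 32k

model = AutoModelForCausalLM.from_pretrained(REPO, config=config, trust_remote_code=True)
# Mixed long/short SFT samples would then be concatenated to 32k tokens and the
# model fine-tuned with gradient checkpointing, ZeRO3 offload, and ring attention.
```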