zhaicunqi committed on
Commit
40f41e5
1 Parent(s): c2eda12

Upload README.md with huggingface_hub

Files changed (1): README.md (+3 -3)
README.md CHANGED
@@ -194,10 +194,10 @@ We used the open-source tool OpenCompass to evaluate the model and compared it w
 
 4. **Model Merging**
 
-Automatic evaluations on 360's white-box evaluation set v4 revealed that different models excel in different skills. A model merging scheme was considered. Using the sft model as the base, interpolation with model v1 was performed, followed by extrapolation between the sft model and model v1 with an extrapolation coefficient of 0.2. This resulted in the final 360Zhicao2-7B-Chat-4k model.
+Automatic evaluations on 360's white-box evaluation set v4 revealed that different models excel at different skills, so a model merging scheme was adopted to produce the final chat model.
 
 ### Model Performance
-We evaluated the 360Zhicao2-7B-Chat-4k model on several classic tasks. The IFEval (prompt strict) score was second only to GLM4-9B, making it the highest among open-source 7B models. It ranked third on MT-bench, slightly behind Qwen2.5-7B, and second among 7B models. It placed third on CF-Bench, and for PSR, it was second only to GLM4-9B. Detailed results are shown in the table below:
+We evaluated the 360Zhinao2-7B-Chat-4k model on IFEval, MT-bench, and CFBench, all popular benchmarks of chat-model ability. The results show that 360Zhinao2-7B is highly competitive: its IFEval (prompt strict) score is the highest among open-source 7B models, second only to GLM4-9B. Detailed results are shown in the table below:
 
 | Model | MT-bench | IFEval (strict prompt) | CFBench CSR | CFBench ISR | CFBench PSR |
 |----------------------|----------|------------------------|-------------|-------------|-------------|
@@ -205,7 +205,7 @@ We evaluated the 360Zhicao2-7B-Chat-4k model on several classic tasks. The IFEva
 | Yi-9B-16k-Chat | 7.44 | 0.455 | 0.75 | 0.4 | 0.52 |
 | GLM4-9B-Chat | **8.08** | **0.634** | **0.82** | 0.48 | 0.61 |
 | InternLM2.5-7B-Chat | 7.39 | 0.540 | 0.78 | 0.4 | 0.54 |
-| 360Zhicao2-7B-Chat-4k| 7.86 | **0.577** | 0.8 | 0.44 | 0.57 |
+| 360Zhinao2-7B-Chat-4k| 7.86 | **0.577** | 0.8 | 0.44 | 0.57 |
 
 ### Long Context Fine-Tuning
  Similar to the method used during the open-sourcing of 360Zhinao1, we expanded the RoPE base to 1,000,000 and 50,000,000, sequentially concatenated SFT data of mixed long and short texts to 32k and 360k. By combining techniques like gradient checkpointing, ZeRO3 offload, and ring attention, we fine-tuned models to achieve 32k and 360k long context capabilities. These models ranked in the top tier across various 32k benchmarks.
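
The removed paragraph above describes the merge as interpolation with model v1 on top of the sft base, followed by extrapolation between the two with a coefficient of 0.2. A minimal weight-space sketch of those two steps (the dict-of-scalars layout, function names, and the interpolation weight `alpha` are assumptions for illustration, not the repo's actual code):

```python
def interpolate(sft, v1, alpha=0.5):
    # Linear interpolation per parameter: (1 - alpha) * sft + alpha * v1.
    # alpha=0.5 is an assumed default; the README does not state the value.
    return {name: (1 - alpha) * sft[name] + alpha * v1[name] for name in sft}

def extrapolate(sft, v1, coef=0.2):
    # Extrapolation between the sft model and v1 with coefficient 0.2,
    # i.e. pushing the sft weights away from v1: sft + coef * (sft - v1).
    return {name: sft[name] + coef * (sft[name] - v1[name]) for name in sft}

# Toy example with scalar "parameters"; real checkpoints hold tensors.
sft = {"w": 1.0}
v1 = {"w": 0.5}
merged = extrapolate(interpolate(sft, v1, alpha=0.5), v1, coef=0.2)
```

With real checkpoints the same per-parameter arithmetic would run over every tensor in the state dict.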
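
The long-context section expands the RoPE base to 1,000,000 and 50,000,000. A sketch of why a larger base stretches the usable context, using the standard RoPE frequency formula (the head dimension of 128 is an assumption; only the base values come from the README):

```python
def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE: rotation frequency theta_i = base^(-2i / head_dim) for
    # each channel pair i. At position p the rotation angle is p * theta_i,
    # so a larger base slows the rotation and keeps positions distinguishable
    # over a much longer range.
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

short = rope_frequencies(128, base=10_000.0)          # common default base
long_32k = rope_frequencies(128, base=1_000_000.0)    # 32k fine-tuning
long_360k = rope_frequencies(128, base=50_000_000.0)  # 360k fine-tuning
```

Every non-constant frequency shrinks as the base grows, which is the mechanism behind extending the window before the long-context SFT described above.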