IDEA-CCNL/Ziya-Writing-13B-v2

@@ -7,14 +7,6 @@ library_name: transformers
 pipeline_tag: text-generation
 ---
-# 姜子牙系列模型
-- [Ziya-LLaMA-13B-v1.1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1.1)
-- [Ziya-LLaMA-13B-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1)
-- [Ziya-LLaMA-7B-Reward](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-7B-Reward)
-- [Ziya-LLaMA-13B-Pretrain-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-Pretrain-v1)
-- [Ziya-BLIP2-14B-Visual-v1](https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1)
 ## 简介 Brief Introduction
 姜子牙写作大模型V2是基于LlaMa-2的130亿参数的指令微调模型，在写作任务上进行了能力增强，是专注于写作的大模型。姜子牙写作模型可以完成公文报告、讲稿书信、创意文案等多类的写作任务。
@@ -44,25 +36,27 @@ pip install torch==1.12.1 tokenizers==0.13.3 git+https://github.com/huggingface/
 最后，我们利用evol-instruct的方法，生成了约30万条高质量的通用指令数据。我们混合了通用指令数据和写作指令数据，这使得ziya-writing-v2不仅拥有良好的意图理解能力，也能够生成优秀的回答。
-### 对齐学习 Alignment training
-我们在实验中发现，利用少量人类标注的高质量的写作排序数据，使用强化学习训练模型，就能对进一步拔高模型的写作效果。
-为了进一步提升模型的表现，使其能够充分理解人类意图、减少“幻觉”和不安全的输出，基于指令微调后的模型，进行了人类反馈训练（Human-Feedback Training，HFT）。在训练中，我们采用了以人类反馈强化学习（RM、PPO）为主。
-我们在内部自研的框架上实现了HFT的训练流程，该框架可以利用最少8张40G的A100显卡完成Ziya-Writing-LLaMA-13B-v1的全参数训练。在PPO训练中，我们没有限制生成样本的长度，以确保长文本任务的奖励准确性。每次训练的总经验池尺寸超过100k样本，确保了训练的充分性。
-In our experiment, we found that by using a small amount of high-quality human-annotated writing ranking data and training the model with reinforcement learning, we could effectively improve the writing performance of the model.
-To further improve the performance of the model, enabling it to fully understand human intentions, reduce "hallucinations" and unsafe outputs, we conducted Human-Feedback Training (HFT) based on the model fine-tuned with instructions. In the training process, we used human feedback reinforcement learning (RM, PPO).
-We implemented the HFT training process on an internally developed framework, which can use a minimum of 8 40GB A100 GPUs to complete the full parameter training of Ziya-Writing-LLaMA-13B-v1. In the PPO training, we did not limit the length of the generated samples to ensure the accuracy of rewards for long-text tasks. The total experience pool size for each training exceeded 100k samples, ensuring the sufficiency of the training.
 ### 效果评估 Performance
-写作文案的优劣评价是一个较为主观的评判，很难用一个准确率或者满意度的打分来衡量。因此，我们使用了匿名模型多人Side-by-Side评估的机制，收集了100条不同难度的写作指令数据进行评估，我们后续也会公开这个评测集。
 我们以胜出率作为评价模型好坏的指标，一个模型的胜出率计算公式为：
@@ -78,16 +72,10 @@ Win Rate = (Number of wins for the model + Number of draws / 2) / Total number o
 Because most language models generate responses by sampling, a win rate above 55% indicates that the model significantly outperforms the other model, a win rate below 45% shows that it clearly lags behind, and a win rate between 45% and 55% means the two models are essentially on par.
-|  Ziya-Writing-LLaMa-13B-v1  | 平均胜出率       | 最大胜出率      | 最小胜出率    |
-|  :----:  | :----:  | :----:  | :----:  |
-| vs Ziya-LLaMa-13B-v1.1 | 70.7 | 73.5 | 69 |
-| vs baichuan-vicuna-7b | 69.6 | 73.5 | 68 |
-| vs Moss-16B | 65.1 | 69 | 62 |
-| vs ChatGLM2-6B | 58.3 | 61.5 | 56 |
-| vs Minimax-abab5 | 52.3 | 53 | 50.5 |
-| vs GPT-3.5-turbo | 44.7 | 49.5 | 38 |
-（注：最大胜出率和最小胜出率，是对每一个标注人员的标注结果进行单独统计，计算出最大和最小的得分；平均胜出率是对所有标注人员的标注结果进行汇总统计，计算出平均的得分。）
 ## <span id="jump"> 使用 Usage </span>
@@ -101,14 +89,14 @@ import torch
 device = torch.device("cuda")
 query="帮我写一份去西安的旅游计划"
-model = LlamaForCausalLM.from_pretrained("IDEA-CCNL/Ziya-Writing-LLaMa-13B-v1", torch_dtype=torch.float16, device_map="auto")
-tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Ziya-Writing-LLaMa-13B-v1", use_fast=False)
 inputs = '<human>:' + query.strip() + '\n<bot>:'
 input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(device)
 generate_ids = model.generate(
             input_ids,
-            max_new_tokens=2048,
             do_sample = True,
             top_p = 0.85,
             temperature = 0.85,
@@ -121,14 +109,6 @@ print(output)
 ```
-## 微调示例 Finetune Example
-Refer to [ziya_finetune](https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/ziya_llama)
-## 推理量化示例 Inference & Quantization Example
-Refer to [ziya_inference](https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/ziya_inference)
 ## 引用 Citation
 如果您在您的工作中使用了我们的模型，可以引用我们的[论文](https://arxiv.org/abs/2210.08590)：

 pipeline_tag: text-generation
 ---
 ## 简介 Brief Introduction
 姜子牙写作大模型V2是基于LlaMa-2的130亿参数的指令微调模型，在写作任务上进行了能力增强，是专注于写作的大模型。姜子牙写作模型可以完成公文报告、讲稿书信、创意文案等多类的写作任务。
 最后，我们利用evol-instruct的方法，生成了约30万条高质量的通用指令数据。我们混合了通用指令数据和写作指令数据，这使得ziya-writing-v2不仅拥有良好的意图理解能力，也能够生成优秀的回答。
+We have collected and cleaned a large amount of authentic human writing data from the internet. Using GPT-3.5, we generated corresponding writing prompts and conducted rigorous manual verification.
+Additionally, we trained an Answer-to-Instruction model to generate high-quality enhanced writing prompt data from unsupervised writing data, further improving the quality of our data.
+Based on this, we carefully selected more challenging writing prompts using a reward model and specific cleaning logic, filtering out simple data and ensuring prompt diversity.
+Finally, using the evol-instruct method, we generated approximately 300,000 high-quality general instruction examples. Mixing these with the writing prompt data gives ziya-writing-v2 both strong intent understanding and the ability to generate excellent responses.
+### 对齐学习 Alignment training
+我们使用GPT4、Minimax、Baichuan2、Qwen-14B等优秀的对话模型，对同一个指令生成不同的回答，我们利用奖励模型对不同的回答进行排序，形成偏好数据。
+我们使用了SFT-like Alignment的方法进行对齐训练，我们在内部自研的框架上实现了Alignment的训练流程，训练使用了8k的上下文窗口，一共约2万的偏好数据。
+We use strong chat models such as GPT4, Minimax, Baichuan2, and Qwen-14B to generate different responses to the same instruction, then rank those responses with a reward model to form preference data.
+We use an SFT-like Alignment method for alignment training and implement the training process on our internally developed framework. Training uses an 8k context window and about 20,000 preference examples in total.
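
The ranking step described above can be illustrated with a minimal sketch. It assumes the preference data is stored as (chosen, rejected) pairs, which the model card does not specify; `build_preference_pairs` and the toy `score_fn` are hypothetical names, and the length-based scorer is only a stand-in for a real reward model.

```python
from itertools import combinations
from typing import Callable, Dict, List


def build_preference_pairs(
    instruction: str,
    responses: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Dict[str, str]]:
    """Rank candidate responses by reward score and emit (chosen, rejected) pairs.

    score_fn is a placeholder for a reward model: it maps
    (instruction, response) to a scalar score, higher meaning better.
    """
    # Score every response once, then sort from best to worst.
    scored = [(score_fn(instruction, r), r) for r in responses]
    scored.sort(key=lambda item: item[0], reverse=True)

    pairs = []
    # combinations() preserves the sorted order, so the first element of
    # each pair never scores lower than the second.
    for (hi_score, better), (lo_score, worse) in combinations(scored, 2):
        if hi_score > lo_score:  # skip ties: they carry no preference signal
            pairs.append(
                {"prompt": instruction, "chosen": better, "rejected": worse}
            )
    return pairs


if __name__ == "__main__":
    # Toy scorer (assumption): longer answers score higher.
    toy_score = lambda instruction, response: float(len(response))
    for pair in build_preference_pairs("write a slogan", ["a", "bbb", "cc"], toy_score):
        print(pair)
```

In practice the candidate responses would come from the different chat models listed above, one or more per instruction.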
 ### 效果评估 Performance
+写作文案的优劣评价是一个较为主观的评判，很难用一个准确率或者满意度的打分来衡量。因此，我们使用了匿名模型多人Side-by-Side评估的机制，收集了170条不同难度的写作指令数据进行评估，我们后续也会公开这个评测集。
 我们以胜出率作为评价模型好坏的指标，一个模型的胜出率计算公式为：
 Because most language models generate responses by sampling, a win rate above 55% indicates that the model significantly outperforms the other model, a win rate below 45% shows that it clearly lags behind, and a win rate between 45% and 55% means the two models are essentially on par.
+|  Ziya-Writing-13B-v2  | 胜出率       |
+|  :----:  | :----:  |
+| vs Ziya-Writing-LLaMa-13B-v1 | 72.5 |
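
As a worked example of the win-rate metric, the sketch below computes (wins + draws / 2) / total as a percentage and applies the 55% / 45% thresholds described above. The function names and the counts in the usage lines are hypothetical and are not taken from the actual evaluation; `total` is assumed to be the number of side-by-side comparisons.

```python
def win_rate(wins: int, draws: int, total: int) -> float:
    """Win rate in percent: each draw counts as half a win."""
    if total <= 0:
        raise ValueError("total must be positive")
    if wins < 0 or draws < 0 or wins + draws > total:
        raise ValueError("wins and draws must be non-negative and sum to at most total")
    return 100.0 * (wins + draws / 2) / total


def verdict(rate: float) -> str:
    """Interpret a win rate using the 55% / 45% thresholds."""
    if rate > 55:
        return "significantly better"
    if rate < 45:
        return "clearly behind"
    return "on par"


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    rate = win_rate(wins=60, draws=20, total=100)
    print(rate, verdict(rate))
```

Counting a draw as half a win keeps the two models' win rates summing to 100%, so 50% is the natural "no difference" point.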
 ## <span id="jump"> 使用 Usage </span>
 device = torch.device("cuda")
 query="帮我写一份去西安的旅游计划"
+model = LlamaForCausalLM.from_pretrained("IDEA-CCNL/Ziya-Writing-13B-v2", torch_dtype=torch.float16, device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Ziya-Writing-13B-v2", use_fast=False)
 inputs = '<human>:' + query.strip() + '\n<bot>:'
 input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(device)
 generate_ids = model.generate(
             input_ids,
+            max_new_tokens=4096,
             do_sample = True,
             top_p = 0.85,
             temperature = 0.85,
 ```
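
The prompt template used in the snippet above (`<human>:` + query + `\n<bot>:`) can be isolated in a small helper. This is a minimal sketch: `build_prompt` and `extract_reply` are illustrative names, not part of the model repository, and splitting on the last `<bot>:` marker assumes the decoded output echoes the prompt and that the reply itself does not contain that marker.

```python
HUMAN_TAG = "<human>:"
BOT_TAG = "<bot>:"


def build_prompt(query: str) -> str:
    """Wrap a user query in the single-turn template from the usage snippet."""
    return HUMAN_TAG + query.strip() + "\n" + BOT_TAG


def extract_reply(decoded: str) -> str:
    """Return the text after the last bot marker (assumption: prompt is echoed).

    If the marker is absent, the whole decoded string is returned.
    """
    _, _, reply = decoded.rpartition(BOT_TAG)
    return reply.strip()


if __name__ == "__main__":
    prompt = build_prompt("  帮我写一份去西安的旅游计划 ")
    print(repr(prompt))
    print(extract_reply(prompt + " 第一天：参观兵马俑。"))
```

Keeping the template in one place avoids subtle mismatches, such as a missing newline before `<bot>:`, between training-style prompts and inference calls.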
 ## 引用 Citation
 如果您在您的工作中使用了我们的模型，可以引用我们的[论文](https://arxiv.org/abs/2210.08590)：