wenge-research
/

yayi2-30b

Text Generation

Transformers

PyTorch

yayi

custom_code

Model card Files Files and versions Community

wenge-research commited on Dec 14, 2023

Commit

e60ca2e

1 Parent(s): 7af6005

Update README.md

Browse files

Files changed (1) hide show

README.md +24 -151

README.md CHANGED Viewed

@@ -26,8 +26,6 @@ license: apache-2.0
 - [模型地址](#模型地址)
 - [评测结果](#评测结果)
 - [推理](#推理)
-  - [环境安装](#环境安装)
-  - [Base 模型推理代码](#base-模型推理代码)
 - [模型微调](#模型微调)
   - [环境安装](#环境安装-1)
   - [全参训练](#全参训练)
@@ -39,33 +37,35 @@ license: apache-2.0
   - [开源协议](#开源协议)
   - [引用](#引用)
-## 介绍
-YAYI 2 是中科闻歌研发的**新一代开源大语言模型**，包括 Base 和 Chat 版本，参数规模为 30B，并采用了超过 2 万亿 Tokens 的高质量、多语言语料进行预训练。针对通用和特定领域的应用场景，我们采用了百万级指令进行微调，同时借助人类反馈强化学习方法，以更好地使模型与人类价值观对齐。
-本次开源的模型为 YAYI2-30B Base 模型。我们希望通过雅意大模型的开源来促进中文预训练大模型开源社区的发展，并积极为此做出贡献。通过开源，我们与每一位合作伙伴共同构建雅意大模型生态。更多技术细节，敬请期待我们的的技术报告🔥。
-## 模型地址
-| 模型名称  | 上下文长度  | 🤗 HF模型标识 | 下载地址   |
-|:----------|:----------:|:----------:|----------:|
-| YAYI2-30B | 4096    | wenge-research/yayi2-30b| [模型下载](https://huggingface.co/wenge-research/yayi2-30b)|
-## 评测结果
 我们在多个基准数据集上进行了评测，包括 C-Eval、MMLU、 CMMLU、AGIEval、GAOKAO-Bench、GSM8K、MATH、BBH、HumanEval 以及 MBPP。我们考察了模型在语言理解、学科知识、数学推理、逻辑推理以及代码生成方面的表现。YAYI 2 模型在与其规模相近的开源模型中展现出了显著的性能提升。
 <table id="myTable">
   <!-- Table header -->
   <tr>
         <th></th>
-        <th colspan="5" style="text-align: center;">学科知识</th>
-        <th colspan="2" style="text-align: center;">数学</th>
-        <th colspan="1" style="text-align: center;">逻辑推理</th>
-        <th colspan="2" style="text-align: center;">代码</th>
   </tr>
   <tr>
-        <th style="text-align: left;">模型</th>
         <th>C-Eval(val)</th>
         <th>MMLU</th>
         <th>AGIEval</th>
@@ -212,36 +212,10 @@ YAYI 2 是中科闻歌研发的**新一代开源大语言模型**，包括 Base
 我们使用 [OpenCompass Github 仓库](https://github.com/open-compass/opencompass) 提供的源代码进行了评测。对于对比模型，我们列出了他们在 [OpenCompass](https://opencompass.org.cn) 榜单上的评测结果，截止日期为 2023年12月15日。对于其他尚未在 [OpenCompass](https://opencompass.org.cn/leaderboard-llm) 平台参与评测的模型，包括 MPT、Falcon 和 LLaMa 2，我们采用了 [LLaMA 2](https://arxiv.org/abs/2307.09288) 报告的结果。
-## 推理
-我们提供简单的示例来说明如何快速使用 `YAYI2-30B` 进行推理。该示例可在单张 A100/A800 上运行。
-### 环境安装
-1. 克隆本仓库内容到本地环境
-```bash
-git clone https://github.com/wenge-research/YAYI2.git
-cd YAYI2
-```
-2. 创建 conda 虚拟环境
-```bash
-conda create --name yayi_inference_env python=3.10
-conda activate yayi_inference_env
-```
-请注意，本项目需要 Python 3.8 或更高版本。
-3. 安装依赖
-```
-pip install -r requirements.txt
-```
-### Base 模型推理代码
 ```python
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -261,125 +235,23 @@ pip install -r requirements.txt
         )
 >>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
 ```
-当您首次访问时，需要下载并加载模型，可能会花费一些时间。
-## 模型微调
-本项目支持基于分布式训练框架 deepspeed 进行指令微调，配置好环境并执行相应脚本即可启动全参数微调或 LoRA 微调。
-### 环境安装
-1. 创建 conda 虚拟环境：
-```bash
-conda create --name yayi_train_env python=3.10
-conda activate yayi_train_env
-```
-2. 安装依赖：
-```bash
-pip install -r requirements.txt
-```
-3. 安装 accelerate：
-```bash
-pip install --upgrade accelerate
-```
-4. 安装 flashattention：
-```bash
-pip install flash-attn==2.0.3 --no-build-isolation
-pip install triton==2.0.0.dev20221202  --no-deps
-```
-### 全参训练
-* 数据格式：参考 `data/yayi_train_example.json`，是一个标准 JSON 文件，每条数据由 `"system" `和 `"conversations"` 组成，其中 `"system"` 为全局角色设定信息，可为空字符串，`"conversations"` 是由 human 和 yayi 两种角色交替进行的多轮对话内容。
-* 运行说明：运行以下命令即可开始全参数微调雅意模型，该命令支持多机多卡训练，建议使用 16*A100(80G) 或以上硬件配置。
-```bash
-deepspeed --hostfile config/hostfile \
-    --module training.trainer_yayi2 \
-    --report_to "tensorboard" \
-    --data_path "./data/yayi_train_example.json" \
-    --model_name_or_path "your_model_path" \
-    --output_dir "./output" \
-    --model_max_length 2048 \
-    --num_train_epochs 1 \
-    --per_device_train_batch_size 1 \
-    --gradient_accumulation_steps 1 \
-    --evaluation_strategy "no" \
-    --save_strategy "steps" \
-    --save_steps 500 \
-    --save_total_limit 10 \
-    --learning_rate 5e-6 \
-    --warmup_steps 2000 \
-    --lr_scheduler_type cosine \
-    --logging_steps 1 \
-    --gradient_checkpointing True \
-    --deepspeed "./config/deepspeed.json" \
-    --bf16 True
-```
-或者通过命令行启动：
-```bash
-bash scripts/start.sh
-```
-### LoRA 微调
-* 数据格式：同上，参考 data/yayi_train_example_multi_rounds.json。
-* 运行以下命令即可开始 LoRA 微调雅意模型。
-```bash
-bash scripts/start_lora.sh
-```
-## 预训练数据
-* 在预训练阶段，我们除了使用互联网数据训练模型的语言能力，还添加了通用精选数据和领域数据增强模型的专业技能。数据分布情况如下：
-![data distribution](assets/data_distribution.jpg)
-* 我们构建了一套全方位提升数据质量的数据处理流水线，包括标准化、启发式清洗、多级去重、毒性过滤等四个模块。我们共收集了 240TB 原始数据，预处理后仅剩 10.6TB 高质量数据。整体流程如下：
-![data process](assets/data_process.png)
-## 分词器
-* YAYI 2 采用 Byte-Pair Encoding（BPE）作为分词算法，使用 500GB 高质量多语种语料进行训练，包括汉语、英语、法语、俄语等十余种常用语言，词表大小为 81920。
-* 我们对数字进行逐位拆分，以便进行数学相关推理；在词表中手动添加大量 html 标识符和常见标点符号，以提高分词准确性。同时，我们预设了200个保留位，以便未来可能的应用，例如在指令微调阶段添加标识符。由于是字节级别的分词算法，YAYI 2 Tokenizer 可以覆盖未知字符。
-* 我们采样了单条长度为 1万 Tokens 的数据形成评价数据集，涵盖中文、英文和一些常见小语种，并计算了模型的压缩比。
-![Alt text](assets/compression_rate.png)
-* 压缩比越低通常表示分词器具有更高效率的性能。
-## Loss 曲线
-YAYI 2 模型的 loss 曲线见下图：
-![loss](assets/loss.png)
-## 相关协议
-### 开源协议
-本项目中的代码依照 [Apache-2.0](LICENSE) 协议开源，社区使用 YAYI 2 模型和数据需要遵循[雅意YAYI 2 模型社区许可协议](YAYI2_Community_License)。若您需要将雅意 YAYI 2系列模型或其衍生品用作商业用途，请根据[《雅意 YAYI 2 模型商用许可协议》](YAYI2_Commercial_License)将商用许可申请登记信息发送至指定邮箱yayi@wenge.com。审核通过后，雅意将授予您商用版权许可，请遵循协议中的商业许可限制。
-### 引用
 如果您在工作中使用了我们的模型，请引用我们的论文：
 ```
 @article{YAYI 2,
   author    = {Yin Luo, Qingchao Kong, Nan Xu, et.al.}},
@@ -387,3 +259,4 @@ YAYI 2 模型的 loss 曲线见下图：
   journal   = {arXiv preprint arXiv},
   year      = {2023}
 ```

 - [模型地址](#模型地址)
 - [评测结果](#评测结果)
 - [推理](#推理)
 - [模型微调](#模型微调)
   - [环境安装](#环境安装-1)
   - [全参训练](#全参训练)
   - [开源协议](#开源协议)
   - [引用](#引用)
+## 介绍/Introduction
+YAYI 2 是中科闻歌研发的**新一代开源大语言模型**，包括 Base 和 Chat 版本，参数规模为 30B，并采用了超过 2 万亿 Tokens 的高质量、多语言语料进行预训练。针对通用和特定领域的应用场景，我们采用了百万级指令进行微调，同时借助人类反馈强化学习方法，以更好地使模型与人类价值观对齐。本次开源的模型为 YAYI2-30B Base 模型。我们希望通过雅意大模型的开源来促进中文预训练大模型开源社区的发展，并积极为此做出贡献。通过开源，我们与每一位合作伙伴共同构建雅意大模型生态。更多技术细节，敬请期待我们的的技术报告🔥。
+YAYI 2 is the new generation of open-source large language models launched by Wenge Technology. It has been pretrained for 2.65 trillion tokens of multilingual data with high quality. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback (RLHF). We opensource the pre-trained language model in this release, namely **YAYI2-30B**. By open-sourcing the YAYI 2 model, we aim to contribute to the development of the Chinese pre-trained large language model open-source community. Through open-source, we aspire to collaborate with every partner in building the YAYI large language model ecosystem. Stay tuned for more technical details in our upcoming technical report! 🔥
+## 模型地址/Model download
+| Model Name | Context Length  | 🤗 HF Model Name |
+|:----------|:----------:|:----------:|
+| YAYI2-30B | 4096    | wenge-research/yayi2-30b|
+## 评测结果/Evaluation
 我们在多个基准数据集上进行了评测，包括 C-Eval、MMLU、 CMMLU、AGIEval、GAOKAO-Bench、GSM8K、MATH、BBH、HumanEval 以及 MBPP。我们考察了模型在语言理解、学科知识、数学推理、逻辑推理以及代码生成方面的表现。YAYI 2 模型在与其规模相近的开源模型中展现出了显著的性能提升。
+We evaluate our model on standard benchmarks, including C-Eval, MMLU, CMMLU, AGIEval, GAOKAO-Bench, GSM8K, MATH, BBH, HumanEval, and MBPP. Our goal is to assess the model's performance in language comprehension, knowledge comprehension, mathematical reasoning, logical reasoning, and code generation.  YAYI 2 has demonstrated exceptional performance across models with similar size.
 <table id="myTable">
   <!-- Table header -->
   <tr>
         <th></th>
+        <th colspan="5" style="text-align: center;">Knowledge</th>
+        <th colspan="2" style="text-align: center;">Math</th>
+        <th colspan="1" style="text-align: center;">Logic reasonning</th>
+        <th colspan="2" style="text-align: center;">Code</th>
   </tr>
   <tr>
+        <th style="text-align: left;">Model</th>
         <th>C-Eval(val)</th>
         <th>MMLU</th>
         <th>AGIEval</th>
 我们使用 [OpenCompass Github 仓库](https://github.com/open-compass/opencompass) 提供的源代码进行了评测。对于对比模型，我们列出了他们在 [OpenCompass](https://opencompass.org.cn) 榜单上的评测结果，截止日期为 2023年12月15日。对于其他尚未在 [OpenCompass](https://opencompass.org.cn/leaderboard-llm) 平台参与评测的模型，包括 MPT、Falcon 和 LLaMa 2，我们采用了 [LLaMA 2](https://arxiv.org/abs/2307.09288) 报告的结果。
+We evaluate our model using the source code from the [OpenCompass Github repository](https://github.com/open-compass/opencompass). If available, we report results for comparative models assessed by OpenCompass with the evaluation reference date set to Dec. 15th, 2013. For MPT, Falfon, and Llama, which have not been evaluated by OpenCompass, we use the results reported in the [LLaMA 2](https://arxiv.org/abs/2307.09288) paper.
+## 快速开始/Quick Start
 ```python
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer
         )
 >>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
 ```
+## 协议/Liencese
+本项目中的代码依照 [Apache-2.0](LICENSE) 协议开源，社区使用 YAYI 2 模型和数据需要遵循[雅意YAYI 2 模型社区许可协议](YAYI2_Community_License)。若您需要将雅意 YAYI 2系列模型或其衍生品用作商业用途，请根据[《雅意 YAYI 2 模型商用许可协议》](YAYI2_Commercial_License)将商用许可申请登记信息发送至指定邮箱yayi@wenge.com。审核通过后，雅意将授予您商用版权许可，请遵循协议中的商业许可限制。
+The code in this project is open-sourced under the [Apache-2.0](LICENSE) license. The use of YaYi series model weights and data must adhere to the [YAYI 2 Community License](YAYI2_Community_License). If you intend to use the YAYI 2 series models or their derivatives for commercial purposes, please submit your commercial license application and registration information to yayi@wenge.com, following the [YAYI 2 Commercial License](YAYI2_Commercial_License). Upon approval, YAYI will grant you a commercial copyright license, subject to the commercial license restrictions outlined in the agreement.
+## 引用
 如果您在工作中使用了我们的模型，请引用我们的论文：
+If you are using the resource for your work, please cite the our paper:
 ```
 @article{YAYI 2,
   author    = {Yin Luo, Qingchao Kong, Nan Xu, et.al.}},
   journal   = {arXiv preprint arXiv},
   year      = {2023}
 ```