Yuren-13B （羽人13B）

Github

Yuren 13B is a large language model continually trained from Llama 2 13B. It focuses on information synthesis and builds on Pleisto's data-centric work. In data synthesis scenarios such as multilingual information extraction (primarily Chinese and English), natural-language-to-SQL generation, and structured data output, it reaches state-of-the-art results among models of comparable parameter count.

羽人 13B 是在 Llama 2 13B 基础上进行持续训练的大语言模型，聚焦于信息合成领域并建立在 Pleisto 以数据为中心的工作上。该模型在以中英文为主的多种语言的信息抽取、自然语言生成 SQL、结构化数据输出等数据合成类场景下实现了同等参数量下的 SOTA 水平。

Quick Start/快速开始

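from transformers import LlamaTokenizer, LlamaForCausalLM
import torch

# Load the weights in bfloat16. device_map="auto" requires the accelerate package
# and places the model on the available device(s).
model = LlamaForCausalLM.from_pretrained(
    "pleisto/yuren-13b-chatml", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = LlamaTokenizer.from_pretrained("pleisto/yuren-13b-chatml", use_fast=False)

# The system prompt carries the task description and the table schema.
system_prompt = "You are an AI model capable of translating natural language queries into SQL statements. Based on the following table schema and the subsequent user query, generate the appropriate SQL statement.\nTable schema: CREATE TABLE table_name_86 (name VARCHAR, score VARCHAR, song_type VARCHAR)\nScoreTypeEnum: [\"folk\",\"rock\",\"other\"]"
query = "8分以上的民谣有哪些？"  # "Which folk songs are rated above 8?"

# ChatML layout: system turn, user turn, then an open assistant turn for the model to complete.
inputs = f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"

# Send the prompt to the same device as the model instead of hard-coding "cuda".
input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(model.device)
generate_ids = model.generate(
    input_ids,
    max_new_tokens=4096,
    do_sample=True,
    top_p=1.0,
    temperature=0.32,
    eos_token_id=36845,  # stop token id from the original example; generation ends at <|im_end|>
)

# The decoded text contains the prompt followed by the completion.
output = tokenizer.batch_decode(generate_ids)[0]
print(output)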
# <s> <|im_start|> system
# You are an AI model capable of translating natural language queries into SQL statements. Based on the following table schema and the subsequent user query, generate the appropriate SQL statement.
# Table schema: CREATE TABLE table_name_86 (name VARCHAR, score VARCHAR, song_type VARCHAR)
# ScoreTypeEnum: ["folk","rock","other"] <|im_end|> <|im_start|> user
# 8分以上的民谣有哪些？ <|im_end|> <|im_start|> assistant
# SELECT name FROM table_name_86 WHERE score > 8 AND song_type = "folk" <|im_end|>
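
The prompt layout and the decoded output above suggest two small helpers: one that assembles the ChatML prompt and one that pulls only the assistant's reply out of the decoded text. This is a minimal sketch, not part of the Yuren codebase; `build_chatml_prompt` and `extract_assistant_reply` are hypothetical names, and the parsing assumes the decoded text looks like the sample output shown above.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(system_prompt: str, query: str) -> str:
    """Assemble a single-turn ChatML prompt ending with an open assistant turn."""
    return (
        f"{IM_START}system\n{system_prompt}{IM_END}\n"
        f"{IM_START}user\n{query}{IM_END}\n"
        f"{IM_START}assistant\n"
    )


def extract_assistant_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded generation."""
    # Keep everything after the last turn marker, then cut at the end marker.
    tail = decoded.rsplit(IM_START, 1)[-1]
    tail = tail.split(IM_END, 1)[0].strip()
    # Drop the role label that opens the turn (hypothetical helper logic).
    if tail.startswith("assistant"):
        tail = tail[len("assistant"):]
    return tail.strip()


if __name__ == "__main__":
    decoded = (
        "<s> <|im_start|> system\nS <|im_end|> <|im_start|> user\nQ <|im_end|> "
        "<|im_start|> assistant\nSELECT 1 <|im_end|>"
    )
    print(extract_assistant_reply(decoded))
```

Splitting on the last `<|im_start|>` keeps the helper independent of how the tokenizer spaces special tokens when decoding.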

Example Dialogue/示例对话

Text2SQL/自然语言转SQL查询

System Prompt	You are an AI model capable of translating natural language queries into MySQL statements. Based on the following table schema and the subsequent user query, generate the appropriate SQL statement.\nTable schema: CREATE TABLE `comments` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `message_id` int(10) unsigned NOT NULL, `user_id` int(10) unsigned NOT NULL, `text_encrypted` blob NOT NULL, `comment_secret` varchar(255) NOT NULL, `private_to_user` int(10) unsigned DEFAULT NULL, `time_inserted` int(10) unsigned NOT NULL, `deleted` tinyint(1) unsigned NOT NULL DEFAULT '0', PRIMARY KEY (`id`), KEY `selection` (`message_id`,`time_inserted`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `connections` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `from_user` int(10) unsigned NOT NULL, `type` enum('friend','block') NOT NULL, `to_user` int(10) unsigned NOT NULL, `time_inserted` int(10) unsigned NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `combination` (`from_user`,`to_user`), KEY `selection` (`to_user`,`type`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `favorites` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `user_id` int(10) unsigned NOT NULL, `message_id` int(10) unsigned NOT NULL, `degree` int(10) unsigned NOT NULL, `time_added` int(10) unsigned NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `combination` (`user_id`,`message_id`), KEY `selection` (`user_id`,`time_added`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `feeds` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `user_id` int(10) unsigned NOT NULL, `message_id` int(10) unsigned NOT NULL, `degree` int(10) unsigned NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `combination` (`user_id`,`message_id`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `ids_in_threads` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `content_type` enum('message','comment') NOT NULL, `content_id` int(10) unsigned NOT NULL, `private_id` int(10) unsigned NOT NULL, `public_id` int(10) unsigned NOT NULL, 
PRIMARY KEY (`id`), UNIQUE KEY `combination` (`content_type`,`content_id`,`private_id`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `messages` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `user_id` int(10) unsigned NOT NULL, `color_hex` varchar(7) NOT NULL, `pattern_id` int(10) unsigned NOT NULL, `text_encrypted` blob NOT NULL, `message_secret` varchar(255) NOT NULL, `favorites_count` int(10) unsigned NOT NULL DEFAULT '0', `comments_count` int(10) unsigned NOT NULL DEFAULT '0', `time_published` int(10) unsigned NOT NULL, `time_active` int(10) unsigned NOT NULL DEFAULT '2147483647', `language_iso3` varchar(3) DEFAULT NULL, `country_iso3` varchar(3) DEFAULT NULL, `geo_lat` float DEFAULT NULL, `geo_long` float DEFAULT NULL, `topic` enum('','politics','art','business','work','culture','health','science','sports','technology','sex','dating','beauty','books','movies','music','family','food','life','love','confessions','dreams','fantasy','friendship','funny','games','hobbies','money','party','philosophy','quotes','school','stories','studies','travel','meta') NOT NULL DEFAULT '', `score` decimal(8,6) unsigned NOT NULL DEFAULT '0.000000', `dispatched` tinyint(1) unsigned NOT NULL DEFAULT '0', `deleted` tinyint(1) unsigned NOT NULL DEFAULT '0', PRIMARY KEY (`id`), KEY `time_published` (`time_published`), KEY `dispatcher` (`dispatched`,`time_published`), KEY `popular_by_language` (`language_iso3`,`score`), KEY `latest_by_language` (`language_iso3`,`time_published`), KEY `time_active` (`time_active`), KEY `latest_by_location` (`geo_lat`,`geo_long`,`time_published`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `reports` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `user_id` int(10) unsigned NOT NULL, `content_type` enum('message','comment') NOT NULL, `content_id` int(10) unsigned NOT NULL, `reason` int(10) unsigned NOT NULL, `weight` tinyint(3) unsigned NOT NULL DEFAULT '0', `time_reported` int(10) unsigned NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY 
`combination` (`user_id`,`content_type`,`content_id`), KEY `selection_by_user` (`user_id`,`time_reported`), KEY `selection_by_content` (`content_type`,`content_id`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `subscriptions` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `message_id` int(10) unsigned NOT NULL, `user_id` int(10) unsigned NOT NULL, `degree` int(10) unsigned NOT NULL DEFAULT '3', `reasonForBan` tinyint(1) unsigned NOT NULL DEFAULT '0', `counter` tinyint(3) unsigned NOT NULL DEFAULT '0', PRIMARY KEY (`id`), UNIQUE KEY `combination` (`message_id`,`user_id`), KEY `selection` (`user_id`,`counter`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `throttling` ( `username` varchar(255) NOT NULL, `date_str` char(8) NOT NULL, `action_type` enum('failed_login') NOT NULL, `action_count` smallint(8) unsigned NOT NULL DEFAULT '1', PRIMARY KEY (`username`,`date_str`,`action_type`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `users` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `username` varchar(255) NOT NULL, `password` varchar(255) DEFAULT NULL, `reported_count` tinyint(1) unsigned NOT NULL DEFAULT '0', `write_lock_until` int(10) unsigned NOT NULL DEFAULT '0', `login_throttled_until` int(10) unsigned NOT NULL DEFAULT '0', `time_last_active` int(10) unsigned DEFAULT NULL, `time_registered` int(10) unsigned DEFAULT NULL, PRIMARY KEY (`id`), UNIQUE KEY `username` (`username`), KEY `login` (`username`(166),`password`(166)) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE `verifications` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `user_id` int(10) unsigned NOT NULL, `new_password` varchar(255) NOT NULL, `verification_code` varchar(255) NOT NULL, `time_created` int(10) unsigned NOT NULL, `time_until` int(10) unsigned NOT NULL, PRIMARY KEY (`id`), KEY `selection_by_user` (`user_id`,`time_until`), KEY `selection_by_code` (`verification_code`,`time_until`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
User Prompt	最新的一条报告的作者ID 和对应的报告原因是什么？
Yuren13b Output	`SELECT reports.user_id, reports.reason FROM reports ORDER BY reports.time_reported DESC LIMIT 1`
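
Generated SQL like the query above can be smoke-tested before it touches a real database. The sketch below runs it against an in-memory SQLite table. The toy `reports` table and its rows are assumptions made for illustration: only the columns used by the query are kept, and SQLite stands in for MySQL, so MySQL-specific syntax would need a real MySQL instance.

```python
import sqlite3

# The statement produced by the model in the example above.
GENERATED_SQL = (
    "SELECT reports.user_id, reports.reason FROM reports "
    "ORDER BY reports.time_reported DESC LIMIT 1"
)


def run_on_toy_reports(sql: str) -> list:
    """Execute `sql` against a small, made-up `reports` table and return all rows."""
    conn = sqlite3.connect(":memory:")
    try:
        # Simplified stand-in for the MySQL `reports` table (assumption).
        conn.execute(
            "CREATE TABLE reports ("
            "id INTEGER PRIMARY KEY, user_id INTEGER, "
            "reason INTEGER, time_reported INTEGER)"
        )
        conn.executemany(
            "INSERT INTO reports (user_id, reason, time_reported) VALUES (?, ?, ?)",
            [(1, 2, 100), (7, 3, 300), (4, 1, 200)],
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # The row with the largest time_reported is (user_id=7, reason=3).
    print(run_on_toy_reports(GENERATED_SQL))
```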

Structured Data Extraction/结构化数据抽取

System Prompt	Extract meta data from a resume full text. User request according to the following JSON Schema: {"type":"object","properties":{"lawsuit":{"type":"string","description":"Resume full text."}},"required":["lawsuit"]} Translate it into JSON objects according to the following JSON Schema: {"type":"object","properties":{"name":{"type":"string"},"profession":{"type":"string"},"experienceYears":{"type":"number"},"skills":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"proficiency":{"type":"string"}},"required":["name","proficiency"],"additionalProperties":false}},"projectExperiences":{"type":"array","items":{"type":"object","properties":{"title":{"type":"string"},"role":{"type":"string"},"description":{"type":"string"}},"required":["title","role","description"],"additionalProperties":false}},"education":{"type":"object","properties":{"degree":{"type":"string"},"fieldOfStudy":{"type":"string"},"university":{"type":"string"},"year":{"type":"number"}},"required":["degree","fieldOfStudy","university","year"],"additionalProperties":false}},"required":["name","profession","experienceYears","skills","projectExperiences","education"]}
User Prompt	user request:{"lawsuit":["\n我是张三，一名资深的软件工程师，拥有超过七年的前端开发经验。我热衷于构建出色的用户界面，熟练运用HTML、CSS和JavaScript，并精通React、Vue以及Angular等前端框架。我曾参与过多个大型项目，负责设计和实现前端架构，确保网站的高性能和用户友好性。此外，我还具备项目管理的经验，能够带领团队按时交付高质量的成果。\n\n### 项目经历\n\n#### 1. 电商网站重构 (ABC 公司)\n- 负责参与了ABC公司旗下电商网站的重构项目，担任前端技术负责人。\n- 使用React框架重建网站前端，实现了页面响应式设计和动态加载功能，提升了用户体验。\n- 优化前端性能，减少了页面加载时间，提高了网站整体性能。\n- 设计并实施了用户行为跟踪和分析系统，为市场营销团队提供了关键的数据支持。\n\n#### 2. 社交媒体应用开发 (XYZ 创业公司)\n- 领导一个四人的前端开发团队，从零开始开发了一款社交媒体应用。\n- 采用了Vue.js框架和Vuex进行状态管理，实现了实时聊天、帖子发布和用户互动功能。\n- 集成了第三方登录和分享功能，提升了用户注册和活跃度。\n- 成功将应用推向市场，用户数量从零增长到五万以上。\n\n#### 3. 内部管理系统升级 (DEF 企业)\n- 负责升级公司内部管理系统，从传统的后端渲染转变为现代化的前后端分离架构。\n- 使用Angular框架开发新的前端界面，实现了快速的数据加载和交互功能。\n- 利用GraphQL优化了与后端的数据通信，减少了不必要的请求次数，提高了系统效率。\n- 通过培训和文档编写，帮助团队成员顺利过渡到新的技术栈。\n\n### 技能和专业知识\n\n- 前端技术: HTML, CSS, JavaScript, React, Vue, Angular, Redux, GraphQL\n- 前端工具: Webpack, Babel, ESLint\n- 项目管理: Agile, Scrum, Jira\n\n### 教育背景\n\n- 学士学位，计算机科学，北京大学，2012年\n"]} json object:
Yuren13b Output	{ "name":"张三", "profession":"软件工程师", "experienceYears":7, "skills":[ { "name":"HTML", "proficiency":"熟练" }, { "name":"CSS", "proficiency":"熟练" }, { "name":"JavaScript", "proficiency":"熟练" }, { "name":"React", "proficiency":"精通" }, { "name":"Vue", "proficiency":"精通" }, { "name":"Angular", "proficiency":"精通" } ], "projectExperiences":[ { "title":"电商网站重构", "role":"前端技术负责人", "description":"负责参与了ABC公司旗下电商网站的重构项目，担任前端技术负责人。使用React框架重建网站前端，实现了页面响应式设计和动态加载功能，提升了用户体验。优化前端性能，减少了页面加载时间，提高了网站整体性能。设计并实施了用户行为跟踪和分析系统，为市场营销团队提供了关键的数据支持。" }, { "title":"社交媒体应用开发", "role":"领导", "description":"领导一个四人的前端开发团队，从零开始开发了一款社交媒体应用。采用了Vue.js框架和Vuex进行状态管理，实现了实时聊天、帖子发布和用户互动功能。集成了第三方登录和分享功能，提升了用户注册和活跃度。成功将应用推向市场，用户数量从零增长到五万以上。" }, { "title":"内部管理系统升级", "role":"负责人", "description":"负责升级公司内部管理系统，从传统的后端渲染转变为现代化的前后端分离架构。使用Angular框架开发新的前端界面，实现了快速的数据加载和交互功能。利用GraphQL优化了与后端的数据通信，减少了不必要的请求次数，提高了系统效率。通过培训和文档编写，帮助团队成员顺利过渡到新的技术栈。" } ], "education":{ "degree":"学士", "fieldOfStudy":"计算机科学", "university":"北京大学", "year":2012 } }
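
Because the model is asked to follow a JSON Schema, it can be useful to check its output against that schema before using it downstream. The sketch below is a deliberately small checker that covers only the keywords used in the prompt above (`type`, `properties`, `required`, `items`, `additionalProperties`); it is not a full JSON Schema validator, and `find_schema_errors` is a hypothetical helper name.

```python
import json

# JSON Schema type names mapped to Python types (only those used in the example).
_JSON_TYPES = {"object": dict, "array": list, "string": str, "number": (int, float)}


def find_schema_errors(value, schema, path="$"):
    """Return a list of human-readable mismatches between `value` and `schema`."""
    expected = schema.get("type")
    py_type = _JSON_TYPES.get(expected)
    type_ok = py_type is None or isinstance(value, py_type)
    # bool is a subclass of int in Python, but JSON booleans are not numbers.
    if expected == "number" and isinstance(value, bool):
        type_ok = False
    if not type_ok:
        return [f"{path}: expected {expected}"]

    errors = []
    if expected == "object":
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required property")
        if schema.get("additionalProperties") is False:
            for key in value:
                if key not in props:
                    errors.append(f"{path}.{key}: unexpected property")
        for key, sub_schema in props.items():
            if key in value:
                errors.extend(find_schema_errors(value[key], sub_schema, f"{path}.{key}"))
    elif expected == "array" and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(find_schema_errors(item, schema["items"], f"{path}[{index}]"))
    return errors


if __name__ == "__main__":
    schema = {
        "type": "object",
        "properties": {"name": {"type": "string"}, "year": {"type": "number"}},
        "required": ["name", "year"],
    }
    model_output = '{"name": "张三", "year": 2012}'
    print(find_schema_errors(json.loads(model_output), schema))
```

An empty list means the output satisfied the checked keywords; for production use, a complete validator would be the safer choice.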

ReAct

System Prompt	You are helpful AI assistant with functions invocation capability enabled. First I will give you the task description, and your task start. At each step, you need to give your thought to analyze the status now and what to do next, with a function call to actually excute your step. Your output should follow this format: Thought: Action Action Input: After the call, you will get the call result, and you are now in a new state. Then you will analyze your status now, then decide what to do next... After many (Thought-call) pairs, you finally perform the task, then you can give your finial answer. Remember: 1.the state change is irreversible, you can't go back to one of the former state, if you want to restart the task, say "I give up and restart". 2.All the thought is short, at most in 5 sentence. 3.You can do more then one trys, so if your plan is to continusly try some conditions, you can do one of the conditions per try. Let's Begin! Task description: You should use functions to help handle the real time user querys. Remember: 1.ALWAYS call "Finish" function at the end of the task. And the final answer should contain enough information to show to the user,If you can't handle the task, or you find that function calls always fail(the function is not valid now), use function Finish->give_up_and_restart. 2.Do not use origin tool names, use only subfunctions' names. You have access of the following tools: 1.memeados: Generate custom image, gif and video memes. 
Specifically, you have access to the following APIs: [{'name': 'drakelikehate_for_memeados', 'description': 'This is the subfunction for tool "memeados", you can use this tool.The description of this function is: "Generate Drake Likes and Hates meme"', 'parameters': {'type': 'object', 'properties': {'text2': {'type': 'string', 'description': '', 'example_value': 'This text is liked.'}, 'text1': {'type': 'string', 'description': '', 'example_value': 'This text is hated'}}, 'required': ['text2', 'text1'], 'optional': []}}, {'name': 'pet_pet_for_memeados', 'description': 'This is the subfunction for tool "memeados", you can use this tool.The description of this function is: "Generate My pet_pet_for_memeados meme GIF"', 'parameters': {'type': 'object', 'properties': {'image': {'type': 'string', 'description': '', 'example_value': 'https://i.pravatar.cc/300'}}, 'required': ['image'], 'optional': []}}, {'name': 'sponge_bob_for_memeados', 'description': 'This is the subfunction for tool "memeados", you can use this tool.The description of this function is: "Generate Sponge Bob meme"', 'parameters': {'type': 'object', 'properties': {'text': {'type': 'string', 'description': '', 'example_value': 'Example test'}}, 'required': ['text'], 'optional': []}}, {'name': 'google_fake_autocomplete_for_memeados', 'description': 'This is the subfunction for tool "memeados", you can use this tool.The description of this function is: "Generate Fake google autocomplete"', 'parameters': {'type': 'object', 'properties': {'text1': {'type': 'string', 'description': '', 'example_value': 'Search'}, 'text3': {'type': 'string', 'description': '', 'example_value': 'search autocomplete 2'}, 'text2': {'type': 'string', 'description': '', 'example_value': 'search autocomplete 1'}}, 'required': ['text1', 'text3', 'text2'], 'optional': []}}, {'name': 'Finish', 'description': "If you think you get the result which can answer the task, call this function to give the final answer. 
Or, if you think you can't handle the task from this status, call this function to restart. Remember: you should ALWAYS call this function at the end of your try, and the final answer is the ONLY part that will be showed to user, so final answer should contain enough information.", 'parameters': {'type': 'object', 'properties': {'return_type': {'type': 'string', 'enum': ['give_answer', 'give_up_and_restart']}, 'final_answer': {'type': 'string', 'description': 'The final answer you want to give the user. You should have this field if "return_type"=="give_answer"'}}, 'required': ['return_type']}}]
User Prompt	我需要为我的家庭聚会制作一个梗图。你能生成一张海绵宝宝的梗图，上面写着“家庭爱”吗？另外，我想用一张家庭照片作为输入图像，生成一张Pet Pet梗图的GIF。
Yuren13b Output	Thought: Action: sponge_bob_for_memeados Action Input: { "text": "家庭爱" }
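
To act on a ReAct step like the output above, the caller has to split it into the thought, the function name, and the JSON arguments. The following is a minimal sketch that assumes the `Thought: ... Action: ... Action Input: {...}` layout shown in this example; `parse_react_step` is a hypothetical helper, and real outputs may need more tolerant parsing.

```python
import json
import re

# Matches one "Thought / Action / Action Input" step; the thought may be empty.
_STEP = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"Action:\s*(?P<action>\S+)\s*"
    r"Action Input:\s*(?P<args>\{.*\})",
    re.DOTALL,
)


def parse_react_step(text: str):
    """Split one ReAct step into (thought, action_name, arguments_dict)."""
    match = _STEP.search(text)
    if match is None:
        raise ValueError("text does not look like a ReAct step")
    # The action input is expected to be a JSON object.
    return match["thought"], match["action"], json.loads(match["args"])


if __name__ == "__main__":
    step = 'Thought: Action: sponge_bob_for_memeados Action Input: { "text": "家庭爱" }'
    thought, action, args = parse_react_step(step)
    print(action, args)
```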

Other/其他场景

User Prompt	Yuren13b Output
根据以下事件描述，请给出事件所发生的时间（如果没有提及时间则输出无时间） 3月9日，“20豫能化CP001”主承销商农行发布公告称，拟于3月12日召开河南能源化工集团有限公司“20豫能化CP001”2021年度第一次持有人会议，审议《关于同意调整本次债券持有人会议召开程序的议案》和《关于同意发行人先行支付存续期间全部利息、兑付50%本金，剩余本金展期1年，展期期间利率保持不变，到期一次性还本付息，并豁免本期债券违约的议案》。资料显示，“20豫能化CP001”发行总额10亿元，期限1年，债券利率5.50%，主体评级为BB，发行日期为2020年3月19日，到期日为2021年3月23日。	2021年3月12日
请阅读文本，用几句话概括文中的重点。\n并以 {"text": str} 形式的json格式返回结果 9月 25 日，华为举办秋季全场景新品发布会。往年这场发布会上的重头戏都是其旗舰系列的产品更新，而在今年，华为反其道而行之。先是在上个月就推出了Mate 60Pro、Mate 60Pro+和 Mate X5三款机型的“先锋计划”，直接在线上线下渠道面向消费者发售。而这次发布会之前，外界也对其是否真正在发布会上交代手机的具体参数、以及对其如何解决手机中的5G模块倍感关心。不过，这次发布会上，华为并没有展开任何和Mate 60系列手机、芯片相关的任何细节，华为终端BG CEO余承东仅仅透露，目前华为正在对先锋计划的几款手机做加班加点的生产。这次秋季发布会上的亮点是——高端品牌升级、星闪技术落地到产品、鸿蒙升级等等。首先是，华为推出了「非凡大师」系列的全新高端品牌，还邀请了刘德华代言。这一品牌的前身是保时捷高端系列，此前华为与保时捷合作，借鉴了这一品牌车型的设计，讲这些经典范式复用到手机和手表等品类上。华为方面解释，这一全新的系列将主要面向的各领域的精英人士，会主要搭载华为重点的设计工艺和技术创新。为了充实这一全新的系列，华为也发布了首款用纯手工打造的18K黄金手表，支持百米深潜和双向北斗卫星功能等的手表，售价高达 21999 元。	{"text": "华为秋季发布会上，推出了「非凡大师」系列的全新高端品牌，邀请了刘德华代言。同时，华为还推出了一款18K黄金手表，售价高达21999元。"}
给定的实体类型列表是['组织机构', '地理位置', '人物']\n根据实体类型列表抽取，在这个句子中可能包含哪些实体？你可以先别出实体, 再判断实体类型。请以"(实体,实体类型)"的格式回答。\n日前，中科院所属科研院所百余名专家、教授摆摊设点，同淮安市一百多个市属工业、乡镇企业、个体私营企业的二千多人直接洽谈，场景十分火爆。	输入中包含的实体是： (中科院,组织机构),(淮安市,地理位置),(NAN,人物)
已知候选的关系列表：['所在行政领土', '别名']，请你根据关系列表，从以下输入中抽取出可能存在的头实体与尾实体，并给出对应的关系三元组。请按照"(头实体,关系,尾实体)"的格式回答小高镇，原为小高乡，是中华人民共和国四川省凉山彝族自治州德昌县下辖的一个乡镇级行政单位。2019年12月，撤销小高镇，将原小高镇安宁村所属行政区域为德州街道的行政区域，将原小高镇连丰村、联盟村、群英村、红岩村、黄草村、海花村、高丰村、杉木村所属行政区域划归乐跃镇管辖。	输入中包含的关系三元组是： (小高镇,所在行政领土,德昌县),(四川省,所在行政领土,中华人民共和国),(凉山彝族自治州,所在行政领土,四川省),(德昌县,所在行政领土,凉山彝族自治州),(安宁村,所在行政领土,小高镇),(德州街道,所在行政领土,德昌县),(连丰村,所在行政领土,小高镇),(联盟村,所在行政领土,小高镇),(群英村,所在行政领土,小高镇),(红岩村,所在行政领土,小高镇),(黄草村,所在行政领土,小高镇),(海花村,所在行政领土,小高镇),(杉木村,所在行政领土,小高镇),(乐跃镇,所在行政领土,德昌县),(小高镇,别名,小高乡)
抽取安全领域文本的实体信息，结果以字典形式输出，若无信息，则返回空字典。实体类别如下：['职务','飞机','时间','船只','数量','枪炮导弹','组织','机场','港口','方位','人物','航母','国家','地面武器','部队','地点'] 据《印度快报》网站3月18日报道，印度与缅甸近日开展联合行动，摧毁了位于印度东部边境附近的激进武装分子营地。	`{ "时间": [ "3月18日" ], "国家": [ "印度", "缅甸" ], "地点": [ "印度东部边境附近" ] }`
一个社会的进步，尤其是文明社会的进步需要的是有素质的公民，这也是社会对公民的根本要求。“禁止地铁中吃东西”的原委是为了塑造良好的地铁环境卫生，为的是形成一个良好的社会风气。按照制度经济学的观点，如香港地铁的规定乃是一种正式制度安排，因为有详细的惩罚细则。而正式制度的遵守需要非正式制度的约束，即人们的习惯。在这样的基础上才能形成遵守正式制度的好风气。而这样的一种约束或者说制度安排根本上无关乎歧视内地人，因为这是对于全部人们的要求，也非具有针对性的规定。作者认为“香港地铁的规定”是	一种正式制度安排，因为有详细的惩罚细则。
今年“五一”假期铁路运输期限为4月27日至5月4日，共8天。来自中国铁路北京局集团公司(简称“中国铁路北京局”)的消息，4月27日，中国铁路北京局共计发送旅客111.48万人，其中北京地区发送旅客57.5万人(北京站发送7.79万人、北京西站发送16.45万人、北京南站发送18.77万人、北京北站1.58万人、清河站2.17万、北京朝阳站4.89万人、北京丰台站4.57万人)。4月28日，中国铁路北京局预计发送旅客130万人，其中北京地区预计发送旅客68.54万人。中国铁路北京局结合假期运输特点，全力提升旅客乘车的舒适度。各次高铁列车针对互联网订餐订单量激增的实际情况，合理安排配送人员，做好列车配送服务，保障好出行旅客的网络订餐。运行中结合旅客实际需求，从列车环境、服务、用餐等方面提供多元化服务，部分列车还配有儿童读物和拼图，满足不同旅客的需求。北京站在候车大厅西侧明显位置，设置“医疗服务”室，配备诊疗床、AED除颤仪等基础急救设备，为旅客提供寻医问药服务。北京西站针对年纪太大无法久坐、手术后无法坐立的旅客提供可躺式座椅。北京南站4月27日至5月4日全面开启地下换乘层京沪高铁、京津城际4个快速进站厅，每日开关时间根据现场客流情况动态延长调整。北京北站启用地上地下进站口，“德馨”服务组推出便民服务车，关口前移，提升旅客出行体验。清河站全面启用南落客和西进站口，全面满足地铁、公交、出租车、私家车等多种交通方式接驳的旅客乘车。 4月27日北京哪个火车站发送旅客人次最多？北京铁路局采取哪些措施提升服务水平？	北京火车站发送旅客人次最多的是北京站，共计发送旅客7.79万人。北京铁路局采取了多种措施提升服务水平，包括：1. 提供多元化服务，如配置儿童读物和拼图；2. 设置医疗服务室，配备基础急救设备；3. 启用地上地下进站口，增加进站效率；4. 提供便民服务车，关口前移；5. 全面启用清河站的南落客和西进站口，满足多种交通方式接驳的需求。
目的：探讨阑尾周边高回声结构的性质，评价其在急性阑尾炎超声诊断中的临床意义。翻译成英文	AIM: To evaluate the ultrasound features of the high-echoic ring around the cecal tip in patients with acute appendicitis.
我是中国的一家跨境电商公司，主要出售便携式咖啡机，请以沙特的斋月用主题写一篇英文播客的稿件，目的是让客户感到我对他的节日的尊重，让他觉得我是了解他们的节日，提升对我的好感度根据赞美	Welcome to our special Ramadan edition of our podcast! As a Chinese cross-border e-commerce company, we are honored to be a part of the Ramadan season, a special time for our Muslim friends around the world. Ramadan is a religious observance in Islam, lasting one month, during which Muslims fast from dawn until sunset. It is a time of spiritual discipline, sacrifice, and sacrifice. It is also a time of community and togetherness, as families and friends gather to break fast together at sunset. At our company, we understand the importance of respecting and celebrating religious holidays. That's why we have curated a selection of portable coffee machines that are perfect for breaking fast during the Ramadan season. Our portable coffee machines are small, lightweight, and easy to carry, making them perfect for traveling or for use in a hotel room or other temporary living space. They are also easy to use, with a simple one-touch brewing system that allows you to make a fresh, delicious cup of coffee in just a few seconds. We understand that during the Ramadan season, it is important to have a steady supply of coffee to help you stay awake and alert during the day. Our portable coffee machines are designed to be easy to use and to provide a consistent, delicious cup of coffee every time. We hope that our portable coffee machines will help you enjoy the Ramadan season to the fullest, and that they will be a part of your daily routine during this special time. Thank you for choosing our company, and happy Ramadan!
这是一条语料，存在需要清洗去掉的脏数据。请给我清洗后的干净的文本。hashtag 也删掉。 #头条创作挑战赛##旅游##我要上头条##今日头条日签##城市#在<a>我国，有许多省份都有一个别 称。比如山东被称为齐鲁大地、河南称为中原之地等等。而今天我们要说的这个地方也有个别名——赣鄱明珠！它就是位于江西省东北部的上饶市。那么接下来就让我们一起走进这座美丽的江南小城吧~一.地理位置与地形概况1.地理位置上饶地处武夷山脉和怀玉山脉之间，是长江三角洲经济区的重要组成部分。Copyright (c) 2020 XXX	在我国，有许多省份都有一个别称。比如山东被称为齐鲁大地、河南称为中原之地等等。而今天我们要说的这个地方也有个别名——赣鄱明珠！它就是位于江西省东北部的上饶市。那么接下来就让我们一起走进这座美丽的江南小城吧！一.地理位置与地形概况1.地理位置上饶地处武夷山脉和怀玉山脉之间，是长江三角洲经济区的重要组成部分。
痛风发作的时候应该冷敷还热敷	痛风发作时，应该采取冷敷的方式来缓解疼痛。冷敷可以减轻疼痛和炎症，并有助于控制疼痛的持续时间。在疼痛发作时，应尽量避免热敷，因为热敷可能会加重疼痛。
过去两年便携式储能站上了风口，如今行业赛场延伸到了家庭储能（也称户用储能）。行业头部厂商华宝新能、正浩、安克创新等均在近一年内进军家庭储能市场。\n 以 json 数组返回提及的公司名字	`[ "华宝新能", "正浩", "安克创新" ]`
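The rows above list raw prompts and raw outputs. As a minimal sketch, a prompt can be wrapped in the same ChatML layout that the Quick Start snippet builds by hand, and a JSON-style answer can be parsed defensively. The helper names `build_chatml_prompt` and `parse_json_answer` are hypothetical, and treating the system turn as optional is an assumption rather than documented behaviour of the model.

```python
import json

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(user_query, system_prompt=""):
    """Wrap a query in the ChatML layout used by the Quick Start snippet."""
    parts = []
    # Assumption: the system turn is simply left out when no system prompt is given.
    if system_prompt:
        parts.append(f"{IM_START}system\n{system_prompt}{IM_END}\n")
    parts.append(f"{IM_START}user\n{user_query}{IM_END}\n")
    # The prompt ends with an open assistant turn for the model to complete.
    parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


def parse_json_answer(raw_output):
    """Parse a JSON answer; return None when the text is not valid JSON."""
    try:
        return json.loads(raw_output.strip())
    except json.JSONDecodeError:
        return None


prompt = build_chatml_prompt("q", "s")
companies = parse_json_answer('[ "华宝新能", "正浩", "安克创新" ]')
print(prompt)
print(companies)
```

Returning `None` on a parse failure lets a caller retry or fall back instead of crashing, which is useful because sampled generations are not guaranteed to be valid JSON.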

Details of model training/模型训练细节

Extending vocabulary/词表扩充

In the original Llama vocabulary, only a few hundred Chinese characters were included; every other Chinese character had to be generated by concatenating multiple UTF-8 byte tokens. This not only slows down Chinese inference (generation speed) but also creates a significant bottleneck in Chinese semantic understanding.
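The cost of byte-level fallback can be illustrated with a toy character-level tokenizer. This is only a sketch: the real Llama tokenizer is a BPE model with merges, and the function name and the tiny vocabularies below are made up for illustration.

```python
def count_tokens_with_byte_fallback(text, vocab):
    """Count tokens for a toy character-level tokenizer with UTF-8 byte fallback.

    A character found in `vocab` costs one token; any other character is
    emitted as one token per UTF-8 byte.
    """
    total = 0
    for ch in text:
        if ch in vocab:
            total += 1
        else:
            total += len(ch.encode("utf-8"))
    return total


text = "羽人模型"
small_vocab = {"人"}          # hypothetical: only one of the four characters is known
expanded_vocab = set(text)    # hypothetical: all four characters are known

before = count_tokens_with_byte_fallback(text, small_vocab)
after = count_tokens_with_byte_fallback(text, expanded_vocab)
print(before, after)  # 10 4
```

Each of the three missing characters takes three UTF-8 bytes, so the same four-character string needs 10 tokens before expansion and 4 after, which is where the generation-speed difference comes from.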

We conducted a series of comparative experiments on different vocabulary expansion approaches and found the following:

Compared to the prevailing strategy of adding a large number of common Chinese words to the vocabulary, adding only individual Chinese characters achieves better semantic understanding with less pretraining data. In our experiments, existing BPE-based tokenizers almost inevitably produced token boundaries that are hard to align with the true semantics when segmenting Chinese text, because Chinese word segmentation is inherently ambiguous. Increasing the model's parameter count and the scale and diversity of the training data can teach the model to interpret wrongly segmented tokens correctly during pretraining, but this understanding always comes at additional cost.
The perturbation of the original token embedding distribution grows in proportion to the number of tokens added during vocabulary expansion, so the fewer new tokens are added, the less the semantics of existing tokens are disturbed.

Therefore, we conservatively expanded the vocabulary by 4843 tokens. Specifically, these comprise all the level-1 characters and a subset of the level-2 and level-3 characters from the "General Standard Chinese Character Table" (《通用规范汉字表》) published by the National Language Commission in 2013, plus a number of commonly used Chinese punctuation marks. The subset was selected by computing character frequencies over Pleisto's proprietary Chinese corpus, with the aim of covering as many commonly used characters as possible, including those common in science and technology and in personal and place names.
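The frequency-based selection of the character subset might look roughly like the sketch below. The function name, the tiny corpus, the candidate set, and the cut-off are all hypothetical; the actual corpus and selection criteria are not published.

```python
from collections import Counter


def select_frequent_characters(corpus, candidates, existing_vocab, max_new_tokens):
    """Pick the most frequent candidate characters not yet in the vocabulary."""
    # Count only characters from the candidate table (e.g. level-2/3 characters).
    counts = Counter(ch for line in corpus for ch in line if ch in candidates)
    # most_common() sorts by descending frequency; skip characters already covered.
    ranked = [ch for ch, _ in counts.most_common() if ch not in existing_vocab]
    return ranked[:max_new_tokens]


corpus = ["淮安市位于江苏省", "上饶市位于江西省"]   # toy corpus
candidates = set("淮安市位于江苏省上饶西")          # toy candidate table
existing_vocab = {"市"}                             # toy: already in the vocabulary

new_tokens = select_frequent_characters(corpus, candidates, existing_vocab, max_new_tokens=4)
print(new_tokens)
```

With a Hugging Face tokenizer, one common way to register such characters is `tokenizer.add_tokens(new_tokens)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether the authors used exactly this route is not stated.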

原始 Llama 词表中仅含有几百个汉字，其余汉字均需要以多个 UTF-8 字节形式拼接生成。这一问题除了显而易见地导致中文推理性能（生成速度）受到影响之外，还在很大程度上造成了模型在中文语义理解上的性能瓶颈。

我们对于不同的词表扩充方案进行了一系列对比实验并发现：

相较于目前主流的在词表中加入大量的常用汉字词语的策略而言，仅在词表中添加汉字字符的方案可以在更少的预训练数据规模下实现更佳的语义理解性能。我们在实验中发现，现有的基于 BPE 的分词器进行中文分词时由于分词本身存在的歧义性几乎必然导致生成的 Token 分割难以与真实语义进行对齐。尽管通过提升模型参数量、增加训练数据的规模和多样性，可以让模型本身在预训练过程中拥有正确理解被错误分割的 token 的能力，但这种理解始终是有额外成本的。
扩充词表时新增的 token 的数量和对于原始词向量的分布的扰动始终成正比，因此新增的 token 越少对于已有 token 的语义扰动的影响就会越少。

鉴于此我们较为保守地扩充了 4843 个 Token，具体而言包括国家语委在 2013 年发布的《通用规范汉字表》中的全部一级汉字、二三级汉字的一个子集（该子集通过使用 Pleisto 自有的中文语料进行汉字常用字字频统计后得出以期最大可能地覆盖包括科学技术领域常用字、人名地名常用字在内的所有常用汉字）、汉语中较常使用的一部分标点符号。

Training an embedding layer/词向量嵌入层的训练

For cost reasons, mainstream vocabulary-expansion approaches usually skip separate pretraining of the embedding layer and instead rely on updates to the embedding layer during continuous pretraining to align the new embeddings. Our research, however, found the following:

Initializing the newly added embeddings purely at random and deferring semantic alignment to the continuous pretraining phase can leave the model unable to effectively learn part of the pretraining data. This issue becomes particularly evident when the pretraining data consists of carefully curated high-quality datasets.
Freezing the other layers and training only the embedding layer using diverse datasets, especially those containing multilingual parallel corpora, different from the ones used in continuous pretraining, helps enhance the model's semantic understanding capabilities. This approach proves beneficial, particularly when the pretraining phase employs small-scale high-quality datasets. Pretraining the embedding layer with a more diverse and unfiltered dataset before continuous pretraining improves the model's semantic understanding and resilience.

Therefore, considering these findings, we conducted one epoch of pretraining on the embedding layer using a diverse corpus of 760 million tokens while keeping the other layers frozen. The training was performed with a global batch size of 128, and the training loss decreased from 5.907 before training to 3.429.
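The idea of training only the embedding layer while every other layer stays frozen can be sketched in PyTorch as follows. `TinyLM` is a toy stand-in, not the real model; the `embed_tokens` and `lm_head` names follow the naming used by the Hugging Face Llama implementation, and the optimizer and learning rate are assumptions, since the source does not state them for this stage.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Toy stand-in for a causal LM: embedding -> linear block -> output head."""

    def __init__(self, vocab_size=32, hidden=8):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.block = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)

    def forward(self, input_ids):
        return self.lm_head(torch.tanh(self.block(self.embed_tokens(input_ids))))


def freeze_all_but_embeddings(model):
    """Leave only the input embedding trainable; return the trainable names."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("embed_tokens")
    return [name for name, p in model.named_parameters() if p.requires_grad]


model = TinyLM()
trainable = freeze_all_but_embeddings(model)
# Assumption: AdamW and lr=1e-3 are placeholders for this toy example only.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# One next-token-prediction step on random token ids.
input_ids = torch.randint(0, 32, (4, 6))
logits = model(input_ids[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 32), input_ids[:, 1:].reshape(-1)
)
block_before = model.block.weight.detach().clone()
loss.backward()
optimizer.step()
print(trainable)
```

Gradients still flow through the frozen layers back to the embedding, but the frozen weights receive no gradient and are not updated.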

尽管目前主流的词表扩充方案基于成本考虑通常不再单独针对词向量嵌入层进行预先训练，而是依靠在进行持续预训练时对于词向量嵌入层的更新来实现词向量的对齐。但我们的研究发现：

完全将新增词向量进行随机初始化，而后在持续预训练阶段进行语义对齐的方案会导致一部分预训练数据难以被模型真正学到，该问题在预训练数据是经过精心清洗的高质量数据集的情况下尤其明显。
冻结其他层，使用不同于持续预训练阶段的多样性数据集（尤其是包含多语平行语料的数据集）仅针对词向量嵌入层进行训练有助于提升模型的语义理解能力。尤其是当预训练阶段使用高质量小规模数据集的情况下，使用更具多样性的、未经人工过滤的数据集预先训练词向量嵌入层有助于提升模型的语义理解能力和抗毒性能力。

鉴于此我们在冻结其他层的情况下，使用了 760M Token 的多样性语料进行了 1 个 Epoch 的词向量嵌入层预训练。训练中使用了 128 的全局 Batch Size，train/loss 从训练前的 5.907 降低到 3.429。

Pre-Training/预训练

We performed continuous pretraining on the model using a high-quality corpus of 2.45 billion tokens. The English portion of the corpus was derived from a carefully curated diverse subset of the Falcon RefinedWeb dataset, while the code data came from a specific subset of the bigcode/the-stack dataset. The Chinese portion of the corpus consisted of a specific subset of the mc4 dataset, a curated subset of Chinese Wikipedia, and Pleisto's proprietary collection of public books and papers. During the data preprocessing stage, we employed a series of heuristic methods to clean and deduplicate the data, and used an in-house proprietary model to score the quality of the corpus and align the diversity distribution.

Furthermore, our experiments revealed that the order of the training data significantly impacts the final model performance. Therefore, we sorted the training data using a heuristic algorithm based on the principle of "easy first, difficult later," rather than employing a random shuffling strategy.
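An "easy first, difficult later" ordering amounts to sorting by a difficulty score instead of shuffling. The sketch below uses a deliberately crude, hypothetical proxy (word count plus mean word length); the heuristic actually used by the authors is not described.

```python
def toy_difficulty(text):
    """Hypothetical difficulty proxy: word count plus mean word length."""
    words = text.split()
    if not words:
        return 0.0
    mean_word_length = sum(len(w) for w in words) / len(words)
    return len(words) + mean_word_length


def order_easy_to_hard(samples, difficulty=toy_difficulty):
    """Return samples sorted from lowest to highest difficulty (stable for ties)."""
    return sorted(samples, key=difficulty)


samples = [
    "Gradient perturbation destabilizes pretrained embeddings considerably.",
    "Hi.",
    "The cat sat.",
]
ordered = order_easy_to_hard(samples)
print(ordered)
```

Any scoring function with the same signature can be passed as `difficulty`, for example a score produced by a quality or readability model.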

During continuous pretraining, we used a sequence length of 4096 and a global batch size of 128. We used the 32-bit Lion optimizer with a constant learning rate of 7×10⁻⁵. Due to hardware resource limitations, we trained with LoRA at rank 64. In addition to all the linear layers, we also trained the embed_token and lm_head layers. Unlike QLoRA, however, we trained in fp16 precision.
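As a back-of-the-envelope check, these hyperparameters imply the following step count and LoRA adapter size. Several things here are assumptions: that the corpus is seen exactly once with fully packed sequences, that "all the linear layers" means the seven projection matrices in each Llama decoder block, and that the adapters use the standard two-matrix LoRA factorization. The layer dimensions are those of Llama 2 13B.

```python
import math

TOTAL_TOKENS = 2_450_000_000   # 2.45B-token corpus
SEQ_LEN = 4096
GLOBAL_BATCH = 128

tokens_per_step = SEQ_LEN * GLOBAL_BATCH
# Assumption: a single pass over the corpus with fully packed sequences.
steps_one_epoch = math.ceil(TOTAL_TOKENS / tokens_per_step)

# Llama 2 13B dimensions.
HIDDEN, INTERMEDIATE, LAYERS, RANK = 5120, 13824, 40, 64


def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumption: q/k/v/o projections plus gate/up/down projections are adapted.
per_layer = (
    4 * lora_params(HIDDEN, HIDDEN, RANK)
    + 3 * lora_params(HIDDEN, INTERMEDIATE, RANK)
)
total_lora_params = per_layer * LAYERS

print(tokens_per_step, steps_one_epoch, total_lora_params)
```

Under these assumptions one step consumes 524,288 tokens, a single pass takes roughly 4.7k optimizer steps, and the adapters add about 250M parameters, not counting the separately trained embedding and output layers.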

我们仅使用了 2.45B 的高质量语料对模型进行了持续预训练，其中英文部分语料来自Falcon RefinedWeb 数据集的一个经过精心策划的多样性子集、代码语料来自 bigcode/the-stack 数据集的一个特定子集，中文部分语料则由 mc4 的一个特定子集、中文维基百科精选子集和 Pleisto 自有的公版书籍与论文数据集共同组成。在数据预处理阶段，我们采用了一系列启发式的方法来针对数据进行清洗和去重并使用自有的闭源模型对于语料质量进行打分评估和多样性分布对齐。

此外我们的实验发现训练数据的顺序也会对最终模型性能造成明显影响，因此我们以「先易后难」的原则使用启发式算法对于训练数据进行了排序而没有采用随机洗牌的策略。

我们在 4096 的序列长度下，使用 128 的全局 Batch Size 进行了持续预训练并使用了 32bit Lion 优化器和 7×10⁻⁵ 的恒定学习率。由于硬件资源上的限制，该阶段的训练采用了 LoRA 进行，rank 为 64，除了全部的线性层外还额外训练了 embed_token 和 lm_head 层。但不同于 QLoRA，我们采用了 fp16 精度进行训练。

Supervised Fine-tuning/有监督微调

We conducted supervised fine-tuning in two stages. In the first stage, we trained the model for one epoch on a more diverse dataset of 1.8 million samples. A large portion of this dataset is an Orca-style instruction dataset (inspired by the Microsoft Orca paper) built from a subset of flan2021 and COIG. It also includes subsets of the following publicly available datasets:

GSM-8k
OpenAssistant/oasst1
b-mc2/sql-create-context
flan2021
niv0
COIG
TheoremQA

In the second stage, we performed an additional two epochs of training using a highly curated subset of 500,000 samples that underwent multiple verification steps.

我们的有监督微调分 2 个阶段进行，首先使用了一个更具多样性的含有 180 万条数据的数据集训练了 1 个 epoch。该数据集的很大一部分是由基于 flan2021 和 COIG 的一个子集所构建的 Orca 风格指令数据集组成（受微软 Orca 论文的启发）。此外还涵盖了如下的公开数据集的子集：

GSM-8k
OpenAssistant/oasst1
b-mc2/sql-create-context
flan2021
niv0
COIG
TheoremQA

在第二阶段我们使用了一个由 50 万条经过多重校验的高质量子集进行了额外的 2 个 epoch 的训练。

Benchmark Evaluation/性能评测

GSM8k

Model	Score
Llama2-13b	28.7
YuRen-13b	34.42
Llama1-30b	35.6
ChatGLM2-6b	28.05
Baichuan 13b - Chat	26.6
InternLM 7b	31.2
GPT-3.5	57.1

AGIEval-English

Model	Avg/平均	AquA-RAT	LogiQA-en	LSAT-AR	LSAT-LR	LSAT-RC	SAT-en	SAT-en(w/o Psg.)	SAT-math
Llama2-13b	39.1	21.7	38.1	23.0	41.0	54.6	62.1	46.1	27.3
YuRen-13b	39.6	26.77	37.33	24.35	36.86	48.7	69.42	46.6	26.82
Llama1-30b	41.7	18.9	37.3	18.7	48.0	59.5	74.8	44.7	35
GPT-3.5	57.1	31.3	43.5	25.7	59.2	67.7	81.1	53.9	40.9

C-Eval 中文能力

Model	Avg/平均	Avg/平均（Hard）	STEM	社会科学	人文科学	其他
Llama2-13b	39.1	21.7	38.1	23.0	41.0	54.6
YuRen-13b	40.4	28.2	36.9	48.8	40.7	38.9
Llama1-30b	41.7	18.9	37.3	18.7	48.0	59.5
GPT-3.5	57.1	31.3	43.5	25.7	59.2	67.7
Baichuan-13B-Chat	51.5	/	43.5	64.6	56.2	49.2

Limitations and Biases/局限性

The YuRen 13B model is primarily designed for the field of information synthesis, including building intelligent agents, natural language understanding, generating SQL, and other business scenarios, rather than for directly providing services to the public. We strongly recommend applying this model to internal data processing scenarios within enterprises, rather than public environments.

While we have made every effort to ensure the compliance of the data used during the model training process, unforeseen issues may arise due to the complexity of the model and data. We disclaim any responsibility for any problems caused by the use of the YuRen 13B open-source model, including but not limited to data security issues, public opinion risks, or any risks and issues arising from the model being misled, abused, disseminated, or improperly utilized.

We strongly advise implementing additional security measures when using the model, such as filtering, reviewing, or restricting the inputs and outputs of the model, to prevent harm to users. Technological development should take place in a regulated and lawful environment, and we hope that all users uphold this principle. We will continue to improve the training and use of the model to enhance its security and effectiveness.

羽人 13B 模型主要设计用于信息合成领域，包括构建智能代理、自然语言理解、生成SQL等业务场景，而并非直接用于向公众提供服务。我们强烈建议将此模型应用于企业内部的数据处理场景，而不是公开环境。

虽然我们已经尽可能确保模型训练过程中的数据合规性，但由于模型和数据的复杂性，可能存在无法预见的问题。如果由于使用羽人 13B 开源模型而导致的任何问题，包括但不限于数据安全问题、公共舆论风险，或模型被误导、滥用、传播或不当利用所带来的任何风险和问题，我们不承担任何责任。

我们强烈建议在使用模型时采用额外的安全措施，如对模型的输入输出进行过滤、审查或限制，以免对用户造成伤害。科技的发展应在规范和合法的环境下进行，我们希望所有使用者都能秉持这一原则。我们将持续改进模型的训练和使用，以提升其安全性和有效性。

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	55.39
AI2 Reasoning Challenge (25-Shot)	53.07
HellaSwag (10-Shot)	78.03
MMLU (5-Shot)	56.34
TruthfulQA (0-shot)	42.32
Winogrande (5-shot)	74.43
GSM8k (5-shot)	28.13
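The reported average is consistent with an unweighted mean of the six benchmark scores, which can be checked directly from the values in the table:

```python
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 53.07,
    "HellaSwag (10-Shot)": 78.03,
    "MMLU (5-Shot)": 56.34,
    "TruthfulQA (0-shot)": 42.32,
    "Winogrande (5-shot)": 74.43,
    "GSM8k (5-shot)": 28.13,
}

# Unweighted mean over the six benchmarks.
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 55.39
```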
