
Announcement

Due to a combination of factors, we have lost confidence in training open-source LLMs and sharing our results with the community. These factors include, but are not limited to:

  • Unreasonable community feedback: We have encountered negativity and unrealistic expectations from certain segments of the community, which has been discouraging.
  • Persistent technical gap with OpenAI: While we have achieved some progress in specific narrow domains, the overall technical gap with OpenAI's top models remains significant and may even be widening. We believe the community's perception of our progress is overly optimistic.
  • Challenges in GPU resource integration: We have encountered issues with resource allocation and scheduling conflicts with other projects, which have extended our development cycles.

Despite our cautious and somewhat pessimistic outlook on the field, we are still willing to release and share some of our model weights with the community and showcase our progress in specific narrow domains.

Important Disclaimer:

The scores presented here are for reference only. Any victories in narrow domains are conditional, bounded by factors such as scaling laws and data availability, and we cannot yet compete with OpenAI's top models in overall performance. These scores therefore do not represent our models' comprehensive capabilities and should be read only as analytical indicators. Please avoid over-interpreting the benchmarks commonly used in the community: our experiments have shown how fragile they are and how limited their ability is to comprehensively assess model capability. This further highlights the significant gap between our models and OpenAI's top models.

Additional Information:

  • Our models were not trained on any test sets.
  • The training data includes some web-crawled data, rewritten and synthesized using GPT-4-32K and GPT-3.5-16K.
  • We observed that common web-crawled datasets do not actively avoid or filter out questions from popular benchmarks. We have only avoided verbatim repetition of these questions. We currently lack the capability for further filtering based on semantics.
  • Contamination detection results indicate that our models are free of benchmark contamination.

CausalLM-34b-β2

This model is not based on CausalLM-34b-β; it was trained on a different dataset composition. The two versions are therefore peers, and both were candidates for final release. We encourage experimentation with various model merging approaches, such as the sketch below.
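As one illustration, a naive linear merge of the two checkpoints can be sketched with plain transformers. This is a hedged sketch under assumptions, not an endorsed recipe: it assumes the sibling repository is named CausalLM/34b-beta, picks an arbitrary 50/50 weighting, and requires enough CPU RAM to hold both models at once.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed repo name for the beta sibling; adjust as needed.
BETA = "CausalLM/34b-beta"
BETA2 = "CausalLM/34b-beta2"

# Load both checkpoints on CPU in bf16 (requires substantial RAM).
model_a = AutoModelForCausalLM.from_pretrained(BETA, torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained(BETA2, torch_dtype=torch.bfloat16)

# Average every parameter tensor with an arbitrary 50/50 weighting.
sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
merged = {name: 0.5 * sd_a[name] + 0.5 * sd_b[name] for name in sd_a}

model_a.load_state_dict(merged)
model_a.save_pretrained("34b-beta-linear-merge")
```

Dedicated tools such as mergekit offer more principled methods (slerp, task arithmetic); the 50/50 average above is only the simplest baseline.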

Training Date:

March 8, 2024

Internal Codename:

M-16 (the 16th finetune of yi-34b)

Theoretical Context Length: 200K (config.json ships with an 8K default to prevent OOM; modify it to unlock the full window, as sketched below)
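A minimal sketch of extending the window at load time, assuming the relevant config.json field is max_position_embeddings (standard for Llama-architecture models such as Yi-34B). Raise it only if you have the memory for long contexts.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Load the shipped config (8K default) and raise the context window.
config = AutoConfig.from_pretrained("CausalLM/34b-beta2")
config.max_position_embeddings = 200_000  # assumed field name; 8192 by default

model = AutoModelForCausalLM.from_pretrained(
    "CausalLM/34b-beta2",
    config=config,
    torch_dtype="auto",
    device_map="auto",  # requires accelerate
)
```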

Chat Template:

chatml (Note: techniques similar to OpenChat's C-RLFT were used, and training did not specifically target general task-oriented systems. Outputs may be suboptimal without the "You are a helpful assistant." system prompt or a blank system prompt; see the sketch below.)
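For reference, a minimal sketch of building a prompt in the recommended form. It assumes the repository's tokenizer ships a chatml chat template; if not, the comment at the end shows the intended raw layout.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CausalLM/34b-beta2")

messages = [
    # The recommended system prompt (or use "" for a blank one).
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize scaling laws in one sentence."},
]

# add_generation_prompt=True appends the opening <|im_start|>assistant tag.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Expected chatml layout:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Summarize scaling laws in one sentence.<|im_end|>
# <|im_start|>assistant
```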

Special Note:

This model is sensitive to numerical precision, and quantization may cause significant performance degradation. If you use calibration-based quantization, avoid wiki-text as the calibration set; an alternative is sketched below.
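A hedged sketch of calibration-based quantization that follows the note above by calibrating on c4 rather than a wiki-text set. The parameter choices are illustrative, and 4-bit may still noticeably degrade this precision-sensitive model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "CausalLM/34b-beta2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibrate on c4 rather than a wiki-text set, per the note above.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
```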

Lm-evaluation-harness Reference (using open_llm_leaderboard parameters, no chat template):

| Task       | Score |
|------------|-------|
| ARC        | 68.3  |
| HellaSwag  | 83.6  |
| MMLU       | 84    |
| TruthfulQA | 54    |
| Winogrande | 80.4  |
| GSM8K      | 60.4  |
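A sketch of reproducing one row with the harness's Python API, assuming lm-eval 0.4+. Note that the Open LLM Leaderboard uses a different few-shot count per task (e.g. 25-shot for ARC), so each task is run separately.

```python
import lm_eval

# One leaderboard-style run; repeat per task with its own few-shot count.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=CausalLM/34b-beta2,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,  # leaderboard setting for ARC; other tasks differ
)
print(results["results"]["arc_challenge"])
```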

MT-Bench Reference:

First Turn

| Model              | Turn | Score   |
|--------------------|------|---------|
| gpt-4-0125-preview | 1    | 9.16250 |
| gpt-4              | 1    | 8.95625 |
| M-16               | 1    | 8.82500 |

Second Turn

| Model              | Turn | Score    |
|--------------------|------|----------|
| gpt-4              | 2    | 9.02500  |
| gpt-4-0125-preview | 2    | 8.93750  |
| M-16               | 2    | 8.556962 |

Average Score

| Model              | Score    |
|--------------------|----------|
| gpt-4-0125-preview | 9.05000  |
| gpt-4              | 8.990625 |
| M-16               | 8.691824 |



