JosephusCheung committed on
Commit 26a251b · 1 Parent(s): 1494f15

Update README.md

Files changed (1): README.md (+22, -1)
README.md CHANGED
@@ -34,6 +34,8 @@ tags:
 
 *Image drawn by GPT-4 DALL·E 3* TL;DR: Perhaps better than all existing models < 70B, in most quantitative evaluations...
 
+# CausalLM 14B
+
 **llama.cpp GGUF models**
 GPT2Tokenizer fixed by [Kerfuffle](https://github.com/KerfuffleV2) in [https://github.com/ggerganov/llama.cpp/pull/3743](https://github.com/ggerganov/llama.cpp/pull/3743); new models are now reuploaded.
 
@@ -93,9 +95,19 @@ Hard ACC:54.71
 | ------------ | -------- | -------------- | ------ | ----------- | ------- | ------- | --------- | ---------- |
 | causallm-14b | **88.26087** | 1.116333 | 705 | 89 | 11 | 805 | community | 1391 |
 
-
 Win rate **88.26%** on [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) [view raw](https://github.com/tatsu-lab/alpaca_eval/blob/3a47dcd81c56f6a8e6a5711f2754013919fbe90a/results/causallm-14b/model_outputs.json)
 
+## Other languages
+We are currently unable to produce accurate benchmark templates for non-QA tasks (languages other than English and Chinese). However, we will be working on other-language versions of the QA-Task challenge in the near future.
+### Japanese Benchmark
+| Task | Version | Metric | Value | | Stderr |
+|----------------------|------:|--------|-----:|---|-----:|
+| jcommonsenseqa-1.1-0.6 | 1.1 | acc | 0.8213 | ± | 0.0115 |
+
+*The jcommonsenseqa benchmark result is very close to [Japanese Stable LM Gamma 7B (83.47)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable), the current SOTA Japanese LM, even though our model was not trained on a particularly large amount of Japanese text. This seems to reflect the cross-language transferability of metalinguistic ability.*
+
+# Description in Chinese
+
 **llama.cpp GGUF models**
 GPT2Tokenizer support fixed by [Kerfuffle](https://github.com/KerfuffleV2) in [https://github.com/ggerganov/llama.cpp/pull/3743](https://github.com/ggerganov/llama.cpp/pull/3743); the new models will be uploaded shortly.
 
@@ -155,3 +167,12 @@ STEM accuracy: 66.71
 | causallm-14b | **88.26087** | 1.116333 | 705 | 89 | 11 | 805 | community | 1391 |
 
 Win rate **88.26%** on [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) [view raw](https://github.com/tatsu-lab/alpaca_eval/blob/3a47dcd81c56f6a8e6a5711f2754013919fbe90a/results/causallm-14b/model_outputs.json)
+
+## Other languages
+We are currently unable to produce accurate benchmark templates for non-QA tasks (languages other than English and Chinese). However, we will be working on other-language versions of the QA-Task challenge in the near future.
+### Japanese Benchmark
+| Task | Version | Metric | Value | | Stderr |
+|----------------------|------:|--------|-----:|---|-----:|
+| jcommonsenseqa-1.1-0.6 | 1.1 | acc | 0.8213 | ± | 0.0115 |
+
+*The jcommonsenseqa benchmark result is very close to [Japanese Stable LM Gamma 7B (83.47)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable), the current SOTA Japanese LM, even though our model was not trained on a particularly large amount of Japanese text. This seems to reflect the cross-language transferability of metalinguistic ability.*
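
The AlpacaEval row above can be sanity-checked from its own raw counts: with 705 wins and 11 draws out of 805 comparisons, the reported 88.26087 falls out exactly if each draw is scored as half a win (AlpacaEval's usual tie-handling convention; the helper name below is ours, not part of any evaluation toolkit):

```python
def win_rate(n_wins: int, n_draws: int, n_total: int) -> float:
    """Win rate in percent, counting each draw as half a win."""
    return 100.0 * (n_wins + 0.5 * n_draws) / n_total

# Raw counts from the causallm-14b leaderboard row above.
rate = win_rate(n_wins=705, n_draws=11, n_total=805)
print(round(rate, 5))  # 88.26087
```

Note that a plain wins-over-total ratio (705 / 805 ≈ 87.58%) does not reproduce the table, which is why the draw term matters.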