JosephusCheung committed on
Commit 26a251b · 1 Parent(s): 1494f15

Update README.md

Files changed (1): README.md (+22, -1)
README.md CHANGED
@@ -34,6 +34,8 @@ tags:
 
 *Image drawn by GPT-4 DALL·E 3* TL;DR: Perhaps better than all existing models < 70B, in most quantitative evaluations...
 
+# CausalLM 14B
+
 **llama.cpp GGUF models**
 GPT2Tokenizer fixed by [Kerfuffle](https://github.com/KerfuffleV2) in [https://github.com/ggerganov/llama.cpp/pull/3743](https://github.com/ggerganov/llama.cpp/pull/3743); new models are now reuploaded.
 
@@ -93,9 +95,19 @@ Hard ACC:54.71
 | ------------ | -------- | -------------- | ------ | ----------- | ------- | ------- | --------- | ---------- |
 | causallm-14b | **88.26087** | 1.116333 | 705 | 89 | 11 | 805 | community | 1391 |
 
-
 Win rate **88.26%** on [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) [view raw](https://github.com/tatsu-lab/alpaca_eval/blob/3a47dcd81c56f6a8e6a5711f2754013919fbe90a/results/causallm-14b/model_outputs.json)
 
+## Other languages
+We are currently unable to produce accurate benchmark templates for non-QA tasks (languages other than English and Chinese). However, we will be working on other-language versions of the QA-Task challenge in the near future.
+### Japanese Benchmark
+| Task | Version | Metric | Value | | Stderr |
+|----------------------|------:|--------|-----:|---|-----:|
+| jcommonsenseqa-1.1-0.6 | 1.1 | acc | 0.8213 | ± | 0.0115 |
+
+*The jcommonsenseqa benchmark result is very close to [Japanese Stable LM Gamma 7B (83.47)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable), the current SOTA Japanese LM, even though our model was not trained on a particularly large amount of Japanese text. This seems to reflect the cross-language transferability of metalinguistic ability.*
+
+# Description in Chinese
+
 **llama.cpp GGUF models**
 GPT2Tokenizer support fixed by [Kerfuffle](https://github.com/KerfuffleV2) in [https://github.com/ggerganov/llama.cpp/pull/3743](https://github.com/ggerganov/llama.cpp/pull/3743); the new models will be uploaded shortly.
 
@@ -155,3 +167,12 @@ STEM accuracy: 66.71
 | causallm-14b | **88.26087** | 1.116333 | 705 | 89 | 11 | 805 | community | 1391 |
 
 Win rate **88.26%** on [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) [view raw](https://github.com/tatsu-lab/alpaca_eval/blob/3a47dcd81c56f6a8e6a5711f2754013919fbe90a/results/causallm-14b/model_outputs.json)
+
+## Other languages
+We are currently unable to produce accurate benchmark templates for non-QA tasks (languages other than English and Chinese). However, we will be working on other-language versions of the QA-Task challenge in the near future.
+### Japanese Benchmark
+| Task | Version | Metric | Value | | Stderr |
+|----------------------|------:|--------|-----:|---|-----:|
+| jcommonsenseqa-1.1-0.6 | 1.1 | acc | 0.8213 | ± | 0.0115 |
+
+*The jcommonsenseqa benchmark result is very close to [Japanese Stable LM Gamma 7B (83.47)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable), the current SOTA Japanese LM, even though our model was not trained on a particularly large amount of Japanese text. This seems to reflect the cross-language transferability of metalinguistic ability.*
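
The AlpacaEval row above can be sanity-checked from its own raw counts: with 705 wins and 11 draws out of 805 comparisons, the reported 88.26087 falls out exactly if each draw is scored as half a win (AlpacaEval's usual tie-handling convention; the helper name below is ours, not part of any evaluation toolkit):

```python
def win_rate(n_wins: int, n_draws: int, n_total: int) -> float:
    """Win rate in percent, counting each draw as half a win."""
    return 100.0 * (n_wins + 0.5 * n_draws) / n_total

# Raw counts from the causallm-14b leaderboard row above.
rate = win_rate(n_wins=705, n_draws=11, n_total=805)
print(round(rate, 5))  # 88.26087
```

Note that a plain wins-over-total ratio (705 / 805 ≈ 87.58%) does not reproduce the table, which is why the draw term matters.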