zolicsaki committed on
Commit
3ad0882
1 Parent(s): f363ea4

Update README.md

  1. README.md +13 -0
README.md CHANGED
@@ -51,6 +51,19 @@ All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonl
  ## Tokenizer Details
  We extended the vocabulary of the base llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
 
+ ## Evaluation
+ || SambaLingo-Japanese-Base | ELYZA-japanese-Llama-2-7b-7b | bloom-7b1 | xglm-7.5B | mGPT-13B |
+ |------------------------------|------------------------------|-----------|-----------|----------|--------|
+ | Perplexity (Lower Is Better) | 1.559 | 1.754 | 2.216 | 1.775 | 2.349 |
+ | FLORES en->ja (8 shot, CHRF) | 0.281 | 0.250 | 0.056 | 0.156 | 0.111 |
+ | FLORES ja->en (8 shot, CHRF) | 0.495 | 0.436 | 0.262 | 0.369 | 0.297 |
+ | FLORES en->ja (8 shot, BLEU) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
+ | FLORES ja->en (8 shot, BLEU) | 0.184 | 0.144 | 0.043 | 0.084 | 0.036 |
+ | Belebele (3 shot) | 36.56% | 53.67% | 26.67% | 24.00% | 22.89% |
+ | SIB-200 (3 shot) | 68.63% | 74.02% | 60.29% | 60.78% | 41.18% |
+ | PAWS-X | 46.80% | 50.50% | 45.40% | 51.95% | 45.20% |
+ | XWinograd (0 shot) | 76.64% | 77.58% | 58.92% | 64.96% | 57.77% |
+
  ## Uses
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
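
The vocabulary extension described under Tokenizer Details (32,000 base tokens plus up to 25,000 non-overlapping new tokens, for 57,000 in total) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extend_vocab`, the dict-based vocabulary, and the assumption that candidates arrive already ranked by priority are all hypothetical and chosen only for illustration.

```python
def extend_vocab(base_vocab, candidates, max_new=25_000):
    """Append up to max_new candidate tokens that are not already in base_vocab.

    base_vocab: dict mapping token -> id (ids assumed contiguous from 0).
    candidates: iterable of tokens, assumed to be ordered by priority
                (for example, frequency in the new language's corpus).
    """
    extended = dict(base_vocab)   # copy so the base vocabulary is untouched
    next_id = len(extended)       # new ids continue after the base ids
    added = 0
    for token in candidates:
        if added >= max_new:
            break
        if token in extended:
            continue              # skip overlaps with the base vocab and repeated candidates
        extended[token] = next_id
        next_id += 1
        added += 1
    return extended


# Toy example: "the" overlaps with the base vocabulary and is skipped,
# and the cap of 2 new tokens stops the loop before "です" is added.
base = {"<s>": 0, "</s>": 1, "the": 2, "a": 3}
new_tokens = ["日本", "the", "語", "です", "日本"]
vocab = extend_vocab(base, new_tokens, max_new=2)
print(vocab)
```

Because overlapping tokens are skipped rather than counted, the cap applies only to tokens that are actually added, which is consistent with the "up to 25,000" wording: the final size is at most the base size plus the cap.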