zolicsaki committed on
Commit 92a460d
1 Parent(s): e446f78

Update README.md

Files changed (1)
  1. README.md +10 -19
README.md CHANGED
@@ -27,7 +27,8 @@ SambaLingo-Japanese-Base is a pretrained Bi-lingual Japanese and English model t
 - **Model type:** Language Model
 - **Language(s):** Japanese, English
 - **Finetuned from model:** [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf)
-- **Blog Post**: Will be released soon!
+- **Paper:** [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
+- **Blog Post**: [SambaLingo Open Source Language Experts](https://sambanova.ai/blog/sambalingo-open-source-language-experts)
 
 ## Getting Started
 
@@ -52,17 +53,7 @@ All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonl
 We extended the vocabulary of the base llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
 
 ## Evaluation
-|| SambaLingo-Japanese-Base | ELYZA-japanese-Llama-2-7b-7b | bloom-7b1 | xglm-7.5B | mGPT-13B |
-|------------------------------|------------------------------|-----------|-----------|----------|--------|
-| Perplexity (Lower Is Better) | **1.559** | 1.754 | 2.216 | 1.775 | 2.349 |
-| FLORES en->ja (8 shot, CHRF) | **0.281** | 0.250 | 0.056 | 0.156 | 0.111 |
-| FLORES ja->en (8 shot, CHRF) | **0.495** | 0.436 | 0.262 | 0.369 | 0.297 |
-| FLORES en->ja (8 shot, BLEU) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
-| FLORES ja->en (8 shot, BLEU) | **0.184** | 0.144 | 0.043 | 0.084 | 0.036 |
-| Belebele (3 shot) | 36.56% | **53.67%** | 26.67% | 24.00% | 22.89% |
-| SIB-200 (3 shot) | 68.63% | **74.02%** | 60.29% | 60.78% | 41.18% |
-| PAWS-X | 46.80% | 50.50% | 45.40% | **51.95%** | 45.20% |
-| XWinograd (0 shot) | 76.64% | **77.58%** | 58.92% | 64.96% | 57.77% |
+For evaluation results see our paper: [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
 
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
@@ -105,12 +96,12 @@ We would like to give a special thanks to the following groups:
 
 ## Cite SambaLingo
 ```
-@software{sambalingo,
-title = {{SambaLingo: Open Source Language Experts}},
-author = {SambaNova Systems},
-url = {https://huggingface.co/sambanovasystems/SambaLingo-Japanese-Base}
-month = {2},
-year = {2024},
-version = {1.0},
+@misc{csaki2024sambalingo,
+title={SambaLingo: Teaching Large Language Models New Languages},
+author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
+year={2024},
+eprint={2404.05829},
+archivePrefix={arXiv},
+primaryClass={cs.CL}
 }
 ```
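The README's vocabulary-extension step (32,000 → 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language) can be sketched in plain Python. This is a minimal illustration of the merge logic only, not the actual SambaLingo code; the token strings and counts are made up for the example:

```python
def extend_vocab(base_vocab, new_tokens, max_new=25_000):
    """Append tokens from the new language that do not already
    appear in the base vocabulary, up to max_new additions."""
    extended = dict(base_vocab)  # token -> id mapping, copied
    next_id = max(extended.values()) + 1 if extended else 0
    added = 0
    for tok in new_tokens:
        if added >= max_new:
            break
        if tok not in extended:  # keep only non-overlapping tokens
            extended[tok] = next_id
            next_id += 1
            added += 1
    return extended

# Toy illustration: the two tokens already in the base vocab are skipped.
base = {"the": 0, "cat": 1, "sat": 2}
candidates = ["猫", "the", "座った", "cat", "犬"]
merged = extend_vocab(base, candidates)
print(len(merged))  # 6: 3 base tokens + 3 non-overlapping additions
```

In a real setup the new embedding rows for the added tokens would also need to be initialized and the model's embedding matrix resized to match the extended vocabulary.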