zolicsaki committed
Commit c07ab44
1 Parent(s): 5d50fcb

Update README.md

Files changed (1)
  1. README.md +9 -16
README.md CHANGED
@@ -28,6 +28,7 @@ SambaLingo-Hungarian-Base is a pretrained Bi-lingual Hungarian and English model
  - **Language(s):** Hungarian, English
  - **Finetuned from model:** [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf)
  - **Try the chat version of this model**: [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space).
+ - **Paper:** [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
  - **Blog Post**: [sambalingo-open-source-language-experts](https://sambanova.ai/blog/sambalingo-open-source-language-experts)

  ## Getting Started
@@ -53,15 +54,7 @@ All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonl
  We extended the vocabulary of the base llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.

  ## Evaluation
- | | SambaLingo-Hungarian-Base | PULI-GPTrio | bloom-7b1 | xglm-7.5B | mGPT-13B |
- |------------------------------|---------------------------|-------------|-----------|-----------|----------|
- | Perplexity (Lower Is Better) | **1.614**  | 1.720  | 3.284  | 3.687  | 2.488  |
- | FLORES en->hu (8 shot, CHRF) | **0.496**  | 0.422  | 0.113  | 0.019  | 0.200  |
- | FLORES hu->en (8 shot, CHRF) | **0.558**  | 0.459  | 0.162  | 0.146  | 0.174  |
- | FLORES en->hu (8 shot, BLEU) | **0.164**  | 0.098  | 0.002  | 0.001  | 0.010  |
- | FLORES hu->en (8 shot, BLEU) | **0.261**  | 0.154  | 0.004  | 0.004  | 0.009  |
- | Belebele (3 shot)            | **41.78%** | 27.67% | 26.78% | 24.00% | 24.56% |
- | SIB-200 (3 shot)             | **57.35%** | 51.96% | 35.78% | 41.18% | 41.18% |
+ For evaluation results see our paper: [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)


  ## Uses
@@ -103,12 +96,12 @@ We would like to give a special thanks to the following groups:

  ## Cite SambaLingo
  ```
- @software{sambalingo,
- title = {{SambaLingo: Open Source Language Experts}},
- author = {SambaNova Systems},
- url = {https://huggingface.co/sambanovasystems/SambaLingo-Hungarian-Base}
- month = {2},
- year = {2024},
- version = {1.0},
+ @misc{csaki2024sambalingo,
+       title={SambaLingo: Teaching Large Language Models New Languages},
+       author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
+       year={2024},
+       eprint={2404.05829},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
  }
  ```
 
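The diff leaves unchanged the model card's note that the base Llama 2 vocabulary was extended from 32,000 to 57,000 tokens by adding non-overlapping tokens from the new language. As a minimal sketch of that general technique, assuming the Hugging Face transformers API and a hypothetical list of new Hungarian tokens (this is not SambaNova's actual training pipeline), extending a tokenizer and resizing the model's embeddings could look like this:

```python
# Illustrative sketch only: extend a Llama 2 tokenizer with new-language tokens
# and resize the model's embedding matrices to match the larger vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)

# Hypothetical example tokens; in practice these would come from a tokenizer
# trained on the target-language corpus, keeping only tokens that do not
# already exist in the base vocabulary (the "non-overlapping" tokens).
new_hungarian_tokens = ["szeretnék", "beszélni", "magyarul"]
num_added = tokenizer.add_tokens(new_hungarian_tokens)  # skips tokens already present

# Grow the input/output embedding matrices so the new token IDs have rows;
# these new embeddings only become meaningful after further training.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} new tokens; vocabulary size is now {len(tokenizer)}")
```

After resizing, the new embedding rows carry no useful signal on their own; they are learned during the continued pretraining on target-language data (the CulturaX Hungarian split referenced in the README).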