Text Generation
Transformers
Safetensors
Thai
English
llama
conversational
Eval Results
Inference Endpoints
text-generation-inference
File size: 12,440 Bytes
fbde074
eb0c27e
 
 
fbde074
f6bed44
 
 
 
eb0c27e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
fbde074
f6bed44
 
 
 
 
 
 
 
bc0d7f3
f6bed44
 
 
 
 
 
 
e1ae3a8
bc0d7f3
ef1cf3f
fbe817b
f6bed44
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
91bb51b
f6bed44
 
 
393276a
91bb51b
393276a
91bb51b
b129131
393276a
f6bed44
 
 
 
 
 
7c22052
f6bed44
 
 
 
 
ef1cf3f
 
 
f6bed44
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
06c4df9
f6bed44
 
 
 
 
 
 
 
 
 
 
ef1cf3f
 
 
 
 
 
 
f6bed44
 
eb0c27e
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
---
language:
- th
- en
license: llama2
datasets:
- HuggingFaceH4/ultrachat_200k
- HuggingFaceH4/ultrafeedback_binarized
- HuggingFaceH4/cai-conversation-harmless
model-index:
- name: SambaLingo-Thai-Chat
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 52.73
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sambanovasystems/SambaLingo-Thai-Chat
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 78.42
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sambanovasystems/SambaLingo-Thai-Chat
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 43.95
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sambanovasystems/SambaLingo-Thai-Chat
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 40.84
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sambanovasystems/SambaLingo-Thai-Chat
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 72.22
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sambanovasystems/SambaLingo-Thai-Chat
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 8.57
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sambanovasystems/SambaLingo-Thai-Chat
      name: Open LLM Leaderboard
---



# SambaLingo-Thai-Chat

<img src="SambaLingo_Logo.png" width="340" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

<!-- Provide a quick summary of what the model is/does. -->
SambaLingo-Thai-Chat is a human aligned chat model trained in Thai and English. It is trained using direct preference optimization on top the base model [SambaLingo-Thai-Base](https://huggingface.co/sambanovasystems/SambaLingo-Thai-Base). The base model adapts [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) to Thai by training on 38 billion tokens from the Thai split of the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset. Try This Model at [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space).

## Model Description
<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [SambaNova Systems](https://sambanova.ai/)
- **Model type:** Language Model
- **Language(s):** Thai, English
- **Finetuned from model:** [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- **Try This Model:** [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space)
- **Paper:** [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
- **Blog Post**: [sambalingo-open-source-language-experts](https://sambanova.ai/blog/sambalingo-open-source-language-experts)

## Getting Started

### Loading Model With Hugging Face
Please make sure to set use_fast=False when loading the tokenizer.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Thai-Chat", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Thai-Chat", device_map="auto", torch_dtype="auto")
```

### Interacting With Model Pipeline
Please make sure to set use_fast=False when loading the tokenizer.
```python
from transformers import pipeline
pipe = pipeline("text-generation", model="sambanovasystems/SambaLingo-Thai-Chat", device_map="auto", use_fast=False)
messages = [
                {"role": "user", "content": {YOUR_QUESTION}},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt)[0]
outputs = outputs["generated_text"]
```

### Suggested Inference Parameters
- Temperature: 0.8
- Repetition penalty: 1.0
- Top-p: 0.9

### Prompting Guidelines
To prompt this model, please use the following chat template:
```
<|user|>\n{question}</s>\n<|assistant|>\n
```

### Example Prompts and Generations
```
<|user|>
ปรัชญาทางเศรษฐกิจที่พระบาทสมเด็จพระมหาภูมิพลอดุลยเดชมหาราช บรมนาถบพิตรมีพระราชดำรัสแก่ชาวไทย มีชื่อว่าอะไร</s>
<|assistant|>
ปรัชญาทางเศรษฐกิจที่พระบาทสมเด็จพระปรมินทรมหาภูมิพลอดุลยเดชทรงมีพระราชดำรัสแก่ชาวไทยเรียกว่า "ปรัชญาเศรษฐกิจพอเพียง" ปรัชญานี้เน้นความสำคัญของการพัฒนาแบบองค์รวมแ    ละความสมดุล โดยคำนึงถึงสภาพแวดล้อมทางสังคม เศรษฐกิจ และสิ่งแวดล้อม ปรัชญาเศรษฐกิจพอเพียงเน้นความสำคัญของการพึ่งพาตนเอง ความมีเหตุผล และการประหยัด ตลอดจนความสำคัญของก    ารพัฒนาจิตใจและจิตวิญญาณ ปรัชญานี้เน้นย้ำถึงความสำคัญของการดำเนินชีวิตอย่างพอประมาณ มีเหตุผล และมีคุณธรรม ตลอดจนความสำคัญของการพัฒนาความรู้และทักษะอย่างต่อเนื่อง ปรัชญาเศรษฐกิ    จพอเพียงได้รับการยกย่องอย่างกว้างขวางว่าเป็นแนวทางแก้ไขปัญหาการพัฒนาที่ยั่งยืนและเป็นแนวทางในการดำเนินชีวิตที่นำไปสู่ความสุขและความเป็นอยู่ที่ดี
```

## Training Details
The alignment phase follows the recipe for [Zephyr-7B](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), and comprises two stages: supervised fine-tuning (SFT) and Direct Performance Optimization (DPO).

The SFT phase was done on the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset mixed with the Google translated version of the ultrachat_200k dataset. It was trained for one epoch with global batch size 512 and max sequence length 2048 tokens. We used a linear decay learning rate of 2e-5 and 10% warmup.

The DPO phase was done on the [ultrafeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset and [cai-conversation-harmless](https://huggingface.co/datasets/HuggingFaceH4/cai-conversation-harmless) dataset, mixed with 10% of the data Google translated. It was trained with global batch size 32 and for three epochs. We used a linear decay learning rate of 5e-7, 10% warmup and β=0.1 as the regularization factor for DPO. 


## Tokenizer Details
We extended the vocabulary of the base llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.

## Evaluation 
For evaluation results see our paper: [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)

## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Use of this model is governed by the Meta’s [Llama 2 Community License Agreement](https://ai.meta.com/llama/license/). Please review and accept the license before downloading the model weights.


### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
SambaLingo should NOT be used for:

- Mission-critical applications
- Applications that involve the safety of others
- Making highly important decisions

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Like all LLMs, SambaLingo has certain limitations:
- Hallucination: Model may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
- Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
- Repetition: The Model may produce repetitive phrases or sentences, leading to less engaging and informative responses.
- Coding and Math: The model's performance in generating accurate code or solving complex mathematical problems may be limited.
- Toxicity: The model could inadvertently generate responses containing inappropriate or harmful content.

## Acknowledgments
We extend our heartfelt gratitude to the open-source AI community; this endeavor would not have been possible without open source. SambaNova embraces the open-source community and aspires to actively contribute to this initiative.

We would like to give a special thanks to the following groups:
- Meta for open sourcing LLama 2 and open sourcing FLORES-200 dataset
- Nguyen et al for open sourcing CulturaX dataset
- CohereAI for releasing AYA-101 and open sourcing a multilingual instruction tuning dataset
- EleutherAI for their open source evaluation framework
- Hugging Face-H4 team for open source the zephyr training recipe and alignment handbook repo


## Cite SambaLingo
```
@misc{csaki2024sambalingo,
      title={SambaLingo: Teaching Large Language Models New Languages}, 
      author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
      year={2024},
      eprint={2404.05829},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_sambanovasystems__SambaLingo-Thai-Chat)

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |49.45|
|AI2 Reasoning Challenge (25-Shot)|52.73|
|HellaSwag (10-Shot)              |78.42|
|MMLU (5-Shot)                    |43.95|
|TruthfulQA (0-shot)              |40.84|
|Winogrande (5-shot)              |72.22|
|GSM8k (5-shot)                   | 8.57|