---
license: apache-2.0
language:
- th
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- openthaigpt
- llama
---

# 🇹🇭 OpenThaiGPT 7b 1.0.0
<img src="https://1173516064-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvvbWvIIe82Iv1yHaDBC5%2Fuploads%2Fb8eiMDaqiEQL6ahbAY0h%2Fimage.png?alt=media&token=6fce78fd-2cca-4c0a-9648-bd5518e644ce" width="200px">

🇹🇭 OpenThaiGPT 7b Version 1.0.0-beta is a 7B-parameter Thai-language LLaMA v2 Chat model fine-tuned to follow Thai instructions. Its vocabulary is extended with more than 10,000 of the most popular Thai words, which speeds up generation.

## Features
- Multi-turn Conversation Support
- Retrieval Augmented Generation (RAG) Support
- State-of-the-art Thai language LLM: achieves a 38.40% average score on 17 Thai exams, the highest among the open-source 7B LLMs in the benchmark below.

## Benchmark
| **Exams**                        | **OTG 7b (Aug 2023)** | **OTG 13b (Dec 2023)** | **OTG 7b (March 2024)** | **OTG 13b (March 2024)** | **OTG 70b (March 2024)** | **SeaLLM 7b v1** | **SeaLLM 7b v2** | **TyphoonGPT 7b** | **SeaLion 7b** | **WanchanGLM 7b** | **Sailor-7B-Chat** | **GPT3.5** | **GPT4** | **Gemini Pro** | **Gemini 1.5** | **Claude 3 Haiku** | **Claude 3 Sonnet** | **Claude 3 Opus** |
|----------------------------------|-----------------------|------------------------|-------------------------|--------------------------|--------------------------|------------------|------------------|--------------------|----------------|-------------------|--------------------|------------|----------|----------------|----------------|--------------------|---------------------|-------------------|
| **A-Level**                      | 17.50%                | 34.17%                 | 25.00%                  | 30.83%                   | 45.83%                   | 18.33%           | 34.17%           | N/A                | 21.67%         | 17.50%            | 40.00%             | 38.33%     | 65.83%   | 56.67%         | 55.83%         | 58.33%             | 59.17%              | 77.50%            |
| **TGAT**                         | 24.00%                | 22.00%                 | 22.00%                  | 36.00%                   | 36.00%                   | 14.00%           | 28.00%           | N/A                | 24.00%         | 16.00%            | 34.00%             | 28.00%     | 44.00%   | 22.00%         | 28.00%         | 36.00%             | 34.00%              | 46.00%            |
| **TPAT1**                        | 22.50%                | 47.50%                 | 42.50%                  | 27.50%                   | 62.50%                   | 22.50%           | 27.50%           | N/A                | 22.50%         | 17.50%            | 40.00%             | 45.00%     | 52.50%   | 52.50%         | 50.00%         | 52.50%             | 50.00%              | 62.50%            |
| **ic_all_test**                  | 8.00%                 | 28.00%                 | 76.00%                  | 84.00%                   | 68.00%                   | 16.00%           | 28.00%           | N/A                | 24.00%         | 16.00%            | 24.00%             | 40.00%     | 64.00%   | 52.00%         | 32.00%         | 44.00%             | 64.00%              | 72.00%            |
| **facebook_beleble_tha**         | 25.00%                | 45.00%                 | 34.50%                  | 39.50%                   | 70.00%                   | 13.50%           | 51.00%           | N/A                | 27.00%         | 24.50%            | 63.00%             | 50.00%     | 72.50%   | 65.00%         | 74.00%         | 63.50%             | 77.00%              | 90.00%            |
| **xcopa_th_200**                 | 45.00%                | 56.50%                 | 49.50%                  | 51.50%                   | 74.50%                   | 26.50%           | 47.00%           | N/A                | 51.50%         | 48.50%            | 68.50%             | 64.00%     | 82.00%   | 68.00%         | 74.00%         | 64.00%             | 80.00%              | 86.00%            |
| **xnli2.0_tha**                  | 33.50%                | 34.50%                 | 39.50%                  | 31.00%                   | 47.00%                   | 21.00%           | 43.00%           | N/A                | 37.50%         | 33.50%            | 16.00%             | 50.00%     | 69.00%   | 53.00%         | 54.50%         | 50.00%             | 68.00%              | 68.50%            |
| **ONET M3** | 17.85%                | 38.86%                 | 34.11%                  | 39.36%                   | 56.15%                   | 15.58%           | 23.92%           | N/A                | 21.79%         | 19.56%            | 21.37%             | 37.91%     | 49.97%   | 55.99%         | 57.41%         | 52.73%             | 40.60%              | 63.87%            |
| **ONET M6** | 21.14%                | 28.87%                 | 22.53%                  | 23.32%                   | 42.85%                   | 15.09%           | 19.48%           | N/A                | 16.96%         | 20.67%            | 28.64%             | 34.44%     | 46.29%   | 45.53%         | 50.23%         | 34.79%             | 38.49%              | 48.56%            |
| **Average Score**                | 23.83%                | 37.27%                 | 38.40%                  | 40.33%                   | 55.87%                   | 18.06%           | 33.56%           | N/A                | 27.44%         | 23.75%            | 37.28%             | 43.07%     | 60.68%   | 52.30%         | 52.89%         | 50.65%             | 56.81%              | 68.32%            |
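
The **Average Score** row appears to be the unweighted mean of the nine rows above it. The sketch below checks this for the OTG 7b (March 2024) column; the dictionary name is only illustrative, and the values are copied from the table.

```python
# Per-exam scores (%) for the "OTG 7b (March 2024)" column, copied from the table.
otg_7b_march_2024 = {
    "A-Level": 25.00,
    "TGAT": 22.00,
    "TPAT1": 42.50,
    "ic_all_test": 76.00,
    "facebook_beleble_tha": 34.50,
    "xcopa_th_200": 49.50,
    "xnli2.0_tha": 39.50,
    "ONET M3": 34.11,
    "ONET M6": 22.53,
}

# Unweighted mean over the nine rows (assumption: every row counts equally).
average = sum(otg_7b_march_2024.values()) / len(otg_7b_march_2024)
print(f"{average:.2f}%")  # prints 38.40%
```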

## Licenses
**Source Code**: Apache Software License 2.0.<br>
**Weights**: Research and **commercial use**.<br>

## Sponsors
<img src="https://cdn-uploads.huggingface.co/production/uploads/5fcd9c426d942eaf4d1ebd30/42d-GioSs4evIdNuMAaPB.png" width="600px">

## Support
- Official website: https://openthaigpt.aieat.or.th
- Facebook page: https://web.facebook.com/groups/openthaigpt
- A Discord server for discussion and support [here](https://discord.gg/rUTp6dfVUF)
- E-mail: kobkrit@aieat.or.th

## Prompt Format
The prompt format is based on Llama 2 with one small modification: "###" separates the user's message from the context part.
```
<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{human_turn1}###{context_turn1} [/INST]{assistant_turn1}</s><s>{human_turn2}###{context_turn2} [/INST] ...
```
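
As a minimal sketch, the template can be assembled in Python as follows. The helper `build_prompt` and the `Turn` tuple are hypothetical names, not part of any OpenThaiGPT library. The opening system tag is written as `<<SYS>>`, matching the curl request in "How to use", and the context part is assumed to be optional, as in the single-turn example without context.

```python
from typing import List, Optional, Tuple

DEFAULT_SYSTEM_PROMPT = (
    "You are a question answering assistant. Answer the question as truthful "
    "and helpful as possible คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด"
)

# One turn = (human message, optional context, optional assistant reply).
# Hypothetical structure chosen for this sketch.
Turn = Tuple[str, Optional[str], Optional[str]]


def build_prompt(turns: List[Turn], system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Assemble a prompt string following the template above."""
    prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    for index, (human, context, assistant) in enumerate(turns):
        if index > 0:
            # Per the template, later turns start with "<s>" only.
            prompt += "<s>"
        prompt += human
        if context:
            # "###" separates the human message from the retrieved context.
            prompt += f"###{context}"
        prompt += " [/INST]"
        if assistant is not None:
            # Completed turns end with the assistant reply and "</s>".
            prompt += f"{assistant}</s>"
    return prompt
```

For example, `build_prompt([("สวัสดี", None, None)])` yields a single-turn prompt that ends with `สวัสดี [/INST]`. Leaving the last turn's assistant reply as `None` keeps the prompt open for the model to complete.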

### System prompt:
```
You are a question answering assistant. Answer the question as truthful and helpful as possible คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด
```

### Single Turn Conversation Example
```
<s>[INST] <<SYS>>
You are a question answering assistant. Answer the question as truthful and helpful as possible คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด
<</SYS>>

สวัสดี [/INST]
```

### Single Turn Conversation with Context (RAG) Example
```
<s>[INST] <<SYS>>
You are a question answering assistant. Answer the question as truthful and helpful as possible คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด
<</SYS>>

กรุงเทพมีพื้นที่เท่าไร่###กรุงเทพมหานคร เป็นเมืองหลวง นครและมหานครที่มีประชากรมากที่สุดของประเทศไทย กรุงเทพมหานครมีพื้นที่ทั้งหมด 1,568.737 ตร.กม. มีประชากรตามทะเบียนราษฎรกว่า 8 ล้านคน [/INST]
```


## How to use

1. Install vLLM (https://github.com/vllm-project/vllm).
2. Start the API server: `python -m vllm.entrypoints.api_server --model /path/to/model --tensor-parallel-size num_gpus`, replacing `num_gpus` with the number of GPUs to use.
3. Run inference (curl example):

```
curl --request POST \
    --url http://localhost:8000/generate \
    --header "Content-Type: application/json" \
    --data '{"prompt": "<s>[INST] <<SYS>>\nYou are a question answering assistant. Answer the question as truthful and helpful as possible คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด\n<</SYS>>\n\nอยากลดความอ้วนต้องทำอย่างไร [/INST]","use_beam_search": false, "temperature": 0.1, "max_tokens": 512, "top_p": 0.75, "top_k": 40, "frequency_penalty": 0.3, "stop": "</s>"}'
```
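
The same request can be sent from Python using only the standard library. This is a sketch rather than an official client: `build_generate_request` and `generate` are hypothetical helper names, the URL and sampling parameters are copied from the curl example, and the shape of the JSON response is not documented here, so it is returned unparsed beyond `json.loads`.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str, url: str = "http://localhost:8000/generate"
) -> urllib.request.Request:
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = {
        "prompt": prompt,
        "use_beam_search": False,
        "temperature": 0.1,
        "max_tokens": 512,
        "top_p": 0.75,
        "top_k": 40,
        "frequency_penalty": 0.3,
        "stop": "</s>",
    }
    # ensure_ascii=False keeps Thai characters readable; the body is sent as UTF-8.
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> dict:
    """Send the request to a running server and return the decoded JSON response."""
    with urllib.request.urlopen(build_generate_request(prompt)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from step 2 running, calling `generate(prompt)` with a prompt in the format described under "Prompt Format" sends the request. Building the body with `json.dumps` avoids hand-written JSON mistakes such as a missing comma.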

## Authors
* Kobkrit Viriyayudhakorn (kobkrit@aieat.or.th)
* Sumeth Yuenyong (sumeth.yue@mahidol.edu)
* Thaweewat Rugsujarit (thaweewr@scg.com)
* Jillaphat Jaroenkantasima (autsadang41@gmail.com)
* Norapat Buppodom (new@norapat.com)
* Koravich Sangkaew (kwankoravich@gmail.com)
* Peerawat Rojratchadakorn (peerawat.roj@gmail.com)
* Surapon Nonesung (nonesungsurapon@gmail.com)
* Chanon Utupon (chanon.utupon@gmail.com)
* Sadhis Wongprayoon (sadhis.tae@gmail.com)
* Nucharee Thongthungwong (nuchhub@hotmail.com)
* Chawakorn Phiantham (mondcha1507@gmail.com)
* Patteera Triamamornwooth (patt.patteera@gmail.com)
* Nattarika Juntarapaoraya (natt.juntara@gmail.com)
* Kriangkrai Saetan (kraitan.ss21@gmail.com)
* Pitikorn Khlaisamniang (pitikorn32@gmail.com)

<i>Disclaimer: Provided responses are not guaranteed.</i>