---
license: mit
language:
- my
pipeline_tag: text-generation
metrics:
- code_eval
library_name: transformers
tags:
- burmese
- gpt2
- pre-trained
---

Simbolo's Myanmarsar-GPT is a Burmese language model pre-trained on 1 million Burmese sentences using the GPT-2 architecture. It is intended as a foundational pre-trained model for the Burmese language, to be fine-tuned for downstream applications such as creative writing, chatbots, and machine translation.
![myanmarsar-gpt](https://huggingface.co/Simbolo-Servicio/Myanmarsar-GPT/blob/main/siimobolo%20%E1%80%99%E1%80%BC%E1%80%94%E1%80%BA%E1%80%99%E1%80%AC%20Gpt%20-%202.jpg)


### How to use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the pre-trained Burmese GPT-2 model from the Hub
tokenizer = AutoTokenizer.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT")
model = AutoModelForCausalLM.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT")

input_text = ""  # supply a Burmese prompt here to condition the generation
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate up to 50 tokens and decode the result back to text
output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
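
Equivalently, generation can be run through the `pipeline` helper from `transformers`. This is a minimal sketch; the empty prompt is a placeholder to replace with Burmese text.

```python
from transformers import pipeline

# Text-generation pipeline backed by Myanmarsar-GPT
generator = pipeline("text-generation", model="Simbolo-Servicio/Myanmarsar-GPT")

prompt = ""  # placeholder: supply a Burmese prompt
outputs = generator(prompt, max_length=50)
print(outputs[0]["generated_text"])
```
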
### Data
The training data comprises 1 million Burmese sentences sourced from Wikipedia.
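
Since the model is intended as a base for fine-tuning on downstream Burmese tasks, the following is a minimal sketch of continued causal-LM training on a small custom corpus with the Hugging Face `Trainer`. The in-memory dataset, hyperparameters, and output directory are illustrative placeholders, not settings used by the authors.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT")
model = AutoModelForCausalLM.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT")

# GPT-2 tokenizers usually lack a pad token; reuse the EOS token for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus: replace with your own Burmese sentences
corpus = Dataset.from_dict({"text": ["..."]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives standard causal-LM labels (inputs shifted inside the model)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="myanmarsar-gpt-finetuned",  # illustrative output path
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```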

### Contributors
- Main Contributor: [Sa Phyo Thu Htet](https://github.com/SaPhyoThuHtet)
- Wikipedia Data Crawling: Kaung Kaung Ko Ko, Phuu Pwint Thinzar Kyaing
- Releasing the Model: Eithandaraung, Ye Yint Htut, Thet Chit Su, Naing Phyo Aung, Nyan Linn Phyo Zaw, Lynn Thu Kha

### Acknowledgment
We extend our gratitude to the creators of the [mGPT-XL](https://huggingface.co/ai-forever/mGPT) model for their invaluable contribution to this project and to the field of Burmese NLP.
We also want to thank everyone who has worked on related projects, especially [MinSithu](https://huggingface.co/jojo-ai-mst/MyanmarGPT) and [Dr. Wai Yan Nyein Naing](https://huggingface.co/WYNN747/Burmese-GPT), who initiated work on GPT-2 models for Burmese.
Finally, we would like to thank Simbolo: Servicio, a branch of Simbolo under Intello Tech, for providing financial support.

### Limitations and bias
We have not yet thoroughly investigated the potential biases inherent in this model. For transparency, note that the model is trained primarily on Unicode-encoded Burmese (Myanmar) text.

### References
1. Jiang, S., Huang, X., Cai, X., & Lin, N. (2021). Pre-trained Models and Evaluation Data for the Myanmar Language. https://doi.org/10.1007/978-3-030-92310-5_52
2. Lin, N., Fu, Y., Chen, C., Yang, Z., & Jiang, S. (2021). LaoPLM: Pre-trained Language Models for Lao. arXiv:2110.05896.
3. MinSithu. MyanmarGPT (1.1-SweptWood). https://huggingface.co/jojo-ai-mst/MyanmarGPT
4. Wai Yan Nyein Naing. Burmese-GPT. https://huggingface.co/WYNN747/Burmese-GPT
5. Sai Htaung Kham. saihtaungkham/BurmeseRoBERTaCLM.
6. Shliazhko, O., Fenogenova, A., Tikhonova, M., Mikhailov, V., Kozlova, A., & Shavrina, T. (2022). mGPT: Few-Shot Learners Go Multilingual. arXiv:2204.07580.