---
license: mit
language:
- my
pipeline_tag: text-generation
metrics:
- code_eval
library_name: transformers
tags:
- burmese
- gpt2
- pre-trained
---
Simbolo's Myanmarsar-GPT is a Burmese text-generation model (it is not a chatbot itself, but it can be used to build one). It is pre-trained on a dataset of 20,000 Burmese sentences using the GPT-2 architecture of the mGPT model. Its purpose is to serve as a foundational pre-trained model for the Burmese language, facilitating fine-tuning for downstream tasks such as creative writing, chatbots, and machine translation.
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6598b82502c4796342239a35/rFId3-xyzWW-juDq_er9k.jpeg)
### How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT")
model = AutoModelForCausalLM.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT")

# Burmese prompt meaning "education"
input_text = "ပညာရေး"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate a continuation of up to 50 tokens
output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
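Since the model is intended as a base for fine-tuning, the sketch below shows one possible way to continue causal-language-model training with the Hugging Face `Trainer`. It is a minimal example under stated assumptions, not the official training recipe: the hyperparameters, output directory, and the `text` field name of the dataset (described in the Data section below) are illustrative guesses and may need adjusting.

```python
# Hypothetical fine-tuning sketch; hyperparameters and field names are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Simbolo-Servicio/Myanmarsar-GPT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2-style tokenizers often lack a pad token; reuse EOS for padding if needed.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Reuse the open-source Burmese sentence dataset referenced in this card.
dataset = load_dataset("Simbolo-Servicio/wiki-burmese-sentences", split="train")

def tokenize(batch):
    # Assumes the dataset exposes a "text" column; adjust the field name if different.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="myanmarsar-gpt-finetuned",  # assumed output path
    per_device_train_batch_size=8,
    num_train_epochs=1,
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```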
### Data
We use 20,000 Burmese sentences, most of which come from our open-source [dataset](https://huggingface.co/datasets/Simbolo-Servicio/wiki-burmese-sentences) of 100,000 sentences sourced from Wikipedia.
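The dataset is hosted on the Hugging Face Hub and can be inspected directly with the `datasets` library; the snippet below is a quick sketch (the field names printed will depend on how the dataset is structured).

```python
# Quick look at the open-source Burmese sentence dataset mentioned above.
from datasets import load_dataset

ds = load_dataset("Simbolo-Servicio/wiki-burmese-sentences", split="train")
print(ds)      # number of rows and column names
print(ds[0])   # first example sentence
```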
### Contributors
- Main Contributor: [Sa Phyo Thu Htet](https://github.com/SaPhyoThuHtet)
- Wikipedia Data Crawling: Kaung Kaung Ko Ko, Phuu Pwint Thinzar Kyaing
- Releasing the Model: Eithandaraung, Ye Yint Htut, Thet Chit Su, Naing Phyo Aung, Nyan Linn Phyo Zaw, Lynn Thu Kha
### Acknowledgment
We extend our gratitude to the creators of the [mGPT-XL](https://huggingface.co/ai-forever/mGPT) models for their invaluable contribution to this project.
We also want to thank everyone who has worked on related projects, especially [Minsithu](https://huggingface.co/jojo-ai-mst/MyanmarGPTT) and
[Dr. Wai Yan Nyein Naing](https://huggingface.co/WYNN747/Burmese-GPT), who initiated work on GPT-2 models for Burmese.
We would also like to thank Simbolo:Servico, a branch of Simbolo under the company Intello Tech, for providing financial support.
### Limitations and Bias
We have yet to thoroughly investigate the potential bias inherent in this model. Regarding transparency, it is important to note that the model is primarily trained on Unicode-encoded Burmese (Myanmar) text.
### References
1. Jiang, S., Huang, X., Cai, X., & Lin, N. (2021). Pre-trained Models and Evaluation Data for the Myanmar Language. https://doi.org/10.1007/978-3-030-92310-5_52
2. Lin, N., Fu, Y., Chen, C., Yang, Z., & Jiang, S. (2021). LaoPLM: Pre-trained Language Models for Lao. arXiv:2110.05896
3. MinSithu. MyanmarGPT, 1.1-SweptWood. https://huggingface.co/jojo-ai-mst/MyanmarGPT
4. Wai Yan Nyein Naing. Burmese-GPT. https://huggingface.co/WYNN747/Burmese-GPT
5. Sai Htaung Kham. saihtaungkham/BurmeseRoBERTaCLM
6. Shliazhko, O., Fenogenova, A., Tikhonova, M., Mikhailov, V., Kozlova, A., & Shavrina, T. (2022). mGPT: Few-Shot Learners Go Multilingual. arXiv:2204.07580
### How to Cite this Work
```bibtex
@misc{myanmarsar-gpt,
  author  = {Sa Phyo Thu Htet},
  title   = {Myanmarsar GPT},
  url     = {https://huggingface.co/Simbolo-Servicio/Myanmarsar-GPT},
  urldate = {2024-01-09},
  date    = {2024-01-09}
}
```