|
--- |
|
license: mit |
|
language: |
|
- my |
|
pipeline_tag: text-generation |
|
metrics: |
|
- code_eval |
|
library_name: transformers |
|
tags: |
|
- burmese |
|
- gpt2 |
|
- pre-trained |
|
--- |
|
|
|
Simbolo's Myanmarsar-GPT is pre-trained on a dataset of 20,000 Burmese sentences using the GPT-2 architecture of the mGPT model. It is intended to serve as a foundational pre-trained model for the Burmese language, facilitating fine-tuning for downstream tasks such as creative writing, chatbots, and machine translation.
|
|
|
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6598b82502c4796342239a35/rFId3-xyzWW-juDq_er9k.jpeg) |
|
|
### How to use |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
# load the pre-trained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT")
|
model = AutoModelForCausalLM.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT") |
|
|
|
input_text = ""  # replace the empty string with a Burmese prompt
|
input_ids = tokenizer.encode(input_text, return_tensors='pt') |
|
output = model.generate(input_ids, max_length=50) |
|
print(tokenizer.decode(output[0], skip_special_tokens=True)) |
|
``` |
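With only `max_length` set, `generate` falls back to greedy decoding, which can produce repetitive text. Continuing from the snippet above, a sampling-based call looks like this; the parameter values are illustrative and not tuned for this model:

```python
# sampling-based generation, reusing model, tokenizer and input_ids from above
output = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,                        # sample instead of greedy decoding
    top_p=0.95,                            # nucleus sampling
    temperature=0.8,                       # sharpen/soften the distribution
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 tokenizers define no pad token
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```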
|
### Data |
|
We use 20,000 Burmese sentences from our open-source [dataset](https://huggingface.co/datasets/Simbolo-Servicio/wiki-burmese-sentences), which contains 100,000 sentences sourced from Wikipedia.
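Since the model is intended as a foundation for downstream fine-tuning, the sketch below shows one way to continue causal-LM training on this dataset with the Hugging Face `Trainer`. The text column name (`sentence`), sequence length, and hyperparameters are illustrative assumptions; check the dataset card and adjust them for your task.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# the text column is assumed to be called "sentence"; check the dataset card
dataset = load_dataset("Simbolo-Servicio/wiki-burmese-sentences", split="train")

tokenizer = AutoTokenizer.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT")
model = AutoModelForCausalLM.from_pretrained("Simbolo-Servicio/Myanmarsar-GPT")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers define no pad token

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# mlm=False gives the causal language-modelling objective used by GPT-2
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="myanmarsar-gpt-finetuned",   # hypothetical output directory
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```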
|
|
|
### Contributors |
|
Main Contributor: [Sa Phyo Thu Htet](https://github.com/SaPhyoThuHtet) |
|
Wikipedia Data Crawling: Kaung Kaung Ko Ko, Phuu Pwint Thinzar Kyaing |
|
Releasing the Model: Eithandaraung, Ye Yint Htut, Thet Chit Su, Naing Phyo Aung, Nyan Linn Phyo Zaw, Lynn Thu Kha |
|
|
|
### Acknowledgment |
|
We extend our gratitude to the creators of the [mGPT-XL](https://huggingface.co/ai-forever/mGPT) model for their invaluable contribution to this project and to the field of Burmese NLP.
|
We also want to thank everyone who has worked on related projects, especially [MinSithu](https://huggingface.co/jojo-ai-mst/MyanmarGPT) and [Dr. Wai Yan Nyein Naing](https://huggingface.co/WYNN747/Burmese-GPT), who initiated work on GPT-2 models for Burmese.
|
We would also like to thank Simbolo:Servicio, a branch of Simbolo under the Intello Tech company, for providing financial support.
|
|
|
### Limitations and Bias |
|
We have not yet thoroughly investigated the potential biases inherent in this model. For transparency, note that the model is trained primarily on Unicode-encoded Burmese (Myanmar) text.
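Because the training text is Unicode-encoded, prompts written in the legacy Zawgyi encoding will likely tokenize poorly. As a rough guard, inputs could be screened with the third-party `myanmartools` package; this is an assumption for illustration, not part of this repository:

```python
# pip install myanmartools   (assumed third-party package, not part of this model)
from myanmartools import ZawgyiDetector

detector = ZawgyiDetector()
text = "..."  # the Burmese prompt you intend to feed to the model
score = detector.get_zawgyi_probability(text)  # close to 1.0 means likely Zawgyi
if score > 0.5:
    print("Prompt looks Zawgyi-encoded; convert it to Unicode before generation.")
```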
|
|
|
### References |
|
1. Jiang, Shengyi, Huang, Xiuwen, Cai, Xiaonan, & Lin, Nankai. (2021). Pre-trained Models and Evaluation Data for the Myanmar Language. doi:10.1007/978-3-030-92310-5_52.

2. Lin, N., Fu, Y., Chen, C., Yang, Z., & Jiang, S. (2021). LaoPLM: Pre-trained Language Models for Lao. arXiv:2110.05896.

3. MinSithu. MyanmarGPT, version 1.1-SweptWood. https://huggingface.co/jojo-ai-mst/MyanmarGPT

4. Wai Yan Nyein Naing. Burmese-GPT. https://huggingface.co/WYNN747/Burmese-GPT

5. Sai Htaung Kham. saihtaungkham/BurmeseRoBERTaCLM.

6. Shliazhko, O., Fenogenova, A., Tikhonova, M., Mikhailov, V., Kozlova, A., & Shavrina, T. (2022). mGPT: Few-Shot Learners Go Multilingual. arXiv:2204.07580.
|
|
|
### How to Cite This Work
|
Sa Phyo Thu Htet, Simbolo (2023). Myanmarsar-GPT, https://huggingface.co/Simbolo-Servicio/Myanmarsar-GPT/ |