Request to Train Chat Language Model with Our English-Sinhala Translation Dataset

#189
by zaanind - opened

I am reaching out to you with a request to train your chat language model with a specific dataset to enhance its Sinhala language capabilities.

Dataset Information:

Dataset Name: Eng-Sinhala Translation Dataset
Size: Approximately 80,000 lines of English-Sinhala translation pairs
Dataset Link: https://huggingface.co/datasets/zaanind/sinhala_englsih_parrel_corpus
Data License: GPL (GNU General Public License)

I have recently created a dataset consisting of English-Sinhala translation pairs to address the lack of Sinhala knowledge in your chat language model. The dataset aims to improve the model's ability to engage in conversations and provide accurate responses in Sinhala.

Given the size of the dataset, I have already trained a separate language model on it, with promising results. However, I believe that incorporating this dataset into your existing chat language model would further improve its Sinhala understanding and make it more valuable for users who communicate in Sinhala.

I kindly request your assistance in training your chat language model with the provided Eng-Sinhala Translation Dataset. By doing so, you can help expand the model's language capabilities and make it more inclusive for Sinhala-speaking users.

I understand that training a language model requires expertise and computational resources that are not readily available to me. Therefore, I seek your support and collaboration in this endeavor.

I am happy to provide the dataset files (src.txt and tgt.txt) to facilitate the training process. Additionally, the dataset is available for download from the provided link, ensuring transparency and easy access for other researchers and developers interested in Sinhala language processing.

https://huggingface.co/datasets/zaanind/sinhala_englsih_parrel_corpus

Files:

src.txt: This file contains the source sentences in English. Each line corresponds to an English sentence.
tgt.txt: This file contains the target sentences in Sinhala. Each line corresponds to the Sinhala translation of the corresponding English sentence in src.txt.
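
Since the two files are aligned line by line, they can be paired by reading both and matching on line number. The following is a minimal sketch, assuming both files are UTF-8 encoded and have the same number of lines; the function name `load_parallel_pairs` is hypothetical and not part of the dataset.

```python
from pathlib import Path


def load_parallel_pairs(src_path, tgt_path):
    """Hypothetical helper: return (english, sinhala) pairs matched by line number."""
    # Assumption: both files are UTF-8 text with one sentence per line.
    with open(src_path, encoding="utf-8") as f:
        src_lines = [line.rstrip("\n") for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = [line.rstrip("\n") for line in f]

    # Line i of tgt.txt translates line i of src.txt, so the counts must match.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} source vs {len(tgt_lines)} target"
        )
    return list(zip(src_lines, tgt_lines))


if __name__ == "__main__":
    # Runs only if the two files are present in the working directory.
    if Path("src.txt").exists() and Path("tgt.txt").exists():
        pairs = load_parallel_pairs("src.txt", "tgt.txt")
        print(len(pairs), "pairs loaded")
        if pairs:
            print(pairs[0])
```

Checking the line counts up front catches a misaligned download early, since a single missing line would silently shift every later translation pair.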

Hugging Chat org

Hi! That's interesting, but we didn't train the model ourselves. The model currently in use is this one: OpenAssistant/oasst-sft-6-llama-30b.

OK, I will try to inform them :) Thanks for the reply.

nsarrazin changed discussion status to closed
