Love the work and have a Suggestion

#1
by Lyte - opened

Hello there! I hope you are having a wonderful day (whoever is reading this).

I stumbled upon your work the other day, and I am honestly very excited to see more and more people picking up and building/paving the way for the Moroccan language (Darija). I see that you are using the OpenAI API, and while this is a great idea, it's also an expensive one.

Today I found out that Meta AI has released v2 of SeamlessM4T, its foundation model for translation, and I wanted to bring it to your attention. I had previously run a small test using v1 to translate English to Moroccan Darija and study the results. While the output wasn't of great quality, it saved me a lot of time and money. I ran it on both Kaggle's free GPUs and on Colab, and it worked great: I got 10k rows of data in no time on both platforms, and I only had to do some cleaning to get a semi-working dataset out of it. So I believe that with the release of this new v2 model, you can build and achieve your goals (which align with mine) even faster and cheaper. Hopefully, this can be of help to your team.

Cheers.

Some examples I made at the time of writing this (zero-shot, not cherry-picked):

Screenshot of Translation #1
Screenshot of Translation #2

Mixed Arabic Datasets org

Hi there, first, thank you for your interest in this work. Indeed, this latest model, SeamlessM4T, is quite good, and we are actually putting it under some heavy testing to see how we can get the most out of it. The translations you provided are good; however, one small problem we faced was related to Darija named entities and some difficult Darija words (but again, we are still testing it further 😅).
If you are interested in collaborating with us, you are welcome to join the Discord server. We'd love to work together.

Indeed, I faced a similar problem when I tried the first version of the model. It wasn't a good solution for a large dataset, but it was usable for a small one. I used it to create a small dataset, which I added to the one I had already worked with.

To address the issue of "erroneous" words, I compiled a list of words that the model consistently got wrong and replaced them with "Darija-correct" alternatives. This approach yielded promising results, and I ended up hand-curating the dataset. I am almost done with it and plan to use it soon to train a translation model.
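For anyone who wants to try the same idea, here is a minimal sketch of a correction-list pass in Python. It is only an illustration, not the exact script used for the dataset: the `CORRECTIONS` entries and the function names are hypothetical, and a real list would come from reviewing the model's output by hand.

```python
import re

# Hypothetical correction list: word the model tends to output -> preferred Darija form.
# These two entries are illustrative only.
CORRECTIONS = {
    "ماذا": "شنو",
    "كيف": "كيفاش",
}


def build_pattern(corrections):
    """Compile one regex that matches any listed word as a whole word."""
    # Longest keys first, so a longer entry wins over an entry that is its prefix.
    keys = sorted(corrections, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    # The lookarounds reject matches that sit inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace every listed word in `text` with its correction."""
    if not corrections:
        return text
    pattern = build_pattern(corrections)
    return pattern.sub(lambda match: corrections[match.group(0)], text)
```

Matching whole words matters here: without it, a word that is already correct (such as "كيفاش") would be rewritten again because it starts with a listed word. Text with diacritics or punctuation attached to words may need extra normalization before this step.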

The release of this new model has given me even more hope of creating a high-quality dataset. From what I've observed, it produces noticeably fewer "erroneous" words. Although it takes some effort (which I've been putting in all by myself), we could use the same approach of maintaining a list of "erroneous" words.

I am very interested in joining your Discord, although the link on your page doesn't work.

Mixed Arabic Datasets org

Hey, maybe try this link https://discord.gg/nahaVstK

Mixed Arabic Datasets org

@Lyte Joining efforts is incredible, and this is why we all love OS 🔥 Consider joining us, since we are all aiming for the same goal 💯

Yep, thanks, that works.
