Which dataset split was used for training?

#11
by Marovlo - opened

Hello, thanks for this awesome model. I want to train a Llama model to compare with this one. Could you possibly tell me which split of the JFLEG dataset you used for training and which for evaluation, since there is no official train split in JFLEG? Thank you!

Hi! Thanks for the additions :)

To answer the question: JFLEG is part of the dataset used to fine-tune the models, but the dataset is much more than that. I've written some details here (if this one doesn't have all of it, I recommend checking the other closed discussions too, as I discussed more details/stats on the dataset there). Basically, it's a lot of augmentations on both the val and test splits of JFLEG plus some other custom data. After several steps in my pipeline, I split the result back into train/test/validation and then train the text2text model on that.
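For anyone trying to set up a comparable split, here is a minimal sketch of the "pool everything, then re-split" idea described above. It is not the actual pipeline: the augmentation steps are not specified in this thread, so they are left out, and the function name `pool_and_resplit`, the 80/10/10 ratio, the seed, and the toy sentence pairs are all assumptions made up for illustration.

```python
import random


def pool_and_resplit(pairs, seed=0, frac=(0.8, 0.1, 0.1)):
    """Shuffle (source, target) pairs and cut them into train/validation/test.

    Hypothetical helper: the ratio and seed are illustrative assumptions,
    not values taken from the real pipeline.
    """
    unique = sorted(set(pairs))  # drop exact duplicates, fix a stable order
    rng = random.Random(seed)    # local RNG so the split is reproducible
    rng.shuffle(unique)
    n_train = int(len(unique) * frac[0])
    n_val = int(len(unique) * frac[1])
    return {
        "train": unique[:n_train],
        "validation": unique[n_train:n_train + n_val],
        "test": unique[n_train + n_val:],
    }


# Toy stand-ins for JFLEG val, JFLEG test, and the custom data (all made up).
jfleg_val = [(f"val sentense {i}", f"val sentence {i}") for i in range(10)]
jfleg_test = [(f"test sentense {i}", f"test sentence {i}") for i in range(10)]
custom = [(f"custom sentense {i}", f"custom sentence {i}") for i in range(10)]

# Augmentation would happen here in a real pipeline; it is omitted on purpose.
pooled = jfleg_val + jfleg_test + custom
splits = pool_and_resplit(pooled)
```

Note that with this kind of scheme, sentences from JFLEG's official test split can end up in the new train split, so the official JFLEG test set is not a clean held-out set for a model trained this way.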

I have been waiting until I get a chance to create an improved version of the dataset, compute some metrics, and write a short paper or blog post about it before releasing the dataset. Unfortunately, I have been too busy to get to that, but hopefully later this year! When I do release an improved version, I plan to drop JFLEG altogether so that it can be released under a much more permissive license.

BTW - Grammarly released some models and a paper about their custom dataset and method earlier this year; you might be interested in that. Unfortunately, AFAIK the dataset itself has not been released yet. I have an open issue here

Hope this helps!
