pszemraj/flan-t5-large-grammar-synthesis · which dataset split was used for training?

Hi! Thanks for the additions :)

To answer the question: JFLEG is part of the dataset used to fine-tune the models, but the dataset is much more than that. I've written some details here (if that doesn't cover everything, I recommend checking the other closed discussions too, as I went into more details/stats on the dataset there). Basically, it's a lot of augmentations on both the val and test splits of JFLEG plus some other custom data. After several steps in my pipeline, I split the result back into train/test/validation and then train the text2text model on that.
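For anyone who wants a concrete picture, here is a minimal Python sketch of the general shape of such a pipeline (pool the pairs, augment, deduplicate, re-split). It is not the actual pipeline used for these models: the noise function `swap_adjacent_chars`, the helper `build_splits`, the 80/10/10 ratios, and the toy data are all assumptions made up for illustration.

```python
import random


def swap_adjacent_chars(text, rng):
    """Toy noise function (hypothetical): swap two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def build_splits(pairs, n_aug=2, ratios=(0.8, 0.1, 0.1), seed=0):
    """Augment (source, target) pairs, deduplicate, and re-split them.

    `pairs` stands in for the pooled val + test examples plus any custom
    data; the ratios are an assumed 80/10/10 split, not the real one.
    """
    rng = random.Random(seed)
    pool = set(pairs)  # a set drops exact duplicates
    for src, tgt in pairs:
        for _ in range(n_aug):
            # Noisy source, same corrected target.
            pool.add((swap_adjacent_chars(src, rng), tgt))

    rows = sorted(pool)  # fixed order so the shuffle is reproducible
    rng.shuffle(rows)

    n_train = int(len(rows) * ratios[0])
    n_val = int(len(rows) * ratios[1])
    return {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],
    }


if __name__ == "__main__":
    toy_pairs = [
        ("she go to school", "She goes to school."),
        ("i has a apple", "I have an apple."),
        ("they was happy", "They were happy."),
    ]
    for name, rows in build_splits(toy_pairs).items():
        print(name, len(rows))
```

One thing this sketch does not handle: noisy variants of the same target sentence can land in different splits, so a real pipeline would probably want to split by target sentence rather than by row.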

I have been waiting until I get a chance to create an improved version of the dataset, compute some metrics, and write a short paper or blog post about it before releasing the dataset. Unfortunately, I have been too busy to get to that, but hopefully later this year! When I do release an improved version, I plan to drop JFLEG altogether so that it can be released under a much more permissive license.

BTW, Grammarly released some models and a paper about their custom dataset and method earlier this year; you might be interested in that. Unfortunately, AFAIK the dataset itself has not been released yet. I have an open issue here

Hope this helps!