German Training Data

#3 by WANGYIWEI

Servus Oliver :))

Thanks very much for your great work. I checked your source code on GitHub and found the CLI command you used to fine-tune the German spell-correction model:

python run_summarization.py \
    --model_name_or_path facebook/mbart-large-50 \
    --do_train \
    --do_eval \
    --train_file de.train.csv \
    --validation_file de.test.csv \
    --output_dir ./models/mbart-large-50-spelling-de/ \
    ......
    --lang="de"

I just wanted to ask about the German training data, de.train.csv, as I am interested in scaling up the base model a bit with flan-t5-large. Could you say a little about the availability of the training data?

Hey @WANGYIWEI ,
basically, I took the training data from the Leipzig Corpora Collection. They provide clean sentences for many languages. I placed the files that I used in this folder. If you would like to scale up the model, you can add more files there from the Leipzig Corpora Collection or other sources.

The process to create a dataset in the format of the "de.train.csv" file:

  1. Place the text files for your language into the "data/raw" folder.
  2. Combine the single files into one big file with this script.
  3. Run generate_dataset.py to generate the training file. You have to adjust the path and language in the Python script in this line. (A rough sketch of these steps follows below.)
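To give you a rough idea, this is roughly what the three steps boil down to. It is a simplified sketch, not the actual generate_dataset.py: the corruption rule is a toy stand-in, the "text"/"summary" column names are just an assumption here (run_summarization.py lets you set the columns with --text_column and --summary_column), and the paths are placeholders.

import csv
import random
from pathlib import Path

RAW_DIR = Path("data/raw")        # step 1: raw text files go here
OUT_FILE = Path("de.train.csv")   # the generated training file

def load_sentences(raw_dir):
    """Read all raw files; Leipzig sentence files are 'ID<TAB>sentence' per line."""
    for path in sorted(raw_dir.glob("*.txt")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                sentence = line.rstrip("\n").split("\t", 1)[-1]
                if sentence:
                    yield sentence

def corrupt(sentence, rng):
    """Toy corruption: randomly drop, swap, or duplicate single characters."""
    chars = list(sentence)
    for _ in range(max(1, len(chars) // 20)):   # about one edit per 20 characters
        i = rng.randrange(len(chars))
        op = rng.choice(("drop", "swap", "dup"))
        if op == "drop" and len(chars) > 1:
            del chars[i]
        elif op == "swap" and i < len(chars) - 1:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        else:
            chars.insert(i, chars[i])
    return "".join(chars)

def main():
    rng = random.Random(42)
    with OUT_FILE.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "summary"])    # corrupted input, correct target
        for sentence in load_sentences(RAW_DIR):  # steps 2 + 3 combined
            writer.writerow([corrupt(sentence, rng), sentence])

if __name__ == "__main__":
    main()

Running something like this over the combined Leipzig files should give you a file that plugs straight into the run_summarization.py command above.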

Please let me know if you have some other questions or some results to share :)

Best,
Oliver

Hi Oliver,

Thanks for the feedback. I have slightly modified your script locally to make it more suitable for generating German data. It worked perfectly well and saved a lot of time on the data engineering work. I will start the first round of optimisation experiments with flan-t5-large by the end of this week.

Also, regarding the corruption methods, I might have something interesting to share. I have worked on GEC before, though on English grammatical error detection (binary classification). For that task I used Google's C4_200M Synthetic Dataset to train the classifier. They introduced linguistically motivated corruption methods to better reflect the true distribution of error types from real users, which I find both logical and practical.
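To give a flavour of what I mean, here is a toy sketch of German-specific corruptions sampled from a weighted error distribution. Everything in it is hypothetical: the error categories and weights are made up for illustration, and the real C4_200M corruptions come from a learned tagged corruption model (Stahlberg & Kumar, 2021), not hand-written rules like these.

import random

# Hypothetical error-type distribution, purely for illustration.
ERROR_WEIGHTS = {
    "umlaut": 0.3,      # ä/ö/ü written as ae/oe/ue
    "eszett": 0.2,      # ß written as ss
    "lowercase": 0.3,   # German noun capitalisation dropped
    "char_swap": 0.2,   # simple transposition typo
}

def corrupt_german(sentence, rng):
    """Apply one weighted-sampled, German-specific corruption to a sentence."""
    error = rng.choices(list(ERROR_WEIGHTS), weights=list(ERROR_WEIGHTS.values()))[0]
    if error == "umlaut":
        for a, b in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
            if a in sentence:
                return sentence.replace(a, b, 1)
    elif error == "eszett" and "ß" in sentence:
        return sentence.replace("ß", "ss", 1)
    elif error == "lowercase":
        words = sentence.split()
        caps = [i for i, w in enumerate(words[1:], start=1) if w[:1].isupper()]
        if caps:
            i = rng.choice(caps)
            words[i] = words[i].lower()
            return " ".join(words)
    else:  # char_swap: transpose two adjacent characters
        i = rng.randrange(max(1, len(sentence) - 1))
        return sentence[:i] + sentence[i + 1:i + 2] + sentence[i:i + 1] + sentence[i + 2:]
    return sentence  # fall through if the sampled error does not apply

print(corrupt_german("Die Straße führt über die Brücke.", random.Random(0)))

The point is just that sampling error types from a realistic distribution, rather than uniformly, should bring the synthetic data closer to what actual users write.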

I might therefore explore this direction, if it doesn't take too much effort, and will keep you updated.

Best regards,
Yiwei Wang
