language:
- de
metrics:
- sari
- bleu
- bertscore
library_name: transformers
pipeline_tag: text2text-generation
tags:
- text simplification
- plain language
- easy-to-read language
- sentence simplification
Model Card for mt5-simple-german-corpus
This model simplifies German texts into plain German. It is one of the systems from the experiments in Stodden (2024, to appear), "Reproduction & Benchmarking of German Text Simplification Systems", in Proceedings of the 1st Workshop on Evaluating Text Difficulty in a Multilingual Context (DeTermIt!), Turin, Italy.
Model Details
Model Description
- Developed by: Regina Stodden
- Model type: Text2Text Generation
- Language(s) (NLP): German, Plain German, Easy-to-Read German
- License: [More Information Needed]
- Finetuned from model: https://huggingface.co/google/mt5-base
Model Sources
- Repository: https://huggingface.co/DEplain/mt5-simple-german-corpus
- Paper: Stodden (2024, to appear). "Reproduction & Benchmarking of German Text Simplification Systems". In Proceedings of the 1st Workshop on Evaluating Text Difficulty in a Multilingual Context (DeTermIt!), Turin, Italy.
Uses
Direct Use & Downstream Use
mt5-simple-german-corpus is intended to simplify German sentences for people who have difficulty reading German texts. It is a version of mT5-base fine-tuned on simple-german-corpus, a German text simplification corpus from the web domain. The intended use is sentence simplification, where the source language is standard German and the target language is plain or easy-to-read German.
Out-of-Scope Use
mt5-simple-german-corpus is fine-tuned on complex-simple sentence pairs from the web domain (including mixed domains) that were written for different target groups, e.g., German learners and people with cognitive disabilities. Hence, we assume that the model will not work well for use cases other than text simplification, for languages other than German, or for target groups other than non-native speakers or people with cognitive disabilities.
Bias, Risks, and Limitations
The simplifications generated by the text simplification (TS) model may contain errors. They should therefore not be shown to a potentially vulnerable target group before their quality has been manually verified and any errors have been fixed. The text simplification system could instead be provided to human translators, who might improve its output and thereby reduce the time and effort they spend on manually simplifying a text.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model. To reproduce our results, set the maximum target sequence length to 128.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("DEplain/mt5-simple-german-corpus")
model = AutoModelForSeq2SeqLM.from_pretrained("DEplain/mt5-simple-german-corpus")
# Task prefix that is prepended to every input sentence.
prefix = "Simplify to plain German: "
sent = "Ganz vorne im Gespann zieht er die anderen 13 Hunde mit, führt sie über vereiste Seen oder steile Berge und findet den Weg, wenn ihn selbst der Musher nicht mehr kennt."
# EN: "At the front of the team, he pulls the other 13 dogs along, leads them over icy lakes or steep mountains and finds the way when even the musher no longer knows it."
# Tokenize the prefixed sentence and generate with a maximum target length of 128.
inputs = tokenizer([prefix+sent], return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
# Decode the generated token IDs back into text.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
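For more than one sentence, the same steps can be wrapped in a small helper. The following is a minimal sketch, not part of the original experiments: the names `add_prefix` and `simplify_batch` are hypothetical, and truncating inputs to 128 tokens is an assumption based on the "max length: 128:128" setting listed under Training Hyperparameters. The tokenizer and model are passed in as arguments, so the objects loaded above can be reused.

```python
from typing import Iterable, List

# Prefix and target length as used in the getting-started example.
PREFIX = "Simplify to plain German: "
MAX_LENGTH = 128


def add_prefix(sentences: Iterable[str], prefix: str = PREFIX) -> List[str]:
    """Strip whitespace, drop empty sentences, and prepend the task prefix."""
    return [prefix + s.strip() for s in sentences if s.strip()]


def simplify_batch(sentences, tokenizer, model, max_length: int = MAX_LENGTH) -> List[str]:
    """Simplify several sentences with a single generate() call.

    `tokenizer` and `model` are expected to be the objects returned by
    AutoTokenizer.from_pretrained and AutoModelForSeq2SeqLM.from_pretrained.
    """
    texts = add_prefix(sentences)
    if not texts:
        return []
    # Pad to the longest sentence in the batch.
    # Assumption: inputs longer than `max_length` tokens are truncated.
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the objects from the example above, `simplify_batch([sent], tokenizer, model)` would return a list with one simplified string per non-empty input sentence.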
Training Details
Training Data
The model is fine-tuned on simple-german-corpus (Toborek et al., 2023), a dataset for training and evaluating German sentence simplification. Its simple-complex sentence pairs are automatically aligned.
Training Procedure
Training Hyperparameters
- Training regime: fp32
- epochs: 10
- model: mt5-base
- prefix: "simplify to plain German: "
- max length: 128:128
- learning rate: 0.001
- batch size: 4
- metric: SARI
- optimizer: adafactor
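The settings above could be collected in a small configuration object. This is only a sketch: the original training script is not shown here, `FinetuneConfig`, `to_training_kwargs`, and the output directory name are hypothetical, and the mapping onto keyword names of `Seq2SeqTrainingArguments` from transformers (for example, reading "batch size" as `per_device_train_batch_size`) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters as listed in this model card."""

    base_model: str = "google/mt5-base"
    prefix: str = "simplify to plain German: "
    max_source_length: int = 128
    max_target_length: int = 128
    num_train_epochs: int = 10
    learning_rate: float = 1e-3
    batch_size: int = 4
    optimizer: str = "adafactor"
    metric: str = "sari"  # listed as "metric: SARI"; its exact role is not specified

    def to_training_kwargs(self, output_dir: str = "mt5-sgc-out") -> dict:
        """Map the settings onto Seq2SeqTrainingArguments keyword names.

        Assumption: the batch size is per device and fp32 means that
        mixed precision (fp16/bf16) is switched off.
        """
        return {
            "output_dir": output_dir,  # hypothetical directory name
            "num_train_epochs": self.num_train_epochs,
            "learning_rate": self.learning_rate,
            "per_device_train_batch_size": self.batch_size,
            "optim": self.optimizer,
            "fp16": False,
            "bf16": False,
            "predict_with_generate": True,
            "generation_max_length": self.max_target_length,
        }
```

The resulting dictionary could then be passed on as `Seq2SeqTrainingArguments(**FinetuneConfig().to_training_kwargs())`.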
Evaluation
Testing Data, Factors & Metrics
Testing Data
We recommend evaluating mt5-simple-german-corpus primarily on simple-german-corpus. In our paper, however, we also evaluate on further test sets, which can be found here: https://github.com/rstodden/easse-de.
Metrics
All models are automatically evaluated against one reference and with the same evaluation metrics, i.e., SARI (Xu et al., 2016), BLEU (Papineni et al., 2002), BERTScore Precision (BS_P; Zhang et al., 2020), and Flesch Reading Ease (FRE; Amstad, 1978). Following the recommendation of Alva-Manchego et al. (2021), we use BS_P as the main evaluation metric; if the score is high, we verify it with the other metrics, i.e., SARI, BLEU, and FRE. In addition, as recommended by Tanprasert and Kauchak (2021) and Alva-Manchego et al. (2019), we report linguistic features, i.e., compression ratio and sentence splits, to gain more insight into the system-generated simplifications. We measure the metrics and features with EASSE-DE (Stodden, 2024), a multilingual adaptation of the EASSE evaluation framework.
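To make the reference-free measures more concrete, the following sketch approximates compression ratio, sentence splits, and FRE for a single sentence pair. It is not the EASSE-DE implementation and its numbers may differ from the reported ones: the function names are hypothetical, compression ratio is assumed to be the character-length ratio of output to source, sentence splits the ratio of sentence counts, sentences are split naively at ., ! and ?, and syllables are approximated by counting vowel groups. The FRE formula is Amstad's German adaptation, 180 - ASL - 58.5 * ASW, with ASL the average sentence length in words and ASW the average number of syllables per word.

```python
import re
from typing import List

_SENT_END = re.compile(r"(?<=[.!?])\s+")
_VOWEL_GROUP = re.compile(r"[aeiouyäöü]+", re.IGNORECASE)
_WORD = re.compile(r"[^\W\d_]+")  # runs of letters, including umlauts and ß


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in _SENT_END.split(text.strip()) if s]


def compression_ratio(source: str, output: str) -> float:
    """Number of characters in the output divided by those in the source."""
    return len(output) / len(source)


def sentence_split_ratio(source: str, output: str) -> float:
    """Number of output sentences divided by the number of source sentences."""
    return len(split_sentences(output)) / len(split_sentences(source))


def count_syllables(word: str) -> int:
    """Heuristic: one syllable per group of consecutive vowels, at least one."""
    return max(1, len(_VOWEL_GROUP.findall(word)))


def flesch_reading_ease_de(text: str) -> float:
    """Amstad's Flesch Reading Ease for German: 180 - ASL - 58.5 * ASW."""
    sentences = split_sentences(text)
    words = _WORD.findall(text)
    asl = len(words) / len(sentences)
    asw = sum(count_syllables(w) for w in words) / len(words)
    return 180.0 - asl - 58.5 * asw
```

Under these assumptions, a compression ratio below 1 means the output is shorter than the source, and a sentence split ratio above 1 means the output uses more sentences than the source.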
Results
Results of mt5-simple-german-corpus (mt5-SGC) and related models evaluated on simple-german-corpus. For more results on other test data, please see our paper.
Model | BLEU | SARI | BS_P | FRE | Compr. Ratio | Sent. Splits |
---|---|---|---|---|---|---|
hda_LS | 6.34 | 20.22 | 0.25 | 41.15 | 1.00 | 1.03 |
sockeye-APA-LHA | 0.33 | 35.50 | 0.13 | 63.70 | 0.80 | 0.82 |
sockeye-DEplain-APA | 1.35 | 37.86 | 0.18 | 71.05 | 0.79 | 1.01 |
mBART-DEplain-APA | 5.70 | 32.77 | 0.31 | 58.15 | 0.97 | 1.00 |
mBART-DEplain-APA+web | 6.56 | 29.80 | 0.33 | 44.95 | 1.61 | 1.09 |
mT5-DEplain-APA | 2.81 | 35.92 | 0.30 | 51.45 | 0.76 | 0.88 |
mt5-SGC | 3.30 | 43.62 | 0.37 | 58.55 | 0.61 | 0.85 |
BLOOM-zero | 3.76 | 31.95 | 0.25 | 53.55 | 0.81 | 1.07 |
BLOOM-10-random | 4.64 | 33.16 | 0.30 | 51.50 | 0.75 | 0.92 |
BLOOM-10-similarity | 13.32 | 44.66 | 0.38 | 58.65 | 0.92 | 1.13 |
custom-decoder-ats | 0.44 | 36.53 | 0.06 | 32.05 | 8.83 | 3.68 |
Identity baseline | 7.46 | 6.51 | 0.29 | 41.15 | 1.00 | 1.00 |
Reference baseline | 100.00 | 100.00 | 1.00 | 65.40 | 1.25 | 1.81 |
Truncate baseline | 4.66 | 20.12 | 0.28 | 50.50 | 0.81 | 0.87 |
Citation
BibTeX:
@inproceedings{stodden-2024-reproduction,
author = {Regina Stodden},
title = {{Reproduction \& Benchmarking of German Text Simplification Systems}},
booktitle = "Proceedings of the 1st Workshop on Evaluating Text Difficulty in a Multilingual Context (DeTermIt!)",
year = {2024 (to appear)},
address = "Turin, Italy"
}
APA:
Regina Stodden. 2024 (to appear). "Reproduction & Benchmarking of German Text Simplification Systems". In Proceedings of the 1st Workshop on Evaluating Text Difficulty in a Multilingual Context (DeTermIt!), Turin, Italy.
Model Card Contact
If you have any questions, please contact Regina Stodden (regina.stodden@hhu.de).