---
license: mit
language:
- de
---
# German text simplification with custom decoder
This model was initialized from an mBART model, and its decoder was replaced by a GPT-2 language model pre-trained on German Easy Language. For more details, visit our [GitHub repository](https://github.com/MiriUll/Language-Models-German-Simplification).

## Usage
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("josh-oo/custom-decoder-ats")

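# Variant 1 (gerpt2 decoder): uncomment these two lines to use it instead of variant 2.
#model = AutoModelForSeq2SeqLM.from_pretrained("josh-oo/custom-decoder-ats", trust_remote_code=True, revision="35197269f0235992fcc6b8363ca4f48558b624ff")
#decoder_tokenizer = AutoTokenizer.from_pretrained("josh-oo/gerpt2")

# Variant 2 (dbmdz decoder, active here).
# trust_remote_code=True lets transformers run the custom model code from the repository.
model = AutoModelForSeq2SeqLM.from_pretrained("josh-oo/custom-decoder-ats", trust_remote_code=True, revision="4accedbe0b57d342d95ff546b6bbd3321451d504")
decoder_tokenizer = AutoTokenizer.from_pretrained("josh-oo/german-gpt2-easy")
# Add the three extra tokens to the decoder tokenizer's vocabulary.
decoder_tokenizer.add_tokens(['<</s>>','<<s>>','<<pad>>'])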

example_text = "In tausenden Schweizer Privathaushalten kümmern sich Haushaltsangestellte um die Wäsche, betreuen die Kinder und sorgen für Sauberkeit. Durchschnittlich bekommen sie für die Arbeit rund 30 Franken pro Stunde Bruttolohn. Der grösste Teil von ihnen erhält aber 28 Franken."

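# Run on the GPU when one is available, otherwise on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Tokenize the input, padding it to a multiple of 1024 tokens.
test_input = tokenizer([example_text], return_tensors="pt", padding=True, pad_to_multiple_of=1024)
for key, value in test_input.items():
    test_input[key] = value.to(device)

# Generate with beam search, then decode with the decoder's tokenizer (not the encoder's).
outputs = model.generate(**test_input, num_beams=3, max_length=1024)
print(decoder_tokenizer.batch_decode(outputs))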
```
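
To simplify several texts at once, the steps above can be wrapped in a small function. This is only a minimal sketch: `simplify` is a hypothetical helper, not part of this repository or of `transformers`, and it assumes that the `model`, `tokenizer`, and `decoder_tokenizer` objects were loaded as shown above and that the same generation settings suit every input.

```python
from typing import Any, List, Sequence


def simplify(
    texts: Sequence[str],
    model: Any,
    tokenizer: Any,
    decoder_tokenizer: Any,
    device: Any = "cpu",
    num_beams: int = 3,
    max_length: int = 1024,
) -> List[str]:
    """Hypothetical helper: simplify a batch of German texts in one call."""
    # Encode with the encoder-side tokenizer, padding to a multiple of 1024
    # tokens exactly as in the usage example.
    inputs = tokenizer(
        list(texts), return_tensors="pt", padding=True, pad_to_multiple_of=1024
    )
    # Move every input tensor to the device the model lives on.
    inputs = {key: value.to(device) for key, value in inputs.items()}
    # Beam search with the same defaults as the usage example.
    outputs = model.generate(**inputs, num_beams=num_beams, max_length=max_length)
    # Decode with the decoder's tokenizer, since the decoder has its own vocabulary.
    return decoder_tokenizer.batch_decode(outputs)
```

With the objects from the usage example, a call would look like `simplify([example_text], model, tokenizer, decoder_tokenizer, device)`. Passing the model and tokenizers as arguments keeps the helper independent of which of the two revisions is loaded.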

## Citation
If you use our model, please cite:   
@inproceedings{anschutz-etal-2023-language,  
&emsp;  title = "Language Models for {G}erman Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training",  
&emsp;  author = {Ansch{\"u}tz, Miriam  and Oehms, Joshua  and Wimmer, Thomas  and Jezierski, Bart{\l}omiej  and Groh, Georg},  
&emsp;  booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",  
&emsp;  month = jul,  
&emsp;  year = "2023",  
&emsp;  address = "Toronto, Canada",  
&emsp;  publisher = "Association for Computational Linguistics",  
&emsp;  url = "https://aclanthology.org/2023.findings-acl.74",  
&emsp;  pages = "1147--1158",  
}