laihuiyuan committed on
Commit
04d186b
1 Parent(s): 336314e

Create README.md

Files changed (1)
  1. README.md +52 -0
README.md ADDED
@@ -0,0 +1,52 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ ---
+
+ # Paper
+ This is an mBART-based model that can be used for both multilingual DRS parsing and DRS-to-text generation, covering four languages (English: EN, German: DE, Italian: IT, Dutch: NL).
+ It is introduced in the paper [Pre-Trained Language-Meaning Models for Multilingual Parsing and Generation](https://arxiv.org/abs/2306.00124).
+
+
+ # Abstract
+ Pre-trained language models (PLMs) have achieved great success in NLP and have recently been used for tasks in computational semantics. However, these tasks do not fully benefit from PLMs since meaning representations are not explicitly included in the pre-training stage. We introduce multilingual pre-trained language-meaning models based on Discourse Representation Structures (DRSs), including meaning representations besides natural language texts in the same model, and design a new strategy to reduce the gap between the pre-training and fine-tuning objectives. Since DRSs are language neutral, cross-lingual transfer learning is adopted to further improve the performance of non-English tasks. Automatic evaluation results show that our approach achieves the best performance on both the multilingual DRS parsing and DRS-to-text generation tasks. Correlation analysis between automatic metrics and human judgements on the generation task further validates the effectiveness of our model. Human inspection reveals that out-of-vocabulary tokens are the main cause of erroneous results.
+
+ # How to use
+ ```bash
+ git clone https://github.com/wangchunliu/DRS-pretrained-LMM.git
+ cd DRS-pretrained-LMM
+ ```
+
+ ```python
+ # Example: DRS-to-text generation
+ # tokenization_mlm is expected to come from the repository cloned above
+ from tokenization_mlm import MLMTokenizer
+ from transformers import MBartForConditionalGeneration
+
+ # DRS-to-text: the input is a DRS, so src_lang is '<drs>'.
+ # For DRS parsing, set src_lang to en_XX, de_DE, it_IT, or nl_XX instead.
+ tokenizer = MLMTokenizer.from_pretrained('laihuiyuan/DRS-LMM', src_lang='<drs>')
+ model = MBartForConditionalGeneration.from_pretrained('laihuiyuan/DRS-LMM')
+
+ # gold text: The court is adjourned until 3:00 p.m. on March 1st.
+ inp_ids = tokenizer.encode(
+     "court.n.01 time.n.08 EQU now adjourn.v.01 Theme -2 Time -1 Finish +1 time.n.08 ClockTime 15:00 MonthOfYear 3 DayOfMonth 1",
+     return_tensors="pt")
+
+ # Force the first generated token to be the target language code.
+ # For DRS parsing, the forced BOS token should be <drs> instead.
+ forced_ids = tokenizer.encode("en_XX", add_special_tokens=False, return_tensors="pt")
+ outs = model.generate(input_ids=inp_ids, forced_bos_token_id=forced_ids.item(), num_beams=5, max_length=150)
+ text = tokenizer.decode(outs[0].tolist(), skip_special_tokens=True, clean_up_tokenization_spaces=False)
+ ```
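The comments in the example above describe how the two task directions differ: for DRS-to-text generation the tokenizer's `src_lang` is `<drs>` and the forced BOS token is a language code, while for DRS parsing the two roles swap. The helper below is a minimal sketch of that rule in plain Python. The function name `direction_settings` and the task labels `"parse"` and `"generate"` are hypothetical; they are not part of the repository or of `transformers`.

```python
from typing import Tuple

# Hypothetical helper (not part of DRS-pretrained-LMM or transformers):
# it only encodes the src_lang / forced-BOS rule stated in the README comments.
DRS_TOKEN = "<drs>"
LANG_CODES = ("en_XX", "de_DE", "it_IT", "nl_XX")  # codes listed in the README


def direction_settings(task: str, lang: str) -> Tuple[str, str]:
    """Return (src_lang, forced_bos_token) for one task direction.

    task: "parse" (text -> DRS) or "generate" (DRS -> text); labels are assumptions.
    lang: language code of the natural-language side, one of LANG_CODES.
    """
    if lang not in LANG_CODES:
        raise ValueError(f"unsupported language code: {lang!r}")
    if task == "parse":
        # Source is natural language, target is a DRS.
        return lang, DRS_TOKEN
    if task == "generate":
        # Source is a DRS, target is natural language.
        return DRS_TOKEN, lang
    raise ValueError(f"unknown task: {task!r}")


src_lang, forced_bos = direction_settings("generate", "en_XX")
print(src_lang, forced_bos)  # prints: <drs> en_XX
```

Under these assumptions, the first value would be passed as `src_lang` to `MLMTokenizer.from_pretrained`, and the second would be encoded with `tokenizer.encode(..., add_special_tokens=False)` to obtain the `forced_bos_token_id`, as the example above does for `en_XX`.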
+
+ # Citation Info
+ ```bibtex
+ @inproceedings{wang-etal-2023-pre,
+ title = "Pre-Trained Language-Meaning Models for Multilingual Parsing and Generation",
+ author = "Wang, Chunliu and Lai, Huiyuan and Nissim, Malvina and Bos, Johan",
+ booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
+ month = jul,
+ year = "2023",
+ address = "Toronto, Canada",
+ publisher = "Association for Computational Linguistics",
+ }
+ ```