---
language:
- de
- en
pipeline_tag: feature-extraction
tags:
- semantic textual similarity
- sts
- semantic search
- sentence similarity
- paraphrasing
- documents retrieval
- passage retrieval
- information retrieval
- sentence-transformer
- feature-extraction
- transformers
task_categories:
- sentence-similarity
- feature-extraction
- text-retrieval
- other
library_name: sentence-transformers
---

# Model card for PM-AI/paraphrase-distilroberta-base-v2_de-en
For internal purposes and testing, we made a monolingual paraphrasing model from Sentence Transformers usable for _German + English_ via [Knowledge Distillation](https://arxiv.org/abs/2004.09813).
We chose [sentence-transformers/paraphrase-distilroberta-base-v2](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v2) because, to our knowledge, there is no publicly available multilingual version of this model.
In addition, it was trained on significantly more samples than its predecessor: 83.3 million instead of 24.6 million.
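
The resulting model can be used like any other Sentence Transformers model; German and English sentences share one embedding space. A minimal usage sketch (the example sentences are our own):

```python
from sentence_transformers import SentenceTransformer, util

# Load the distilled bilingual model from the Hugging Face Hub.
model = SentenceTransformer("PM-AI/paraphrase-distilroberta-base-v2_de-en")

# A German/English paraphrase pair plus one unrelated sentence.
sentences = [
    "Das ist ein Beispielsatz.",
    "This is an example sentence.",
    "The weather is nice today.",
]
embeddings = model.encode(sentences)

# Pairwise cosine similarities: the cross-lingual paraphrase pair
# should score clearly higher than the unrelated third sentence.
print(util.cos_sim(embeddings, embeddings))
```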

## Training
1) Download of the datasets
2) Execution of knowledge distillation

### Training Data
Datasets used, as listed in the [official source](https://www.sbert.net/examples/training/paraphrases/README.html):
- _AllNLI_
- _sentence-compression_
- _SimpleWiki_
- _altlex_
- _msmarco-triplets_
- _quora_duplicates_
- _coco_captions_
- _flickr30k_captions_
- _yahoo_answers_title_question_
- _S2ORC_citation_pairs_
- _stackexchange_duplicate_questions_
- _wiki-atomic-edits_

### Training Execution

First, we downloaded several German-English parallel datasets via [get_parallel_data_*.py](https://github.com/UKPLab/sentence-transformers/tree/b86eec31cf0a102ad786ba1ff31bfeb4998d3ca5/examples/training/multilingual).

These datasets are: _Tatoeba_, _WikiMatrix_, _TED2020_, _OpenSubtitles_, _Europarl_ and _News-Commentary_.
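
Each of these scripts writes gzipped tab-separated files with one sentence pair per line, which is the layout the distillation script consumes. A small reading sketch (the file name is a placeholder, and the exact column layout is our assumption based on the scripts linked above):

```python
import gzip

# Placeholder file name; the get_parallel_data_*.py scripts emit
# gzipped TSV files of the form "source<TAB>translation", e.g.:
#   This is an example sentence.\tDies ist ein Beispielsatz.
with gzip.open("parallel-sentences/TED2020-en-de-train.tsv.gz", "rt", encoding="utf8") as fin:
    for line in fin:
        en_sentence, de_sentence = line.rstrip("\n").split("\t", maxsplit=1)
```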

Then we ran knowledge distillation with [make_multilingual_sys.py](https://github.com/UKPLab/sentence-transformers/blob/b86eec31cf0a102ad786ba1ff31bfeb4998d3ca5/examples/training/multilingual/make_multilingual_sys.py); a condensed sketch of this setup follows the parameter list below.

#### Parameterization of training
- **Script:** [make_multilingual_sys.py](https://github.com/UKPLab/sentence-transformers/blob/b86eec31cf0a102ad786ba1ff31bfeb4998d3ca5/examples/training/multilingual/make_multilingual_sys.py)
- **Datasets:** Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary
- **GPU:** NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7)
- **Batch Size:** 64
- **Max Sequence Length:** 256
- **Train Max Sentence Length:** 600
- **Max Sentences Per Train File:** 1000000
- **Teacher Model:** [sentence-transformers/paraphrase-distilroberta-base-v2](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v2)
- **Student Model:** [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
- **Loss Function:** MSE Loss
- **Learning Rate:** 2e-5
- **Epochs:** 20
- **Evaluation Steps:** 10000
- **Warmup Steps:** 10000
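
For illustration, here is a condensed sketch of this distillation setup with the parameters above, using the sentence-transformers API; the data file name is a placeholder, and the full logic (multiple data files, evaluators, checkpointing) lives in [make_multilingual_sys.py](https://github.com/UKPLab/sentence-transformers/blob/b86eec31cf0a102ad786ba1ff31bfeb4998d3ca5/examples/training/multilingual/make_multilingual_sys.py):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the monolingual English paraphrase model being distilled.
teacher = SentenceTransformer("sentence-transformers/paraphrase-distilroberta-base-v2")

# Student: multilingual XLM-R with mean pooling, trained to mimic the teacher.
word_embedding = models.Transformer("xlm-roberta-base", max_seq_length=256)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
student = SentenceTransformer(modules=[word_embedding, pooling])

# Parallel data: the student embeds both the English source and the German
# translation and regresses onto the teacher's English embedding.
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data(
    "parallel-sentences/TED2020-en-de-train.tsv.gz",  # placeholder file name
    max_sentences=1000000,
    max_sentence_length=600,
)

train_loader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)  # MSE between student and teacher embeddings

student.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=20,
    warmup_steps=10000,
    evaluation_steps=10000,
    optimizer_params={"lr": 2e-5},
)
```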

### Acknowledgment

This work is a collaboration between the [Technical University of Applied Sciences Wildau (TH Wildau)](https://en.th-wildau.de/) and [sense.ai.tion GmbH](https://senseaition.com/).
You can contact us via:
* [Philipp Müller (M.Eng.)](https://www.linkedin.com/in/herrphilipps); Author
* [Prof. Dr. Janett Mohnke](mailto:icampus@th-wildau.de); TH Wildau
* [Dr. Matthias Boldt, Jörg Oehmichen](mailto:info@senseaition.com); sense.ai.tion GmbH

This work was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg, as part of the project "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege" (natural-language dialogue assistants in nursing care).

<div style="display:flex">
  <div style="padding-left:20px;">
    <a href="https://efre.brandenburg.de/efre/de/"><img src="https://huggingface.co/datasets/PM-AI/germandpr-beir/resolve/main/res/EFRE-Logo_rechts_oweb_en_rgb.jpeg" alt="Logo of European Regional Development Fund (EFRE)" width="200"/></a>
  </div>
  <div style="padding-left:20px;">
    <a href="https://www.senseaition.com"><img src="https://senseaition.com/wp-content/uploads/thegem-logos/logo_c847aaa8f42141c4055d4a8665eb208d_3x.png" alt="Logo of sense.ai.tion GmbH" width="200"/></a>
  </div>
  <div style="padding-left:20px;">
    <a href="https://www.th-wildau.de"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f6/TH_Wildau_Logo.png/640px-TH_Wildau_Logo.png" alt="Logo of TH Wildau" width="180"/></a>
  </div>
</div>