janpase97 committed on
Commit
abff794
1 Parent(s): 486d8d7

Update README.md

  1. README.md +50 -0
README.md CHANGED
@@ -1,3 +1,53 @@
1
  ---
2
  license: cc-by-nc-sa-4.0
3
  ---
4
+
5
+ # MQDD - Multimodal Question Duplicity Detection
6
+
7
+ This repository publishes trained models and other supporting materials for the paper
8
+ [MQDD – Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain](https://arxiv.org/abs/2203.14093). For more information, see the paper.
9
+ The Stack Overflow Datasets (SOD) and Stack Overflow Duplicity Dataset (SODD) presented in the paper can be obtained from our [Stack Overflow Dataset repository](https://github.com/kiv-air/StackOverflowDataset).
10
+
11
+ To acquire only the pre-trained model, see [UWB-AIR/MQDD-pretrained](https://huggingface.co/UWB-AIR/MQDD-pretrained).
12
+
13
+ ## Fine-tuned MQDD
14
+
15
+ We release a fine-tuned version of our MQDD model for the duplicate detection task. The model follows a two-tower architecture, as depicted in the figure below:
16
+
17
+ <img src="img/architecture.png" width="700">
18
+
19
+ A standalone encoder without the duplicate detection head can be loaded using the following code snippet. Such a model can be used to build search systems based, for example, on the [Faiss](https://github.com/facebookresearch/faiss) library.
20
+
21
+ ```Python
22
+ from transformers import AutoTokenizer, AutoModel
23
+
24
+ tokenizer = AutoTokenizer.from_pretrained("UWB-AIR/MQDD-duplicates")
25
+ model = AutoModel.from_pretrained("UWB-AIR/MQDD-duplicates")
26
+ ```
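The search use case mentioned above could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the helper names `embed`, `cosine`, and `top_k` are hypothetical, taking the first token's hidden state as the question embedding is an assumption (the paper may pool differently), and a brute-force cosine search stands in for a Faiss index to keep the example dependency-free.

```python
import math


def embed(texts, model_name="UWB-AIR/MQDD-duplicates"):
    """Encode texts into vectors with the released encoder.

    Assumption: the hidden state of the first token serves as the
    question embedding; the paper may use a different pooling.
    """
    # Imported lazily so the search helpers below work without these packages.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return output.last_hidden_state[:, 0].tolist()


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, corpus, k=3):
    """Return (index, score) pairs of the k corpus vectors most similar to query."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

With the model downloaded, a call such as `top_k(embed([new_question])[0], embed(known_questions))` would rank stored questions by similarity to a new one. For large collections, the brute-force loop would be replaced by a proper vector index such as Faiss.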
27
+
28
+ A checkpoint of the full two-tower model can then be obtained from our [Google Drive folder](https://drive.google.com/drive/folders/1CYiqF2GJ2fSQzx_oM4-X_IhpObi4af5Q?usp=sharing). To load the model, use the implementation from `models/MQDD_model.py` in our [GitHub repository](https://github.com/kiv-air/MQDD). To construct the model and load its checkpoint, use the following code:
29
+
30
+ ```Python
31
+ import torch
+ from MQDD_model import ClsHeadModelMQDD
32
+
33
+ model = ClsHeadModelMQDD("UWB-AIR/MQDD-duplicates")
34
+ ckpt = torch.load("model.pt", map_location="cpu")
35
+ model.load_state_dict(ckpt["model_state"])
36
+ ```
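When the checkpoint file is incomplete or comes from a different source, the loading snippet above fails with a bare `KeyError`. A slightly more defensive variant might look like the sketch below. The helpers `extract_state_dict` and `load_checkpoint` are hypothetical names introduced here; only the `"model_state"` key and the `torch.load` / `load_state_dict` calls come from the snippet above, and it is assumed that the model is a regular `torch.nn.Module`.

```python
def extract_state_dict(ckpt, key="model_state"):
    """Return the weights mapping stored under `key` in a loaded checkpoint."""
    if not isinstance(ckpt, dict):
        raise TypeError(f"expected a dict checkpoint, got {type(ckpt).__name__}")
    if key not in ckpt:
        raise KeyError(f"checkpoint has no '{key}' entry; found keys: {list(ckpt)}")
    return ckpt[key]


def load_checkpoint(model, path, key="model_state"):
    """Load weights from `path` into `model` and switch it to inference mode."""
    import torch  # imported lazily; only needed when a checkpoint is actually read

    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(extract_state_dict(ckpt, key))
    model.eval()  # disable dropout for inference
    return model
```

Usage would then be `model = load_checkpoint(ClsHeadModelMQDD("UWB-AIR/MQDD-duplicates"), "model.pt")`, where `model.pt` is the file downloaded from the Google Drive folder.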
37
+
38
+ ## Licence
39
+ This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/
40
+
41
+ ## How should I cite MQDD?
42
+ For now, please cite [the arXiv paper](https://arxiv.org/abs/2203.14093):
43
+ ```
44
+ @misc{https://doi.org/10.48550/arxiv.2203.14093,
45
+ doi = {10.48550/ARXIV.2203.14093},
46
+ url = {https://arxiv.org/abs/2203.14093},
47
+ author = {Pašek, Jan and Sido, Jakub and Konopík, Miloslav and Pražák, Ondřej},
48
+ title = {MQDD -- Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain},
49
+ publisher = {arXiv},
50
+ year = {2022},
51
+ copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
52
+ }
53
+ ```