AndyChiang committed on
Commit
55bc6a0
·
1 Parent(s): 47d74f4

Update README.md

  1. README.md +97 -0
README.md CHANGED
@@ -1,3 +1,100 @@
  ---
  license: mit
+ language: en
+ tags:
+ - bert
+ - cloze
+ - distractor
+ - generation
+ datasets:
+ - dgen
+ widget:
+ - text: "The only known planet with large amounts of water is [MASK]. [SEP] earth"
+ - text: "The products of photosynthesis are glucose and [MASK] else. [SEP] oxygen"
  ---
+
+ # cdgp-csg-scibert-dgen
+
+ ## Model description
+
+ This model is a Candidate Set Generator from **"CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language Model", Findings of EMNLP 2022**.
+
+ Its input is a cloze stem and its answer, and its output is a candidate set of distractors. It was fine-tuned on the [**DGen**](https://github.com/DRSY/DGen) dataset, starting from the [**allenai/scibert_scivocab_uncased**](https://huggingface.co/allenai/scibert_scivocab_uncased) model.
+
+ For more details, see our **paper** or [**GitHub**](https://github.com/AndyChiangSH/CDGP).
+
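The widget examples in the front matter suggest the input layout: the stem with its blank replaced by `[MASK]`, then `[SEP]`, then the answer. A minimal sketch of building such a string follows. The helper name `build_csg_input` and the `_` blank marker are hypothetical and not part of the CDGP code.

```python
def build_csg_input(stem, answer, blank="_", mask_token="[MASK]", sep_token="[SEP]"):
    """Format a cloze stem and its answer as '<stem with [MASK]> [SEP] <answer>'.

    `blank` is an assumed placeholder marking the gap in the stem;
    adapt it to however your own data marks the blank.
    """
    if stem.count(blank) != 1:
        raise ValueError("stem must contain exactly one blank marker")
    return f"{stem.replace(blank, mask_token)} {sep_token} {answer}"


sent = build_csg_input("The only known planet with large amounts of water is _.", "earth")
print(sent)
```

The resulting string has the same shape as the first widget example and can be passed to a fill-mask pipeline unchanged.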
+ ## How to use?
+
+ 1. Download the model with Hugging Face Transformers.
+ ```python
+ from transformers import BertTokenizer, BertForMaskedLM, pipeline
+
+ # Load the fine-tuned tokenizer and masked-language model from the Hub
+ tokenizer = BertTokenizer.from_pretrained("AndyChiang/cdgp-csg-scibert-dgen")
+ csg_model = BertForMaskedLM.from_pretrained("AndyChiang/cdgp-csg-scibert-dgen")
+ ```
+
+ 2. Create an unmasker.
+ ```python
+ # top_k=10 returns the ten highest-scoring fills for the [MASK] position
+ unmasker = pipeline("fill-mask", tokenizer=tokenizer, model=csg_model, top_k=10)
+ ```
+
+ 3. Use the unmasker to generate the candidate set of distractors.
+ ```python
+ # Input format: stem containing [MASK], then [SEP], then the answer
+ sent = "The only known planet with large amounts of water is [MASK]. [SEP] earth"
+ cs = unmasker(sent)
+ print(cs)
+ ```
+
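The fill-mask pipeline returns a list of dicts with the keys `score`, `token`, `token_str` and `sequence`. A rough sketch of turning that list into a plain candidate set is shown here. The helper `to_candidate_set` is hypothetical, dropping the answer itself is our assumption and not a documented CDGP step, and the sample values are made up so the snippet runs without downloading the model.

```python
def to_candidate_set(predictions, answer):
    """Keep (token, score) pairs from fill-mask output, skipping the answer."""
    target = answer.strip().lower()
    candidates = []
    for pred in predictions:
        token = pred["token_str"].strip().lower()
        if token == target:
            continue  # the correct answer cannot serve as a distractor
        candidates.append((token, pred["score"]))
    return candidates


# Made-up stand-in for `unmasker(sent)`; real scores and token ids will differ.
sample_output = [
    {"score": 0.31, "token": 1, "token_str": "earth", "sequence": "..."},
    {"score": 0.22, "token": 2, "token_str": "mars", "sequence": "..."},
    {"score": 0.10, "token": 3, "token_str": "venus", "sequence": "..."},
]

print(to_candidate_set(sample_output, "earth"))
```

With the real model, pass `cs` from step 3 in place of `sample_output`.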
+ ## Dataset
+
+ This model is fine-tuned on the [DGen](https://github.com/DRSY/DGen) dataset, which covers multiple domains including science, vocabulary, common sense and trivia. It is compiled from a wide variety of datasets including SciQ, MCQL, AI2 Science Questions, etc. Details of the DGen dataset are shown below.
+
+ | DGen dataset            | Train | Valid | Test | Total |
+ | ----------------------- | ----- | ----- | ---- | ----- |
+ | **Number of questions** | 2321  | 300   | 259  | 2880  |
+
+ You can also use the [dataset](https://github.com/AndyChiangSH/CDGP/blob/main/datasets/DGen.zip) we have already cleaned.
+
+ ## Training
+
+ We fine-tune the model with a special method called **"Answer-Relating Fine-Tune"**. More details are in our paper.
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+
+ - Pre-trained language model: [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased)
+ - Optimizer: Adam
+ - Learning rate: 0.0001
+ - Max length of input: 64
+ - Batch size: 64
+ - Epochs: 1
+ - Device: NVIDIA® Tesla T4 in Google Colab
+
+ ## Testing
+
+ The evaluation results of this model as a Candidate Set Generator in CDGP are as follows:
+
+ | P@1   | F1@3  | MRR   | NDCG@10 |
+ | ----- | ----- | ----- | ------- |
+ | 13.13 | 12.23 | 25.12 | 34.17   |
+
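For readers unfamiliar with the metrics in the table, here is a sketch of how they are commonly computed for a single question, using binary relevance (a candidate either is a gold distractor or is not). These are the textbook definitions and may differ in detail from the evaluation script used in the paper. The function names and the example data are hypothetical.

```python
import math


def precision_at_k(ranked, gold, k):
    """Share of the top-k candidates that are gold distractors."""
    return sum(1 for c in ranked[:k] if c in gold) / k


def f1_at_k(ranked, gold, k):
    """Harmonic mean of precision@k and recall@k."""
    hits = sum(1 for c in ranked[:k] if c in gold)
    if hits == 0 or not gold:
        return 0.0
    p, r = hits / k, hits / len(gold)
    return 2 * p * r / (p + r)


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold distractor, or 0 if none is found."""
    for rank, c in enumerate(ranked, start=1):
        if c in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gold, k):
    """DCG of the top-k list divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2) for i, c in enumerate(ranked[:k]) if c in gold)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up example: model ranking vs. gold distractors for one question.
ranked = ["mars", "venus", "jupiter", "moon"]
gold = {"venus", "mars", "saturn"}

print(precision_at_k(ranked, gold, 1))
print(f1_at_k(ranked, gold, 3))
print(reciprocal_rank(ranked, gold))
print(ndcg_at_k(ranked, gold, 10))
```

Dataset-level scores are then typically the mean of these per-question values; MRR is the mean of the reciprocal ranks.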
+ ## Other models
+
+ ### Candidate Set Generator
+
+ | Models      | CLOTH                                                                               | DGen                                                                               |
+ | ----------- | ----------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
+ | **BERT**    | [cdgp-csg-bert-cloth](https://huggingface.co/AndyChiang/cdgp-csg-bert-cloth)       | [cdgp-csg-bert-dgen](https://huggingface.co/AndyChiang/cdgp-csg-bert-dgen)         |
+ | **SciBERT** | [cdgp-csg-scibert-cloth](https://huggingface.co/AndyChiang/cdgp-csg-scibert-cloth) | [*cdgp-csg-scibert-dgen*](https://huggingface.co/AndyChiang/cdgp-csg-scibert-dgen) |
+ | **RoBERTa** | [cdgp-csg-roberta-cloth](https://huggingface.co/AndyChiang/cdgp-csg-roberta-cloth) | [cdgp-csg-roberta-dgen](https://huggingface.co/AndyChiang/cdgp-csg-roberta-dgen)   |
+ | **BART**    | [cdgp-csg-bart-cloth](https://huggingface.co/AndyChiang/cdgp-csg-bart-cloth)       | [cdgp-csg-bart-dgen](https://huggingface.co/AndyChiang/cdgp-csg-bart-dgen)         |
+
+ ### Distractor Selector
+
+ **fastText**: [cdgp-ds-fasttext](https://huggingface.co/AndyChiang/cdgp-ds-fasttext)
+
+
+ ## Citation
+
+ None