AndyChiang committed on
Commit
c089b52
1 Parent(s): f6c3237

Update README.md

  1. README.md +99 -0
README.md CHANGED
@@ -1,3 +1,102 @@
  ---
  license: mit
+ language: en
+ tags:
+ - bert
+ - cloze
+ - distractor
+ - generation
+ datasets:
+ - cloth
+ widget:
+ - text: "I feel [MASK] now. [SEP] happy"
+ - text: "The old man was waiting for a ride across the [MASK]. [SEP] river"
  ---
+
+ # cdgp-csg-scibert-cloth
+
+ ## Model description
+
+ This model is a Candidate Set Generator from **"CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language Model", Findings of EMNLP 2022**.
+
+ It takes a cloze stem and its answer as input and outputs a candidate set of distractors. It is fine-tuned on the [**CLOTH**](https://www.cs.cmu.edu/~glai1/data/cloth/) dataset, starting from the [**allenai/scibert_scivocab_uncased**](https://huggingface.co/allenai/scibert_scivocab_uncased) model.
+
+ For more details, see our **paper** or [**GitHub**](https://github.com/AndyChiangSH/CDGP).
+
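The widget examples in the metadata suggest that the input is the stem containing a single `[MASK]`, followed by `[SEP]` and the answer. Below is a minimal sketch of building that string. `build_csg_input` is a hypothetical helper introduced here for illustration, not part of the CDGP code, and the single-mask check is an assumption.

```python
MASK_TOKEN = "[MASK]"
SEP_TOKEN = "[SEP]"


def build_csg_input(stem: str, answer: str) -> str:
    """Join a cloze stem and its answer as '<stem> [SEP] <answer>'.

    Hypothetical helper: the format is inferred from the widget examples.
    """
    # Assumption: the stem must contain exactly one blank, written as [MASK].
    if stem.count(MASK_TOKEN) != 1:
        raise ValueError("stem must contain exactly one [MASK] token")
    return f"{stem.strip()} {SEP_TOKEN} {answer.strip()}"


print(build_csg_input("I feel [MASK] now.", "happy"))
# prints: I feel [MASK] now. [SEP] happy
```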
+ ## How to use?
+
+ 1. Download the model with Hugging Face Transformers.
+ ```python
+ from transformers import BertTokenizer, BertForMaskedLM, pipeline
+
+ tokenizer = BertTokenizer.from_pretrained("AndyChiang/cdgp-csg-scibert-cloth")
+ csg_model = BertForMaskedLM.from_pretrained("AndyChiang/cdgp-csg-scibert-cloth")
+ ```
+
+ 2. Create an unmasker.
+ ```python
+ unmasker = pipeline("fill-mask", tokenizer=tokenizer, model=csg_model, top_k=10)
+ ```
+
+ 3. Use the unmasker to generate the candidate set of distractors.
+ ```python
+ # Input format: the stem with one [MASK], then [SEP], then the answer.
+ sent = "I feel [MASK] now. [SEP] happy"
+ cs = unmasker(sent)
+ print(cs)
+ ```
+
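For a single `[MASK]`, the `fill-mask` pipeline returns a list of dicts that include `token_str` and `score`. The following is a minimal post-processing sketch, not the authors' pipeline: `clean_candidates` is a hypothetical helper, the filtering rules (drop the answer, subword pieces, and duplicates) are assumptions, and the sample predictions are made up so the block runs without downloading the model.

```python
def clean_candidates(predictions, answer, max_candidates=10):
    """Return (word, score) pairs from fill-mask output, in ranked order.

    Assumed rules: skip the answer itself, non-alphabetic tokens
    (such as '##ly' subword pieces), and case-insensitive duplicates.
    """
    seen = {answer.strip().lower()}
    candidates = []
    for pred in predictions:
        word = pred["token_str"].strip().lower()
        if not word.isalpha() or word in seen:
            continue
        seen.add(word)
        candidates.append((word, pred["score"]))
    return candidates[:max_candidates]


# Made-up predictions in the shape the fill-mask pipeline returns.
sample_predictions = [
    {"token_str": "happy", "score": 0.30},
    {"token_str": "sad", "score": 0.25},
    {"token_str": "##ly", "score": 0.10},
    {"token_str": "tired", "score": 0.08},
    {"token_str": "Sad", "score": 0.02},
]

print(clean_candidates(sample_predictions, answer="happy"))
# prints: [('sad', 0.25), ('tired', 0.08)]
```

With the real model, `cs` from step 3 can be passed in place of `sample_predictions`.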
+ ## Dataset
+
+ This model is fine-tuned on the [CLOTH](https://www.cs.cmu.edu/~glai1/data/cloth/) dataset, a collection of nearly 100,000 cloze questions from middle school and high school English exams. The dataset statistics are shown below.
+
+ | Number of questions | Train | Valid | Test  |
+ | ------------------- | ----- | ----- | ----- |
+ | Middle school       | 22056 | 3273  | 3198  |
+ | High school         | 54794 | 7794  | 8318  |
+ | Total               | 76850 | 11067 | 11516 |
+
+ You can also use our cleaned copy of the [dataset](https://github.com/AndyChiangSH/CDGP/blob/main/datasets/CLOTH.zip).
+
+ ## Training
+
+ We fine-tune the model with a method called **"Answer-Relating Fine-Tune"**. See our paper for details.
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+
+ - Pre-trained language model: [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased)
+ - Optimizer: Adam
+ - Learning rate: 0.0001
+ - Max length of input: 64
+ - Batch size: 64
+ - Epochs: 1
+ - Device: NVIDIA® Tesla T4 in Google Colab
+
+ ## Testing
+
+ The evaluation results of this model as a Candidate Set Generator in CDGP are as follows:
+
+ | P@1  | F1@3 | F1@10 | MRR   | NDCG@10 |
+ | ---- | ---- | ----- | ----- | ------- |
+ | 8.10 | 9.13 | 12.22 | 19.53 | 28.76   |
+
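To make the column names concrete, here is a sketch of the common textbook definitions of these ranking metrics with binary relevance, computed for one question. This is an assumption about how such scores are typically calculated, not the authors' evaluation script. The function names and the example words are made up, and the table values look like percentages averaged over the test set, which is also an assumption.

```python
import math


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k candidates that are gold distractors."""
    return sum(1 for w in ranked[:k] if w in gold) / k


def recall_at_k(ranked, gold, k):
    """Fraction of gold distractors found in the top-k candidates."""
    return sum(1 for w in ranked[:k] if w in gold) / len(gold)


def f1_at_k(ranked, gold, k):
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(ranked, gold, k)
    r = recall_at_k(ranked, gold, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold distractor; 0 if none is found."""
    for rank, w in enumerate(ranked, start=1):
        if w in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gold, k):
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, w in enumerate(ranked[:k], start=1)
        if w in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Made-up example: a ranked candidate set and the gold distractors.
ranked = ["sad", "tired", "angry", "fine"]
gold = {"tired", "angry", "bored"}

print(precision_at_k(ranked, gold, 1))
print(f1_at_k(ranked, gold, 3))
print(reciprocal_rank(ranked, gold))
print(ndcg_at_k(ranked, gold, 10))
```

MRR is the mean of `reciprocal_rank` over all questions; the other metrics are averaged the same way.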
+ ## Other models
+
+ ### Candidate Set Generator
+
+ | Models | CLOTH | DGen |
+ | ----------- | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
+ | **BERT** | [cdgp-csg-bert-cloth](https://huggingface.co/AndyChiang/cdgp-csg-bert-cloth) | [cdgp-csg-bert-dgen](https://huggingface.co/AndyChiang/cdgp-csg-bert-dgen) |
+ | **SciBERT** | [*cdgp-csg-scibert-cloth*](https://huggingface.co/AndyChiang/cdgp-csg-scibert-cloth) | [cdgp-csg-scibert-dgen](https://huggingface.co/AndyChiang/cdgp-csg-scibert-dgen) |
+ | **RoBERTa** | [cdgp-csg-roberta-cloth](https://huggingface.co/AndyChiang/cdgp-csg-roberta-cloth) | [cdgp-csg-roberta-dgen](https://huggingface.co/AndyChiang/cdgp-csg-roberta-dgen) |
+ | **BART** | [cdgp-csg-bart-cloth](https://huggingface.co/AndyChiang/cdgp-csg-bart-cloth) | [cdgp-csg-bart-dgen](https://huggingface.co/AndyChiang/cdgp-csg-bart-dgen) |
+
+ ### Distractor Selector
+
+ **fastText**: [cdgp-ds-fasttext](https://huggingface.co/AndyChiang/cdgp-ds-fasttext)
+
+
+ ## Citation
+
+ None