---
license: apache-2.0
datasets:
- squad
language:
- en
---


# Question Generator

This model generates questions from a given context string and a target answer.

### Out-of-Scope Use

The model is trained on English data only; use with other languages is out of scope.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

def question_parser(question: str) -> str:
    # Model outputs have the form "question: ..."; strip the prefix
    # and normalize whitespace.
    return " ".join(question.split(":")[1].split())

def generate_questions_v2(context: str, answer: str, n_questions: int = 1):
    model = T5ForConditionalGeneration.from_pretrained(
        "pipesanma/chasquilla-question-generator"
    )
    tokenizer = T5Tokenizer.from_pretrained("pipesanma/chasquilla-question-generator")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    text = "context: " + context + " " + "answer: " + answer + " </s>"

    encoding = tokenizer.encode_plus(
        text, max_length=512, padding=True, truncation=True, return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    model.eval()
    beam_outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=72,
        early_stopping=True,
        num_beams=5,
        num_return_sequences=n_questions,
    )

    questions = []

    for beam_output in beam_outputs:
        sent = tokenizer.decode(
            beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        print(sent)
        questions.append(question_parser(sent))

    return questions


context = "President Donald Trump predicted that some states would reopen this month."
answer = "Donald Trump"

questions = generate_questions_v2(context, answer, 1)
print(questions)
```
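Note that `question_parser` above splits on every colon, so a generated question that itself contains a colon would be truncated. A safer variant (a hypothetical helper, not part of this repository) splits only on the first colon:

```python
def parse_question(raw: str) -> str:
    """Strip the leading "question:" prefix, keeping any later colons."""
    _prefix, _sep, body = raw.partition(":")
    return " ".join(body.split())

print(parse_question("question: What time is it: now or later?"))
# -> "What time is it: now or later?"
```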

## Training Details

### Dataset generation

The dataset is "squad" from the Hugging Face `datasets` library.

Check the [utils/dataset_gen.py](utils/dataset_gen.py) file for the dataset generation.
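
The exact preprocessing lives in that script. As a rough sketch (assuming the standard SQuAD schema; the function name is illustrative, not the actual implementation), each row is turned into a source/target pair matching the `context: ... answer: ... </s>` format used at inference time:

```python
def squad_row_to_pair(row: dict) -> tuple[str, str]:
    """Build one (source, target) training pair from a SQuAD record."""
    answer = row["answers"]["text"][0]  # SQuAD stores answers as lists
    source = f"context: {row['context']} answer: {answer} </s>"
    target = f"question: {row['question']} </s>"
    return source, target

row = {
    "context": "The Eiffel Tower is located in Paris.",
    "question": "Where is the Eiffel Tower located?",
    "answers": {"text": ["Paris"], "answer_start": [31]},
}
print(squad_row_to_pair(row))
```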

### Training model

Check the [utils/t5_train_model.py](utils/t5_train_model.py) file for the training process.

### Model and Tokenizer versions

- (v1.0) Model and Tokenizer V1: trained with 1,000 rows
- (v1.1) Model and Tokenizer V2: trained with 3,000 rows
- (v1.2) Model and Tokenizer V3: trained with all rows from the dataset (78,664 train rows, 9,652 validation rows)