|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- squad |
|
language: |
|
- en |
|
--- |
|
|
|
|
|
# Question Generator |
|
|
|
This model should be used to generate questions based on a given string. |
|
|
|
### Out-of-Scope Use |
|
|
|
English language support only. |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python |
|
import torch |
|
from transformers import T5ForConditionalGeneration, T5Tokenizer |
|
|
|
def question_parser(question: str) -> str: |
|
return " ".join(question.split(":")[1].split()) |
|
|
|
def generate_questions_v2(context: str, answer: str, n_questions: int = 1): |
|
model = T5ForConditionalGeneration.from_pretrained( |
|
"pipesanma/chasquilla-question-generator" |
|
) |
|
tokenizer = T5Tokenizer.from_pretrained("pipesanma/chasquilla-question-generator") |
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
model = model.to(device) |
|
text = "context: " + context + " " + "answer: " + answer + " </s>" |
|
|
|
encoding = tokenizer.encode_plus( |
|
text, max_length=512, padding=True, return_tensors="pt" |
|
) |
|
input_ids, attention_mask = encoding["input_ids"].to(device), encoding[ |
|
"attention_mask" |
|
].to(device) |
|
|
|
model.eval() |
|
beam_outputs = model.generate( |
|
input_ids=input_ids, |
|
attention_mask=attention_mask, |
|
max_length=72, |
|
early_stopping=True, |
|
num_beams=5, |
|
num_return_sequences=n_questions, |
|
) |
|
|
|
questions = [] |
|
|
|
for beam_output in beam_outputs: |
|
sent = tokenizer.decode( |
|
beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True |
|
) |
|
print(sent) |
|
questions.append(question_parser(sent)) |
|
|
|
return questions |
|
|
|
|
|
context = "President Donald Trump said and predicted that some states would reopen this month." |
|
answer = "Donald Trump" |
|
|
|
questions = generate_questions_v2(context, answer, 1) |
|
print(questions) |
|
``` |
|
|
|
## Training Details |
|
|
|
### Dataset generation |
|
|
|
The dataset is "squad" from datasets library. |
|
|
|
Check the [utils/dataset_gen.py](utils/dataset_gen.py) file for the dataset generation. |
|
|
|
### Training model |
|
|
|
Check the [utils/t5_train_model.py](utils/t5_train_model.py) file for the training process |
|
|
|
### Model and Tokenizer versions |
|
|
|
(v1.0) Model and Tokenizer V1: trained with 1000 rows |
|
|
|
|
|
(v1.1) Model and Tokenizer V2: trained with 3000 rows |
|
|
|
(v1.2) Model and Tokenizer V3: trained with all rows from datasets (78664 rows-train, 9652 rows-validation) |