---
license: apache-2.0
datasets:
- squad
language:
- en
---
# Question Generator
Given a context passage and an answer taken from it, this T5 model generates a question that the answer responds to.
### Out-of-Scope Use
Only English is supported; inputs in other languages are out of scope.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "pipesanma/chasquilla-question-generator"


def question_parser(question: str) -> str:
    # Drop the label before the first colon (if any) and normalize whitespace.
    return " ".join(question.split(":", 1)[-1].split())


def generate_questions_v2(context: str, answer: str, n_questions: int = 1):
    # Note: the model and tokenizer are loaded on every call; load them once
    # outside this function if you generate many questions.
    model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
    tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()

    # Prompt format expected by the model.
    text = "context: " + context + " " + "answer: " + answer + " </s>"

    # truncation=True makes max_length actually cap long contexts.
    encoding = tokenizer(
        text, max_length=512, truncation=True, padding=True, return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    # n_questions must not exceed num_beams.
    beam_outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=72,
        early_stopping=True,
        num_beams=5,
        num_return_sequences=n_questions,
    )

    questions = []
    for beam_output in beam_outputs:
        sent = tokenizer.decode(
            beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        print(sent)  # raw model output, before parsing
        questions.append(question_parser(sent))

    return questions


context = "President Donald Trump said and predicted that some states would reopen this month."
answer = "Donald Trump"

questions = generate_questions_v2(context, answer, 1)
print(questions)
```
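When asking for more than one question, two things are worth handling. With beam search, `num_return_sequences` cannot exceed `num_beams` (5 in the example), and the returned beams are often near-duplicates of each other. The following is a minimal sketch of two helpers for this; `check_n_questions` and `dedupe_questions` are hypothetical names and are not part of this repository.

```python
from typing import List


def check_n_questions(n_questions: int, num_beams: int = 5) -> int:
    """Validate that the requested count fits within the beam width."""
    if not 1 <= n_questions <= num_beams:
        raise ValueError(
            f"n_questions must be between 1 and num_beams ({num_beams}), "
            f"got {n_questions}"
        )
    return n_questions


def dedupe_questions(questions: List[str]) -> List[str]:
    """Drop questions that differ only in letter case or whitespace.

    The first occurrence is kept and the original order is preserved.
    """
    seen = set()
    unique = []
    for question in questions:
        key = " ".join(question.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(question)
    return unique
```

A possible use is `dedupe_questions(generate_questions_v2(context, answer, check_n_questions(3)))`.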
## Training Details
### Dataset generation
The model was trained on the `squad` dataset from the Hugging Face `datasets` library.
See [utils/dataset_gen.py](utils/dataset_gen.py) for the dataset generation code.
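To illustrate the kind of preprocessing involved, here is a minimal sketch of turning one SQuAD record into a source/target text pair. The source string follows the prompt format used at inference time. The `question: ` prefix on the target and the choice of the first listed answer are assumptions, and `squad_to_pair` is a hypothetical helper; the actual logic lives in `utils/dataset_gen.py` and may differ.

```python
from typing import Tuple


def squad_to_pair(record: dict) -> Tuple[str, str]:
    """Build (source, target) strings from a SQuAD-style record.

    Assumes the standard SQuAD fields: "context", "question", and
    "answers" with a list of answer strings under "text".
    """
    answer = record["answers"]["text"][0]  # assumption: use the first answer
    source = "context: " + record["context"] + " " + "answer: " + answer + " </s>"
    target = "question: " + record["question"]  # assumption: target prefix
    return source, target


# Hand-written record in the SQuAD layout, for illustration only.
example = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
source, target = squad_to_pair(example)
print(source)
print(target)
```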
### Training model
See [utils/t5_train_model.py](utils/t5_train_model.py) for the training code.
### Model and Tokenizer versions
- (v1.0) Model and Tokenizer V1: trained on 1,000 rows
- (v1.1) Model and Tokenizer V2: trained on 3,000 rows
- (v1.2) Model and Tokenizer V3: trained on all rows of the dataset (78,664 training rows, 9,652 validation rows)