---
license: apache-2.0
datasets:
- squad
language:
- en
---

# Question Generator

This model generates questions from a given context string and a target answer.

### Out-of-Scope Use

The model supports English only.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
# Requires: pip install torch transformers sentencepiece
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "pipesanma/chasquilla-question-generator"


def question_parser(question: str) -> str:
    # Drop everything up to the first ":" (the output prefix) and normalize
    # whitespace. If there is no ":", return the cleaned text unchanged.
    _, sep, rest = question.partition(":")
    return " ".join((rest if sep else question).split())


def generate_questions_v2(context: str, answer: str, n_questions: int = 1):
    model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
    tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()

    # Prompt format expected by the model.
    text = "context: " + context + " " + "answer: " + answer + " "

    # truncation=True is needed for max_length to actually cap the input.
    encoding = tokenizer(
        text, max_length=512, truncation=True, return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    # num_return_sequences must not exceed num_beams (5 here).
    beam_outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=72,
        early_stopping=True,
        num_beams=5,
        num_return_sequences=n_questions,
    )

    questions = []
    for beam_output in beam_outputs:
        sent = tokenizer.decode(
            beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        print(sent)  # raw decoded output, before the prefix is stripped
        questions.append(question_parser(sent))

    return questions


context = "President Donald Trump said and predicted that some states would reopen this month."
answer = "Donald Trump"

questions = generate_questions_v2(context, answer, 1)
print(questions)
```

## Training Details

### Dataset generation

The training data is the `squad` dataset from the Hugging Face `datasets` library. See [utils/dataset_gen.py](utils/dataset_gen.py) for the dataset generation code.
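
As a rough illustration of what such a generation step might look like, the sketch below turns one SQuAD-style row into input/target text pairs. The input format mirrors the prompt used in the getting-started code. The function name `squad_row_to_pairs`, the `question: ` target prefix, and the hand-written example row are assumptions made for this sketch; the actual logic lives in `utils/dataset_gen.py` and may differ.

```python
from typing import Dict, List


def squad_row_to_pairs(row: Dict) -> List[Dict[str, str]]:
    """Turn one SQuAD-style row into (input, target) text pairs.

    Assumed formats (not confirmed by the repository):
      input  -> "context: <context> answer: <answer> "
      target -> "question: <question>"
    """
    pairs = []
    # SQuAD stores answers as {"text": [...], "answer_start": [...]}.
    for answer_text in row["answers"]["text"]:
        pairs.append(
            {
                "input": "context: " + row["context"] + " answer: " + answer_text + " ",
                "target": "question: " + row["question"],
            }
        )
    return pairs


# A hand-written row in the SQuAD schema, used instead of downloading the dataset.
example_row = {
    "context": "The Eiffel Tower is located in Paris.",
    "question": "Where is the Eiffel Tower located?",
    "answers": {"text": ["Paris"], "answer_start": [31]},
}

pairs = squad_row_to_pairs(example_row)
print(pairs[0]["input"])
print(pairs[0]["target"])
```

With the real dataset, the same function could be applied to each row returned by `datasets.load_dataset("squad")`.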

### Training model

Check the [utils/t5_train_model.py](utils/t5_train_model.py) file for the training process.

### Model and Tokenizer versions

- (v1.0) Model and Tokenizer V1: trained with 1000 rows
- (v1.1) Model and Tokenizer V2: trained with 3000 rows
- (v1.2) Model and Tokenizer V3: trained with all rows from datasets (78664 rows-train, 9652 rows-validation)