A model for multiple question generation in French and English

Training

The model was trained on several French and English corpora (FQuAD, PIAF and SQuAD); for each paragraph, the objective is to predict all the possible questions. We use the mBART model facebook/mbart-large-50-many-to-many-mmt. Note that MBarthez-like models would work better for French generation (a fine-tuned MBarthez will be uploaded later). For every dataset we translate the questions (into French or into English) but always keep the paragraph in its original language, so the model can generate English questions from a French paragraph. We trained the model for 12 epochs, where one epoch is defined as 2000 batches of size 128 (counting gradient accumulation).
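To illustrate the "all questions for one paragraph" objective, here is a minimal sketch of how such targets could be assembled. It is not the authors' preprocessing code: the helper `build_targets` and the sample pairs are hypothetical, and joining questions with `<question_sep>` is an assumption based on the separator that appears in the model output.

```python
from collections import defaultdict

# Separator token as it appears in the model output (assumed to be used
# the same way when building training targets).
QUESTION_SEP = "<question_sep>"


def build_targets(pairs):
    """Group (paragraph, question) pairs into one target string per paragraph.

    Hypothetical helper, for illustration only.
    """
    grouped = defaultdict(list)
    for paragraph, question in pairs:
        grouped[paragraph].append(question.strip())
    # One training target per paragraph: all its questions, joined by the separator
    return {p: QUESTION_SEP.join(qs) for p, qs in grouped.items()}


# Made-up sample data: two questions written for the same paragraph
pairs = [
    ("Paris is the capital of France.", "What is the capital of France?"),
    ("Paris is the capital of France.", "Which country has Paris as its capital?"),
]
targets = build_targets(pairs)
```

With this layout, each paragraph is a single source sequence and its concatenated questions form a single target sequence.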

Generate

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

access_token = "hf_......"

# Loading the model weights and the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ThomasGerald/mbart-multi-question-generation", use_auth_token=access_token)
model = AutoModelForSeq2SeqLM.from_pretrained("ThomasGerald/mbart-multi-question-generation", use_auth_token=access_token)

# For the example, we use the following text about the origin of the Greek language
# (each fragment ends with a space so that words are not glued together when concatenated)
text = ("La recherche moderne considère généralement que la langue grecque n'est pas née en Grèce, " +
   "mais elle n'est pas arrivée à un consensus quant à la date d'arrivée des groupes parlant un " +
   "« proto-grec », qui s'est produite durant des phases préhistoriques pour lesquelles il n'y a " +
   "pas de texte indiquant quelles langues étaient parlées. Les premiers textes écrits en grec sont " +
   "les tablettes en linéaire B de l'époque mycénienne, au XIVe siècle av. J.-C., ce qui indique que " +
   "des personnes parlant un dialecte grec sont présentes en Grèce au plus tard durant cette période." +
   " La linguistique n'est pas en mesure de trancher, pas plus que l'archéologie.")

# We specify the input language and tokenize
tokenizer.set_src_lang_special_tokens("fr_XX")
tokenized_text = tokenizer([text], return_tensors="pt")
########## FRENCH DECODING
# We generate given the output language code
output = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.lang_code_to_id['fr_XX'])

# We show the output of the model (a sequence of questions separated by the <question_sep> token)
tokenizer.batch_decode(output, skip_special_tokens=False)
#### output : 
'''["</s>fr_XX Quelle est l'origine de la langue grecque selon la recherche moderne?<question_sep>
Quelle est la date d'arrivée des grecs proto-grecs?<question_sep> Où sont les premiers textes
écrits en grec?<question_sep> De quand date l'époque mycénienne?<question_sep>
Qu'est ce que la linguistique n'est pas en mesure de faire?</s>"]
'''
######### ENGLISH DECODING
# We generate given the output language code
output = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.lang_code_to_id['en_XX'])

# We show the output of the model (a sequence of questions separated by the <question_sep> token)
tokenizer.batch_decode(output, skip_special_tokens=False)
#### output : 
'''["</s>en_XX What is the origin of the Greek language according to modern research?
<question_sep> When did the prehistoric phases take place?<question_sep> What are the
first texts written in Greek?<question_sep> When do the first written texts in Greek
date back?<question_sep> What is the only science that can't decide the dates of the
first Greek texts?</s>"]
'''
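
The decoded string still contains the special tokens and the `<question_sep>` separators. A minimal post-processing sketch for turning it into a list of questions might look like the following. The helper `split_questions` is hypothetical (it is not part of the model or of transformers), and it assumes the output has the shape shown above: an optional `</s>`, a language code such as `fr_XX`, then questions separated by `<question_sep>`.

```python
import re


def split_questions(decoded, sep="<question_sep>"):
    """Split one decoded output string into a list of clean questions.

    Hypothetical helper; assumes the output format shown in the examples above.
    """
    # Drop sentence-boundary and padding tokens
    text = re.sub(r"</?s>|<pad>", "", decoded)
    # Drop the leading language code (e.g. "fr_XX", "en_XX")
    text = re.sub(r"^\s*[a-z]{2}_[A-Z]{2}\s*", "", text)
    # Split on the separator and normalize whitespace inside each question
    questions = (" ".join(part.split()) for part in text.split(sep))
    return [q for q in questions if q]


example = "</s>en_XX What is the origin of the Greek language?<question_sep> What are the first texts written in Greek?</s>"
questions = split_questions(example)
```

Applied to the generation code above, this would be called on each element of `tokenizer.batch_decode(output, skip_special_tokens=False)`.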
