A MBARTHEZ MODEL TRAINED FOR QUESTION GENERATION

Training

The model was trained on several French and English corpora (FQuAD, PIAF and SQuAD).

Generate

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model; the repository is gated, so an access token is required
access_token = "hf_......"
tokenizer = AutoTokenizer.from_pretrained("ThomasGerald/MBARTHEZ-QG", use_auth_token=access_token)
model = AutoModelForSeq2SeqLM.from_pretrained("ThomasGerald/MBARTHEZ-QG", use_auth_token=access_token)

# Example input: the <hl> token delimits the span that the question should be about
text = ("La recherche moderne considère généralement que la langue grecque n'est pas née en Grèce, " +
   "mais elle n'est pas arrivée à un consensus quant à la date d'arrivée des groupes parlant un " +
   "« proto-grec », qui s'est produite durant des phases préhistoriques pour lesquelles il n'y a " +
   "pas de texte indiquant quelles langues étaient parlées. Les premiers textes écrits en grec <hl>sont " +
   "les tablettes en linéaire B de l'époque mycénienne<hl>, au XIVe siècle av. J.-C., ce qui indique que " +
   "des personnes parlant un dialecte grec sont présentes en Grèce au plus tard durant cette période." +
   " La linguistique n'est pas en mesure de trancher, pas plus que l'archéologie.")

tokenized_text = tokenizer([text], return_tensors="pt")

# Condition the output on the target language through the forced BOS token
# (two tokens are possible here: '[fr_XX]' and '[en_XX]')
output_ids = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.convert_tokens_to_ids('[fr_XX]'))

# Decoding
output = tokenizer.batch_decode(output_ids, skip_special_tokens=False)

# output:
'''['</s>[fr_XX] Quels sont les premiers textes écrits en grec?</s>']'''

We can also generate questions in English from a French context by specifying the beginning-of-sentence token ('[en_XX]'). Reusing the model, tokenizer and tokenized_text from the code above, the following generates an English question:

output_ids = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.convert_tokens_to_ids('[en_XX]'))
output = tokenizer.batch_decode(output_ids, skip_special_tokens=False)

# output:
'''['</s>[en_XX] What are the first texts written in grec?</s>']'''
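The decoded strings above still carry the `</s>` markers and the language tag. A minimal sketch of a post-processing step that strips them, assuming the output always follows the pattern shown in these examples; `clean_question` is a hypothetical helper name and is not part of the model repository:

```python
import re

# Markers observed in the decoded outputs above: <s>, </s>, <pad>,
# and language tags such as [fr_XX] or [en_XX].
_MARKERS = re.compile(r"</?s>|<pad>|\[[a-z]{2}_[A-Z]{2}\]")


def clean_question(decoded: str) -> str:
    """Remove sequence markers and the language tag from one decoded string."""
    return _MARKERS.sub("", decoded).strip()


# Usage with the list returned by tokenizer.batch_decode:
# questions = [clean_question(q) for q in output]
```

Passing `skip_special_tokens=True` to `batch_decode` may achieve the same result, but only if the language tags are registered as special tokens in this tokenizer, which is not confirmed here.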

You can also generate questions from English text:

# Example input: the <hl> token delimits the span that the question should be about
text = ("By 371 BC, Thebes was in the ascendancy, defeating Sparta at " +
        "<hl>the Battle of Leuctra<hl>, killing the Spartan king Cleombrotus I" +
        ", and invading Laconia. Further Theban successes against Sparta " +
        "in 369 led to Messenia gaining independence; Sparta never recovered " +
        "from the loss of Messenia's fertile land and the helot workforce it " +
        "provided. The rising power of Thebes led Sparta and Athens to join " +
        "forces; in 362 they were defeated by Thebes at the Battle of Mantinea." +
        " In the aftermath of Mantinea, none of the major Greek states were able " +
        "to dominate. Though Thebes had won the battle, their general Epaminondas " +
        "was killed, and they spent the following decades embroiled in wars with " +
        "their neighbours; Athens, meanwhile, saw its second naval alliance," +
        " formed in 377, collapse in the mid-350s.")

tokenized_text = tokenizer([text], return_tensors="pt")

# French question
output_ids = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.convert_tokens_to_ids('[fr_XX]'))

# Decoding
output = tokenizer.batch_decode(output_ids, skip_special_tokens=False)

# Note that the model does not translate "Sparta", which is "Sparte" in French
'''['</s>[fr_XX] À quelle bataille Sparta a-t-il été vaincu par Thebes?</s>']'''

# English question
output_ids = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.convert_tokens_to_ids('[en_XX]'))

# Decoding
output = tokenizer.batch_decode(output_ids, skip_special_tokens=False)

'''['</s>[en_XX] At what battle did Thebes defeat Sparta?</s>']'''
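
In the examples above the `<hl>` tokens are typed by hand around the answer span. A minimal sketch of a helper that inserts them automatically, assuming the answer appears verbatim in the context and only its first occurrence should be marked; `highlight_answer` is a hypothetical name and is not part of the model repository:

```python
def highlight_answer(context: str, answer: str, hl_token: str = "<hl>") -> str:
    """Wrap the first occurrence of `answer` in `context` with `hl_token`."""
    start = context.find(answer)
    if start == -1:
        # The span must be present verbatim, otherwise there is nothing to mark.
        raise ValueError("answer span not found in context")
    end = start + len(answer)
    return context[:start] + hl_token + answer + hl_token + context[end:]


# Usage: build the model input from a raw context and an answer span.
# text = highlight_answer(context, "the Battle of Leuctra")
# tokenized_text = tokenizer([text], return_tensors="pt")
```

The token is placed directly against the span with no extra spaces, matching the inputs shown above.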