---
language:
  - en
  - hi
  - de
  - ar
  - bn
  - fi
  - ja
  - zh
  - id
  - sw
  - ta
  - el
  - ru
  - es
  - th
  - tr
  - vi
  - multilingual
datasets:
  - squad_v2
  - tydiqa
  - mlqa
  - xquad
  - germanquad
widget:
  - text: >-
      Hugging Face has seen rapid growth in its popularity since the get-go. It
      is definitely doing the right things to attract more and more people to
      its platform, some of which are on the following lines: Community driven
      approach through large open source repositories along with paid services.
      Helps to build a network of like-minded people passionate about open
      source. Attractive price point. The subscription-based features, e.g.:
      Inference based API, starts at a price of $9/month.
    example_title: English
  - text: >-
      A un año y tres días de que el balón ruede en el Al Bayt Stadium
      inaugurando el Mundial 2022, ya se han dibujado los primeros bocetos de
      la próxima Copa del Mundo. 13 selecciones están colocadas en el mapa con
      la etiqueta de clasificadas y tienen asegurado pisar los verdes de Qatar
      en la primera fase final otoñal. Serbia, Dinamarca, España, Países Bajos,
      Suiza, Croacia, Francia, Inglaterra, Bélgica, Alemania, Brasil, Argentina
      y Qatar, como anfitriona, entrarán en el sorteo del 1 de abril de 2022 en
      Doha en el que 32 países serán repartidos en sus respectivos grupos.
    example_title: Spanish
---

# Multi-lingual Question Generating Model (mt5-small)

Give the model a passage and it will generate a question about the passage.

Trained on the following datasets:

- squad_v2
- tydiqa
- mlqa
- xquad
- germanquad

## Training details

I used the Hugging Face flax summarization script and a TPU v3-8. The summarization script expects a `text` column and a `summary` column; for question-generation training, supply the `context` column in place of `text` and the `question` column in place of `summary`.
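The column remapping above can be sketched as a small preprocessing step. This is a minimal illustration, not code from the training run; the helper name is hypothetical, and the record is a made-up SQuAD-style example.

```python
# Hypothetical helper: rename a QA example's columns to the
# text/summary columns the summarization script expects.
def to_summarization_columns(example):
    return {"text": example["context"], "summary": example["question"]}

record = {
    "context": "Hugging Face has seen rapid growth in its popularity.",
    "question": "What has seen rapid growth in its popularity?",
}

print(to_summarization_columns(record))
# {'text': 'Hugging Face has seen rapid growth in its popularity.',
#  'summary': 'What has seen rapid growth in its popularity?'}
```

With the columns renamed this way, the summarization script can be run unchanged on QA data.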

## Limitations and Intended Use

There is no guarantee that the model will produce a question in the language of the passage, but it usually does. Lower-resource languages will likely yield lower-quality questions.

The intended use is to generate a question from a given passage. A larger model trained this way might be able to generate training data for question-answering models, but this small one does not produce high-quality questions.

## Using the model

### PyTorch version

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("nbroad/mt5-small-qgen")
model = AutoModelForSeq2SeqLM.from_pretrained("nbroad/mt5-small-qgen")

text = "Hugging Face has seen rapid growth in its \npopularity since the get-go. It is definitely doing\n the right things to attract more and more people to \n its platform, some of which are on the following lines:\nCommunity driven approach through large open source repositories \nalong with paid services. Helps to build a network of like-minded\n people passionate about open source. \nAttractive price point. The subscription-based features, e.g.: \nInference based API, starts at a price of $9/month.\n"

inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_length=40)

tokenizer.decode(output[0], skip_special_tokens=True)
# What is the subscription-based features that starts at a price of $/month
```

Model trained on Cloud TPUs from Google's TPU Research Cloud (TRC)