AutoTrain documentation

Extractive Question Answering

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v0.8.24).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Extractive Question Answering

Extractive Question Answering is a task in which a model is trained to extract the answer to a question from a given context. The model is trained to predict the start and end positions of the answer span within the context. This task is commonly used in question-answering systems to extract relevant information from a large corpus of text.

Preparing your data

To train an Extractive Question Answering model, you need a dataset that contains the following columns:

  • text: The context or passage from which the answer is to be extracted.
  • question: The question for which the answer is to be extracted.
  • answer: The start position of the answer span in the context.

Here is an example of how your dataset should look:

{"context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.","question":"To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?","answers":{"text":["Saint Bernadette Soubirous"],"answer_start":[515]}}
{"context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.","question":"What is in front of the Notre Dame Main Building?","answers":{"text":["a copper statue of Christ"],"answer_start":[188]}}
{"context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.","question":"The Basilica of the Sacred heart at Notre Dame is beside to which structure?","answers":{"text":["the Main Building"],"answer_start":[279]}}

Note: the preferred format for question answering is JSONL, if you want to use CSV, the answer column should be stringified JSON with the keys text and answer_start.

Example dataset from Hugging Face Hub: lhoestq/squad

P.S. You can use both squad and squad v2 data format with correct column mappings.

< > Update on GitHub