{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "l5Ds1ZM41KC9" }, "source": [ "## Introduction: TAPAS\n", "\n", "* Original TAPAS paper (ACL 2020): https://www.aclweb.org/anthology/2020.acl-main.398/\n", "* Follow-up paper on intermediate pre-training (EMMNLP Findings 2020): https://www.aclweb.org/anthology/2020.findings-emnlp.27/\n", "* Original Github repository: https://github.com/google-research/tapas\n", "* Blog post: https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html\n", "\n", "TAPAS is an algorithm that (among other tasks) can answer questions about tabular data. It is essentially a BERT model with relative position embeddings and additional token type ids that encode tabular structure, and 2 classification heads on top: one for **cell selection** and one for (optionally) performing an **aggregation** among selected cells (such as summing or counting).\n", "\n", "Similar to BERT, the base `TapasModel` is pre-trained using the masked language modeling (MLM) objective on a large collection of tables from Wikipedia and associated texts. In addition, the authors further pre-trained the model on an second task (table entailment) to increase the numerical reasoning capabilities of TAPAS (as explained in the follow-up paper), which further improves performance on downstream tasks.\n", "\n", "In this notebook, we are going to fine-tune `TapasForQuestionAnswering` on [Sequential Question Answering (SQA)](https://www.microsoft.com/en-us/research/publication/search-based-neural-structured-learning-sequential-question-answering/), a dataset built by Microsoft Research which deals with asking questions related to a table in a **conversational set-up**. 
We are going to do so as in the original paper, by adding a randomly initialized cell selection head on top of the pre-trained base model (note that SQA does not contain questions that involve aggregation, hence no aggregation head is needed), and then fine-tuning everything end-to-end.\n", "\n", "First, we install both the Transformers library and the dependency on [`torch-scatter`](https://github.com/rusty1s/pytorch_scatter), which the model requires." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MUMrt5Ow_PEA", "outputId": "eda7d53e-9846-4941-ed72-ce84f495469f" }, "outputs": [], "source": [ "#! rm -r transformers\n", "#! git clone https://github.com/huggingface/transformers.git\n", "#! cd transformers\n", "#! pip install ./transformers" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gx4u09iTyRjY", "outputId": "e4cd9f4b-7d8d-4b47-e8b2-304b921dba98" }, "outputs": [], "source": [ "#! pip install torch-scatter==latest+cu101 -f https://pytorch-geometric.com/whl/torch-1.7.0.html" ] }, { "cell_type": "markdown", "metadata": { "id": "BSZfmBt0meYm" }, "source": [ "We also download a small portion of the SQA training set, for demonstration purposes. This is a TSV file containing table-question pairs. Besides this, we also download the `table_csv` directory, which contains the actual tabular data.\n", "\n", "Note that you can download the entire SQA dataset on the [official website](https://www.microsoft.com/en-us/download/details.aspx?id=54253)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wsuwgDEU4J_f" }, "outputs": [], "source": [ "import requests, zipfile, io\n", "import os\n", "\n", "def download_files(dir_name):\n", " if not os.path.exists(dir_name):\n", " # 28 training examples from the SQA training set + table csv data\n", " urls = [\"https://www.dropbox.com/s/2p6ez9xro357i63/sqa_train_set_28_examples.zip?dl=1\",\n", " \"https://www.dropbox.com/s/abhum8ssuow87h6/table_csv.zip?dl=1\"\n", " ]\n", " for url in urls:\n", " r = requests.get(url)\n", " z = zipfile.ZipFile(io.BytesIO(r.content))\n", " z.extractall()\n", "\n", "dir_name = \"sqa_data\"\n", "download_files(dir_name)" ] }, { "cell_type": "markdown", "metadata": { "id": "EPrYJOn81f0D" }, "source": [ "## Prepare the data\n", "\n", "Let's look at the first few rows of the dataset:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 279 }, "id": "2X27wyd805D8", "outputId": "7ccfd32c-e8dd-4fec-c044-d7d8de8dd578" }, "outputs": [ { "data": { "text/html": [ "
\n", " | id | \n", "annotator | \n", "position | \n", "question | \n", "table_file | \n", "answer_coordinates | \n", "answer_text | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "nt-639 | \n", "0 | \n", "0 | \n", "where are the players from? | \n", "table_csv/203_149.csv | \n", "['(0, 4)', '(1, 4)', '(2, 4)', '(3, 4)', '(4, ... | \n", "['Louisiana State University', 'Valley HS (Las... | \n", "
1 | \n", "nt-639 | \n", "0 | \n", "1 | \n", "which player went to louisiana state university? | \n", "table_csv/203_149.csv | \n", "['(0, 1)'] | \n", "['Ben McDonald'] | \n", "
2 | \n", "nt-639 | \n", "1 | \n", "0 | \n", "who are the players? | \n", "table_csv/203_149.csv | \n", "['(0, 1)', '(1, 1)', '(2, 1)', '(3, 1)', '(4, ... | \n", "['Ben McDonald', 'Tyler Houston', 'Roger Salke... | \n", "
3 | \n", "nt-639 | \n", "1 | \n", "1 | \n", "which ones are in the top 26 picks? | \n", "table_csv/203_149.csv | \n", "['(0, 1)', '(1, 1)', '(2, 1)', '(3, 1)', '(4, ... | \n", "['Ben McDonald', 'Tyler Houston', 'Roger Salke... | \n", "
4 | \n", "nt-639 | \n", "1 | \n", "2 | \n", "and of those, who is from louisiana state univ... | \n", "table_csv/203_149.csv | \n", "['(0, 1)'] | \n", "['Ben McDonald'] | \n", "
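In the TSV file, the `answer_coordinates` arrive as stringified tuples (as rendered above), while the view that follows shows them already converted to real `(row, column)` tuples. A minimal sketch of that conversion, using `ast.literal_eval` on toy rows mimicking the TSV (the column names follow the dataset, the parsing helper is our own):

```python
import ast
import pandas as pd

# Toy rows mimicking the SQA TSV: answer_coordinates are stored as
# stringified tuples, e.g. "['(0, 4)', '(1, 4)']".
data = pd.DataFrame({
    "question": ["where are the players from?",
                 "which player went to louisiana state university?"],
    "answer_coordinates": ["['(0, 4)', '(1, 4)']", "['(0, 1)']"],
})

def parse_coordinates(raw):
    # "['(0, 4)', '(1, 4)']" -> [(0, 4), (1, 4)]
    return [ast.literal_eval(coord) for coord in ast.literal_eval(raw)]

data["answer_coordinates"] = data["answer_coordinates"].apply(parse_coordinates)
print(data["answer_coordinates"].tolist())  # [[(0, 4), (1, 4)], [(0, 1)]]
```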
\n", " | id | \n", "annotator | \n", "position | \n", "question | \n", "table_file | \n", "answer_coordinates | \n", "answer_text | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "nt-639 | \n", "0 | \n", "0 | \n", "where are the players from? | \n", "table_csv/203_149.csv | \n", "[(0, 4), (1, 4), (2, 4), (3, 4), (4, 4), (5, 4... | \n", "[Louisiana State University, Valley HS (Las Ve... | \n", "
1 | \n", "nt-639 | \n", "0 | \n", "1 | \n", "which player went to louisiana state university? | \n", "table_csv/203_149.csv | \n", "[(0, 1)] | \n", "[Ben McDonald] | \n", "
2 | \n", "nt-639 | \n", "1 | \n", "0 | \n", "who are the players? | \n", "table_csv/203_149.csv | \n", "[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1... | \n", "[Ben McDonald, Tyler Houston, Roger Salkeld, J... | \n", "
3 | \n", "nt-639 | \n", "1 | \n", "1 | \n", "which ones are in the top 26 picks? | \n", "table_csv/203_149.csv | \n", "[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1... | \n", "[Ben McDonald, Tyler Houston, Roger Salkeld, J... | \n", "
4 | \n", "nt-639 | \n", "1 | \n", "2 | \n", "and of those, who is from louisiana state univ... | \n", "table_csv/203_149.csv | \n", "[(0, 1)] | \n", "[Ben McDonald] | \n", "
5 | \n", "nt-639 | \n", "2 | \n", "0 | \n", "who are the players in the top 26? | \n", "table_csv/203_149.csv | \n", "[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1... | \n", "[Ben McDonald, Tyler Houston, Roger Salkeld, J... | \n", "
6 | \n", "nt-639 | \n", "2 | \n", "1 | \n", "of those, which one was from louisiana state u... | \n", "table_csv/203_149.csv | \n", "[(0, 1)] | \n", "[Ben McDonald] | \n", "
7 | \n", "nt-11649 | \n", "0 | \n", "0 | \n", "what are all the names of the teams? | \n", "table_csv/204_135.csv | \n", "[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1... | \n", "[Cordoba CF, CD Malaga, Granada CF, UD Las Pal... | \n", "
8 | \n", "nt-11649 | \n", "0 | \n", "1 | \n", "of these, which teams had any losses? | \n", "table_csv/204_135.csv | \n", "[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1... | \n", "[Cordoba CF, CD Malaga, Granada CF, UD Las Pal... | \n", "
9 | \n", "nt-11649 | \n", "0 | \n", "2 | \n", "of these teams, which had more than 21 losses? | \n", "table_csv/204_135.csv | \n", "[(15, 1)] | \n", "[CD Villarrobledo] | \n", "
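The `sequence_id` column appearing in the next view identifies one conversation: the rows above show that `nt-639` with annotator `0` becomes `nt-639-0`, annotator `1` becomes `nt-639-1`, and so on. A sketch of deriving it by joining `id` and `annotator` (toy rows, pandas string concatenation assumed as the mechanism):

```python
import pandas as pd

# Minimal rows mimicking the SQA TSV fields used here.
data = pd.DataFrame({
    "id": ["nt-639", "nt-639", "nt-639"],
    "annotator": [0, 1, 1],
    "position": [0, 0, 1],
})

# One conversation = one (id, annotator) pair.
data["sequence_id"] = data["id"] + "-" + data["annotator"].astype(str)
print(data["sequence_id"].tolist())  # ['nt-639-0', 'nt-639-1', 'nt-639-1']
```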
\n", " | id | \n", "annotator | \n", "position | \n", "question | \n", "table_file | \n", "answer_coordinates | \n", "answer_text | \n", "sequence_id | \n", "
---|---|---|---|---|---|---|---|---|
0 | \n", "nt-639 | \n", "0 | \n", "0 | \n", "where are the players from? | \n", "table_csv/203_149.csv | \n", "[(0, 4), (1, 4), (2, 4), (3, 4), (4, 4), (5, 4... | \n", "[Louisiana State University, Valley HS (Las Ve... | \n", "nt-639-0 | \n", "
1 | \n", "nt-639 | \n", "0 | \n", "1 | \n", "which player went to louisiana state university? | \n", "table_csv/203_149.csv | \n", "[(0, 1)] | \n", "[Ben McDonald] | \n", "nt-639-0 | \n", "
2 | \n", "nt-639 | \n", "1 | \n", "0 | \n", "who are the players? | \n", "table_csv/203_149.csv | \n", "[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1... | \n", "[Ben McDonald, Tyler Houston, Roger Salkeld, J... | \n", "nt-639-1 | \n", "
3 | \n", "nt-639 | \n", "1 | \n", "1 | \n", "which ones are in the top 26 picks? | \n", "table_csv/203_149.csv | \n", "[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1... | \n", "[Ben McDonald, Tyler Houston, Roger Salkeld, J... | \n", "nt-639-1 | \n", "
4 | \n", "nt-639 | \n", "1 | \n", "2 | \n", "and of those, who is from louisiana state univ... | \n", "table_csv/203_149.csv | \n", "[(0, 1)] | \n", "[Ben McDonald] | \n", "nt-639-1 | \n", "
\n", " | question | \n", "table_file | \n", "answer_coordinates | \n", "answer_text | \n", "
---|---|---|---|---|
sequence_id | \n", "\n", " | \n", " | \n", " | \n", " |
ns-1292-0 | \n", "[who are all the athletes?, where are they fro... | \n", "table_csv/204_521.csv | \n", "[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, ... | \n", "[[Tommy Green, Janis Dalins, Ugo Frigerio, Kar... | \n", "
nt-10730-0 | \n", "[what was the production numbers of each revol... | \n", "table_csv/203_253.csv | \n", "[[(0, 4), (1, 4), (2, 4), (3, 4), (4, 4), (5, ... | \n", "[[1,900 (estimated), 14,500 (estimated), 6,000... | \n", "
nt-10730-1 | \n", "[what three revolver models had the least amou... | \n", "table_csv/203_253.csv | \n", "[[(0, 0), (6, 0), (7, 0)], [(0, 0)]] | \n", "[[Remington-Beals Army Model Revolver, New Mod... | \n", "
nt-10730-2 | \n", "[what are all of the remington models?, how ma... | \n", "table_csv/203_253.csv | \n", "[[(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, ... | \n", "[[Remington-Beals Army Model Revolver, Remingt... | \n", "
nt-11649-0 | \n", "[what are all the names of the teams?, of thes... | \n", "table_csv/204_135.csv | \n", "[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, ... | \n", "[[Cordoba CF, CD Malaga, Granada CF, UD Las Pa... | \n", "
nt-11649-1 | \n", "[what are the losses?, what team had more than... | \n", "table_csv/204_135.csv | \n", "[[(0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, ... | \n", "[[6, 6, 9, 10, 10, 12, 12, 11, 13, 14, 15, 14,... | \n", "
nt-11649-2 | \n", "[what were all the teams?, what were the loss ... | \n", "table_csv/204_135.csv | \n", "[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, ... | \n", "[[Cordoba CF, CD Malaga, Granada CF, UD Las Pa... | \n", "
nt-639-0 | \n", "[where are the players from?, which player wen... | \n", "table_csv/203_149.csv | \n", "[[(0, 4), (1, 4), (2, 4), (3, 4), (4, 4), (5, ... | \n", "[[Louisiana State University, Valley HS (Las V... | \n", "
nt-639-1 | \n", "[who are the players?, which ones are in the t... | \n", "table_csv/203_149.csv | \n", "[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, ... | \n", "[[Ben McDonald, Tyler Houston, Roger Salkeld, ... | \n", "
nt-639-2 | \n", "[who are the players in the top 26?, of those,... | \n", "table_csv/203_149.csv | \n", "[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, ... | \n", "[[Ben McDonald, Tyler Houston, Roger Salkeld, ... | \n", "
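The grouped view above collects every question of a conversation into one list, indexed by `sequence_id`, with questions ordered by their `position` in the dialogue. A minimal pandas sketch of that grouping (toy frame standing in for the parsed data; the exact aggregation in the notebook may differ):

```python
import pandas as pd

# Toy frame standing in for the parsed SQA data.
data = pd.DataFrame({
    "sequence_id": ["nt-639-0", "nt-639-0", "nt-639-1"],
    "position": [0, 1, 0],
    "question": ["where are the players from?",
                 "which player went to louisiana state university?",
                 "who are the players?"],
    "table_file": ["table_csv/203_149.csv"] * 3,
})

# Sort by position so questions appear in conversational order,
# then collect every question of a conversation into one list.
grouped = (data.sort_values("position")
               .groupby("sequence_id")
               .agg({"question": list, "table_file": "first"}))
print(grouped.loc["nt-639-0", "question"])
```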
\n", " | Rank | \n", "Name | \n", "Nationality | \n", "Time (hand) | \n", "Notes | \n", "
---|---|---|---|---|---|
0 | \n", "nan | \n", "Tommy Green | \n", "Great Britain | \n", "4:50:10 | \n", "OR | \n", "
1 | \n", "nan | \n", "Janis Dalins | \n", "Latvia | \n", "4:57:20 | \n", "nan | \n", "
2 | \n", "nan | \n", "Ugo Frigerio | \n", "Italy | \n", "4:59:06 | \n", "nan | \n", "
3 | \n", "4.0 | \n", "Karl Hahnel | \n", "Germany | \n", "5:06:06 | \n", "nan | \n", "
4 | \n", "5.0 | \n", "Ettore Rivolta | \n", "Italy | \n", "5:07:39 | \n", "nan | \n", "
5 | \n", "6.0 | \n", "Paul Sievert | \n", "Germany | \n", "5:16:41 | \n", "nan | \n", "
6 | \n", "7.0 | \n", "Henri Quintric | \n", "France | \n", "5:27:25 | \n", "nan | \n", "
7 | \n", "8.0 | \n", "Ernie Crosbie | \n", "United States | \n", "5:28:02 | \n", "nan | \n", "
8 | \n", "9.0 | \n", "Bill Chisholm | \n", "United States | \n", "5:51:00 | \n", "nan | \n", "
9 | \n", "10.0 | \n", "Alfred Maasik | \n", "Estonia | \n", "6:19:00 | \n", "nan | \n", "
10 | \n", "nan | \n", "Henry Cieman | \n", "Canada | \n", "nan | \n", "DNF | \n", "
11 | \n", "nan | \n", "John Moralis | \n", "Greece | \n", "nan | \n", "DNF | \n", "
12 | \n", "nan | \n", "Francesco Pretti | \n", "Italy | \n", "nan | \n", "DNF | \n", "
13 | \n", "nan | \n", "Arthur Tell Schwab | \n", "Switzerland | \n", "nan | \n", "DNF | \n", "
14 | \n", "nan | \n", "Harry Hinkel | \n", "United States | \n", "nan | \n", "DNF | \n", "
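TAPAS works on tables whose cells are all strings, which is plausibly why the rendered table above shows literal `nan` entries: casting a DataFrame to `str` turns missing values into the string `"nan"`. A self-contained sketch, using an inline stand-in for one of the files under `table_csv/`:

```python
import io
import pandas as pd

# Stand-in for the first rows of a file under table_csv/ (athletes table).
csv_text = """Rank,Name,Nationality,Time (hand),Notes
,Tommy Green,Great Britain,4:50:10,OR
,Janis Dalins,Latvia,4:57:20,
4.0,Karl Hahnel,Germany,5:06:06,
"""

# Read the CSV and cast the whole frame to str: every cell becomes a
# string, and missing values become the literal string "nan".
table = pd.read_csv(io.StringIO(csv_text)).astype(str)
print(table.dtypes.unique())  # every column is now of object (string) dtype
```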
\n", " | Actors | \n", "Age | \n", "Number of movies | \n", "Date of birth | \n", "
---|---|---|---|---|
0 | \n", "Brad Pitt | \n", "56 | \n", "87 | \n", "7 february 1967 | \n", "
1 | \n", "Leonardo Di Caprio | \n", "45 | \n", "53 | \n", "10 june 1996 | \n", "
2 | \n", "George Clooney | \n", "59 | \n", "69 | \n", "28 november 1967 | \n", "
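The last table above is a small demo table for trying out the fine-tuned model. A sketch of rebuilding it as an all-string DataFrame and pairing it with a query; the commented tokenizer lines indicate how the pair could then be encoded with Transformers (checkpoint name and call shown as an assumption, not executed here):

```python
import pandas as pd

# The inference table rendered above, rebuilt with every cell as a string
# (TAPAS expects string-typed tables).
data = {
    "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
    "Age": ["56", "45", "59"],
    "Number of movies": ["87", "53", "69"],
    "Date of birth": ["7 february 1967", "10 june 1996", "28 november 1967"],
}
table = pd.DataFrame(data)
queries = ["How many movies has George Clooney played in?"]

# With transformers installed, the pair could then be encoded roughly as
# (sketch, not executed here):
# from transformers import TapasTokenizer
# tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-sqa")
# inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
print(table.shape)  # (3, 4)
```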