vaishali committed on
Commit
122d618
1 Parent(s): 2b605fd

Update README.md

Add example script

Files changed (1)
  1. README.md +92 -0
README.md CHANGED
@@ -1,3 +1,95 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
+ language: en
3
+ tags:
4
+ - multitabqa
5
+ - multi-table-question-answering
6
  license: mit
7
  ---
8
+
9
+ # MultiTabQA (base-sized model)
10
+
11
+ MultiTabQA was proposed in [MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering](https://arxiv.org/abs/2305.12820) by Vaishali Pal, Andrew Yates, Evangelos Kanoulas, and Maarten de Rijke. The original repository is available [here](https://github.com/kolk/MultiTabQA).
12
+
13
+ ## Model description
14
+
15
+ MultiTabQA is a table question answering (table QA) model that generates an answer table from multiple input tables. It can handle multi-table operators such as UNION, INTERSECT, EXCEPT, and JOIN.
16
+
17
+ MultiTabQA is based on the TAPEX (BART) architecture, which pairs a bidirectional (BERT-like) encoder with an autoregressive (GPT-like) decoder.
18
+
19
+ ## Intended Uses
20
+
21
+ You can use the raw model for SQL execution over multiple input tables. The model has been fine-tuned on the Spider dataset, where it answers natural language questions over multiple input tables.
22
+
23
+ ### How to Use
24
+
25
+ Here is how to use this model in transformers:
26
+
27
+ ```python
28
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
29
+ import pandas as pd
30
+
31
+ tokenizer = AutoTokenizer.from_pretrained("vaishali/multitabqa-base")
32
+ model = AutoModelForSeq2SeqLM.from_pretrained("vaishali/multitabqa-base")
33
+
34
+ question = "How many departments are led by heads who are not mentioned?"
35
+ table_names = ['department', 'management']
36
+ tables=[{"columns":["Department_ID","Name","Creation","Ranking","Budget_in_Billions","Num_Employees"],
37
+ "index":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14],
38
+ "data":[
39
+ [1,"State","1789",1,9.96,30266.0],
40
+ [2,"Treasury","1789",2,11.1,115897.0],
41
+ [3,"Defense","1947",3,439.3,3000000.0],
42
+ [4,"Justice","1870",4,23.4,112557.0],
43
+ [5,"Interior","1849",5,10.7,71436.0],
44
+ [6,"Agriculture","1889",6,77.6,109832.0],
45
+ [7,"Commerce","1903",7,6.2,36000.0],
46
+ [8,"Labor","1913",8,59.7,17347.0],
47
+ [9,"Health and Human Services","1953",9,543.2,67000.0],
48
+ [10,"Housing and Urban Development","1965",10,46.2,10600.0],
49
+ [11,"Transportation","1966",11,58.0,58622.0],
50
+ [12,"Energy","1977",12,21.5,116100.0],
51
+ [13,"Education","1979",13,62.8,4487.0],
52
+ [14,"Veterans Affairs","1989",14,73.2,235000.0],
53
+ [15,"Homeland Security","2002",15,44.6,208000.0]
54
+ ]
55
+ },
56
+ {"columns":["department_ID","head_ID","temporary_acting"],
57
+ "index":[0,1,2,3,4],
58
+ "data":[
59
+ [2,5,"Yes"],
60
+ [15,4,"Yes"],
61
+ [2,6,"Yes"],
62
+ [7,3,"No"],
63
+ [11,10,"No"]
64
+ ]
65
+ }]
66
+
67
+ input_tables = [pd.DataFrame(table["data"], index=table["index"], columns=table["columns"]) for table in tables]
68
+
69
+ # Flatten the model input as: question + " " + "<table_name> : " + table_name1 + " " + flattened_table1 + " " + "<table_name> : " + table_name2 + " " + flattened_table2 + ...
70
+ # flattened_input = question + " " + " ".join(f"<table_name> : {table_name} {linearize_table(table)}" for table_name, table in zip(table_names, input_tables))
71
+ model_input_string = """How many departments are led by heads who are not mentioned? <table_name> : department col : Department_ID | Name | Creation | Ranking | Budget_in_Billions | Num_Employees row 1 : 1 | State | 1789 | 1 | 9.96 | 30266 row 2 : 2 | Treasury | 1789 | 2 | 11.1 | 115897 row 3 : 3 | Defense | 1947 | 3 | 439.3 | 3000000 row 4 : 4 | Justice | 1870 | 4 | 23.4 | 112557 row 5 : 5 | Interior | 1849 | 5 | 10.7 | 71436 row 6 : 6 | Agriculture | 1889 | 6 | 77.6 | 109832 row 7 : 7 | Commerce | 1903 | 7 | 6.2 | 36000 row 8 : 8 | Labor | 1913 | 8 | 59.7 | 17347 row 9 : 9 | Health and Human Services | 1953 | 9 | 543.2 | 67000 row 10 : 10 | Housing and Urban Development | 1965 | 10 | 46.2 | 10600 row 11 : 11 | Transportation | 1966 | 11 | 58.0 | 58622 row 12 : 12 | Energy | 1977 | 12 | 21.5 | 116100 row 13 : 13 | Education | 1979 | 13 | 62.8 | 4487 row 14 : 14 | Veterans Affairs | 1989 | 14 | 73.2 | 235000 row 15 : 15 | Homeland Security | 2002 | 15 | 44.6 | 208000 <table_name> : management col : department_ID | head_ID | temporary_acting row 1 : 2 | 5 | Yes row 2 : 15 | 4 | Yes row 3 : 2 | 6 | Yes row 4 : 7 | 3 | No row 5 : 11 | 10 | No"""
72
+ inputs = tokenizer(model_input_string, return_tensors="pt")
73
+
74
+ outputs = model.generate(**inputs)
75
+
76
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
77
+ # 'col : count(*) row 1 : 11'
78
+ ```
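The example above hard-codes `model_input_string` and only hints at the flattening step in a comment. The following is a minimal sketch of how such a string could be built from pandas DataFrames. The helper names `linearize_table` and `build_model_input` are assumptions for illustration, not functions confirmed to exist in the original repository, and the two tables are trimmed copies of the ones in the example. Cell values are rendered with plain `str()`, so number formatting (for example `30266.0` versus `30266`) may differ from the preprocessing used in training.

```python
import pandas as pd


def linearize_table(table: pd.DataFrame) -> str:
    """Flatten one table as 'col : c1 | c2 row 1 : v1 | v2 row 2 : ...'.

    Hypothetical helper: mirrors the layout of the hard-coded input string.
    """
    header = "col : " + " | ".join(str(col) for col in table.columns)
    rows = [
        f"row {i} : " + " | ".join(str(value) for value in row)
        for i, row in enumerate(table.itertuples(index=False), start=1)
    ]
    return " ".join([header] + rows)


def build_model_input(question, table_names, tables):
    """Join the question and every '<table_name> : name flattened_table' chunk."""
    chunks = [
        f"<table_name> : {name} {linearize_table(table)}"
        for name, table in zip(table_names, tables)
    ]
    return question + " " + " ".join(chunks)


# Trimmed versions of the example tables (fewer rows and columns).
department = pd.DataFrame(
    [[1, "State"], [2, "Treasury"]],
    columns=["Department_ID", "Name"],
)
management = pd.DataFrame(
    [[2, 5, "Yes"], [15, 4, "Yes"]],
    columns=["department_ID", "head_ID", "temporary_acting"],
)

question = "How many departments are led by heads who are not mentioned?"
flattened_input = build_model_input(
    question, ["department", "management"], [department, management]
)
print(flattened_input)
```

The resulting string can be passed to `tokenizer(...)` in place of the hard-coded `model_input_string`.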
79
+
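The model returns its answer as a flattened table string such as `col : count(*) row 1 : 11`. A small sketch of turning that string back into a DataFrame is shown below. `parse_answer_table` is a hypothetical helper, not part of `transformers` or the original repository. It assumes cells are separated by `|`, that rows start with `row N :`, and that every row has as many cells as the header. Cell values are kept as strings.

```python
import re

import pandas as pd


def parse_answer_table(flat: str) -> pd.DataFrame:
    """Parse 'col : a | b row 1 : x | y row 2 : ...' into a DataFrame.

    Hypothetical helper; it will misparse cells that themselves contain
    '|' or text that looks like a 'row N :' marker.
    """
    # Split off the header, then one chunk per 'row N :' marker.
    header_part, *row_parts = re.split(r"\s*row \d+ : ", flat.strip())
    header_part = re.sub(r"^col\s*:\s*", "", header_part)
    columns = [col.strip() for col in header_part.split("|")]
    rows = [[cell.strip() for cell in part.split("|")] for part in row_parts]
    return pd.DataFrame(rows, columns=columns)


answer = parse_answer_table("col : count(*) row 1 : 11")
print(answer)
```

With the model from the example, the input would be each element of `tokenizer.batch_decode(outputs, skip_special_tokens=True)`.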
80
+ ### How to Fine-tune
81
+
82
+ Please find the fine-tuning script [here](https://github.com/kolk/MultiTabQA).
83
+
84
+ ### BibTeX entry and citation info
85
+
86
+ ```bibtex
87
+ @misc{pal2023multitabqa,
88
+ title={MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering},
89
+ author={Vaishali Pal and Andrew Yates and Evangelos Kanoulas and Maarten de Rijke},
90
+ year={2023},
91
+ eprint={2305.12820},
92
+ archivePrefix={arXiv},
93
+ primaryClass={cs.CL}
94
+ }
95
+ ```