File size: 6,537 Bytes
fa778e5
 
 
bd56e01
 
 
fa778e5
 
9b6d20d
 
 
 
 
 
 
 
 
 
 
 
9b15226
 
 
75ee931
 
 
 
9477969
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
75ee931
 
 
 
 
9477969
75ee931
9477969
75ee931
9477969
 
 
 
 
 
 
 
75ee931
 
9477969
75ee931
9477969
75ee931
 
 
9477969
75ee931
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
34b6dc2
75ee931
 
 
 
 
 
 
cbda07a
 
 
75ee931
 
 
 
 
 
b347b07
75ee931
 
 
 
9477969
 
7c7f877
 
 
 
57fed75
 
 
 
 
 
 
 
75ee931
 
 
 
 
34b6dc2
75ee931
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cbda07a
 
f08147b
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
---
datasets:
- spider
- spider-Syn
metrics:
- exact_match
language:
- en
results:
  - task:
      type: text-2-sql
      name: Text to SQL
    dataset:
      type: spider
      name: Spider
      split: validation
    metrics:
    - type: exact_match
      value: 0.492

pipeline_tag: text2text-generation
tags:
- text2sql
---

# T5 large LM Adapt for Text to SQL

This model is purposed to generate structured SQL queries from the natural-language prompts.

### Intro

In the Text2SQL task, the model learns how to generate a SQL query based on the question posed in natural language. 
However, in some cases, the SQL query contains unknown columns etc., and altogether does not take the schema of the specific database into account. 

That is where our approach comes in. 
We incorporated the database schema into the input question while training to specify which columns and relations are available to generate an applicable SQL query.

The exposition of database schema, together with the prompt, allows the model to learn the mapping of the schema to the expected output.
This allows the model to better generalize to the schemas that were not present in the training data.

### Base model

We fine-tune this model from the [t5-large-LM-adapt](https://huggingface.co/google/t5-large-lm-adapt) checkpoint.

## Spider and Spider-Syn dataset

The model was fine-tuned on the training splits of [Spider](https://yale-lily.github.io/spider) and [Spider-Syn](https://github.com/ygan/Spider-Syn/tree/main/Spider-Syn) datasets. Instead of using only the questions, we added the database schema to the question, as we wanted the model to generate a question over a given database

_Input prompt_:

```python
Question:  What is the average, minimum, and maximum age for all French musicians?
Schema: "stadium" "Stadium_ID" int , "Location" text , "Name" text , "Capacity" int , "Highest" int , "Lowest" int ,
        "Average" int , foreign_key:  primary key: "Stadium_ID" [SEP] "singer" "Singer_ID" int , "Name" text , "Country" text ,
        "Song_Name" text , "Song_release_year" text , "Age" int , "Is_male" bool ,
        foreign_key:  primary key: "Singer_ID" [SEP],
        "concert" "concert_ID" int , "concert_Name" text , "Theme" text , "Year" text , foreign_key: "Stadium_ID" text from "stadium",
        "Stadium_ID" , primary key: "concert_ID" [SEP] "singer_in_concert",
        foreign_key: "concert_ID" int from "concert",
        "concert_ID" , "Singer_ID" text from "singer" "Singer_ID" , primary key: "concert_ID" "Singer_ID"
```

_Expected output_:

```sql
SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'
```

When evaluating the output, we query the _SQLite_ database and get:
```
[[34.5, 25, 43]]
```

## Format of the database schema

The standardized database schema the model was trained on:

```
table_name column1_name column1_type column2_name column2_type ... foreign_key: FK_name FK_type from table_name column_name primary key: column_name [SEP]
table_name2 ...
```

## Usage

Here is how to use this model to answer the question on a given context using 🤗 Transformers in PyTorch:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = 'gaussalgo/T5-LM-Large-text2sql-spider'
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

question = "What is the average, minimum, and maximum age for all French musicians?"
schema = """
   "stadium" "Stadium_ID" int , "Location" text , "Name" text , "Capacity" int , "Highest" int , "Lowest" int , "Average" int , foreign_key:  primary key: "Stadium_ID" [SEP] "singer" "Singer_ID" int , "Name" text , "Country" text , "Song_Name" text , "Song_release_year" text , "Age" int , "Is_male" bool , foreign_key:  primary key: "Singer_ID" [SEP] "concert" "concert_ID" int , "concert_Name" text , "Theme" text , "Year" text , foreign_key: "Stadium_ID" text from "stadium" "Stadium_ID" , primary key: "concert_ID" [SEP] "singer_in_concert"  foreign_key: "concert_ID" int from "concert" "concert_ID" , "Singer_ID" text from "singer" "Singer_ID" , primary key: "concert_ID" "Singer_ID"
"""

input_text = " ".join(["Question: ",question, "Schema:", schema])

model_inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**model_inputs, max_length=512)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print("SQL Query:")
print(output_text)
```
outputs:
```sql
SQL Query:
SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'
```

## Evaluation
Evaluation was done on the dev split of the Spider and Spider-syn dataset. The databases present in the dev split have no intersection with the databases of the train split.
This way we ensure, that the model was not exposed to the evaluated databases during training.
The evaluation was done by comparing the results of querying the database using the generated query and reference.
Both Spider and Spider-Syn dev splits have 1032 samples.
* **Spider dev accuracy:** 49.2%
* **Spider Syn dev accuracy:** 39.5%


## Training

The model has been trained using [Adaptor library](https://github.com/gaussalgo/adaptor) 0.2.1, on training splits of Spider and Spider-syn datasets with the following parameters:

```python
training_arguments = AdaptationArguments(output_dir="train_dir",
                                         learning_rate=5e-5,
                                         stopping_strategy=StoppingStrategy.ALL_OBJECTIVES_CONVERGED,
                                         stopping_patience=8,
                                         save_total_limit=8,
                                         do_train=True,
                                         do_eval=True,
                                         bf16=True,
                                         warmup_steps=1000,
                                         gradient_accumulation_steps=8,
                                         logging_steps=10,
                                         eval_steps=200,
                                         save_steps=1000,
                                         num_train_epochs=10,
                                         evaluation_strategy="steps")

```
The training is fairly easy to reproduce, but we do not wish to publish modified copies of the Spider datasets that it depends on. 
If you'd like to investigate further in this direction, feel free to get in touch through a new PR, or via email to stefanik(at)gaussalgo.com.