---
datasets:
- spider
language:
- en
pipeline_tag: text2text-generation
tags:
- text2sql
---
# T5 large LM Adapt for Text to SQL
This model is fine-tuned from the t5-large-LM-adapt checkpoint. When trained on the Text2SQL task alone, a model learns to generate a SQL query from a question posed in natural language; however, the generated query may reference unknown columns and generally does not take the schema of the specific database into account. That is where our approach comes in: we incorporate the database schema into the input during training, specifying which columns and relations are available, so that the model generates a SQL query that is actually applicable to the database.
## Spider and Spider-Syn dataset
The model was fine-tuned on the training splits of the Spider and Spider-Syn datasets. Instead of using only the questions as inputs, we appended the database schema to each question, since we wanted the model to generate a query over the given database.
Input:

```
Question: What is the average, minimum, and maximum age for all French musicians?
Schema: "stadium" "Stadium_ID" int , "Location" text , "Name" text , "Capacity" int , "Highest" int , "Lowest" int , "Average" int , foreign_key: primary key: "Stadium_ID" [SEP] "singer" "Singer_ID" int , "Name" text , "Country" text , "Song_Name" text , "Song_release_year" text , "Age" int , "Is_male" bool , foreign_key: primary key: "Singer_ID" [SEP] "concert" "concert_ID" int , "concert_Name" text , "Theme" text , "Year" text , foreign_key: "Stadium_ID" text from "stadium" "Stadium_ID" , primary key: "concert_ID" [SEP] "singer_in_concert" foreign_key: "concert_ID" int from "concert" "concert_ID" , "Singer_ID" text from "singer" "Singer_ID" , primary key: "concert_ID" "Singer_ID"
```

Target:

```sql
SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'
```
During evaluation, we run the generated query against the SQLite database, which returns:

```
[[34.5, 25, 43]]
```
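As an illustration, an evaluation step like this can be reproduced with Python's built-in `sqlite3` module; the database path below is a hypothetical placeholder for the Spider `concert_singer` database:

```python
import sqlite3

# Hypothetical path to the Spider "concert_singer" database file.
db_path = "concert_singer.sqlite"

# The SQL query generated by the model for the example question.
query = "SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'"

with sqlite3.connect(db_path) as connection:
    result = connection.execute(query).fetchall()

print(result)  # e.g. [(34.5, 25, 43)]
```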
## Format of the database schema
The standardized database schema format the model was trained on:

```
table_name column1_name column1_type column2_name column2_type ... foreign_key: FK_name FK_type from table_name column_name primary key: column_name [SEP]
table_name2 ...
```
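To make the format concrete, here is a minimal sketch of a serializer producing this layout; the function name and the dictionary-based schema representation are illustrative assumptions, not part of the model's code:

```python
def serialize_schema(tables: list) -> str:
    """Serialize table definitions into the flat schema format above.

    Each table is assumed to be a dict such as:
    {"name": "concert",
     "columns": [("concert_ID", "int"), ("Theme", "text")],
     "foreign_keys": [("Stadium_ID", "text", "stadium", "Stadium_ID")],
     "primary_keys": ["concert_ID"]}
    """
    parts = []
    for table in tables:
        tokens = ['"%s"' % table["name"]]
        for column, column_type in table["columns"]:
            tokens.append('"%s" %s ,' % (column, column_type))
        tokens.append("foreign_key:")
        for fk, fk_type, ref_table, ref_column in table["foreign_keys"]:
            tokens.append('"%s" %s from "%s" "%s" ,' % (fk, fk_type, ref_table, ref_column))
        tokens.append("primary key:")
        tokens.extend('"%s"' % pk for pk in table["primary_keys"])
        parts.append(" ".join(tokens))
    return " [SEP] ".join(parts)
```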
## Usage
Here is how to use this model to answer a question over a given database schema, using 🤗 Transformers in PyTorch:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = 'gaussalgo/T5-LM-Large-text2sql-spider'
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

question = "What is the average, minimum, and maximum age for all French musicians?"
# Single outer quotes: the serialized schema itself contains double quotes.
schema = '"stadium" "Stadium_ID" int , "Location" text , "Name" text , "Capacity" int , "Highest" int , "Lowest" int , "Average" int , foreign_key: primary key: "Stadium_ID" [SEP] "singer" "Singer_ID" int , "Name" text , "Country" text , "Song_Name" text , "Song_release_year" text , "Age" int , "Is_male" bool , foreign_key: primary key: "Singer_ID" [SEP] "concert" "concert_ID" int , "concert_Name" text , "Theme" text , "Year" text , foreign_key: "Stadium_ID" text from "stadium" "Stadium_ID" , primary key: "concert_ID" [SEP] "singer_in_concert" foreign_key: "concert_ID" int from "concert" "concert_ID" , "Singer_ID" text from "singer" "Singer_ID" , primary key: "concert_ID" "Singer_ID"'

input_text = " ".join(["Question:", question, "Schema:", schema])

model_inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**model_inputs, max_length=512)
# generate() returns a batch of token sequences; decode the first one.
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("SQL Query:")
print(output_text)
```
which prints:

```
SQL Query:
SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'
```
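The same generation can also be run through the 🤗 Transformers `pipeline` API; `input_text` below is the "Question: ... Schema: ..." string constructed above, and `max_length` mirrors the example:

```python
from transformers import pipeline

generator = pipeline("text2text-generation", model="gaussalgo/T5-LM-Large-text2sql-spider")

prediction = generator(input_text, max_length=512)
print(prediction[0]["generated_text"])
```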
## Training
The model has been trained using the Adaptor library v0.2.1 on the training splits of the Spider and Spider-Syn datasets, with the following parameters:
```python
from adaptor.utils import AdaptationArguments, StoppingStrategy

training_arguments = AdaptationArguments(output_dir="train_dir",
                                         learning_rate=5e-5,
                                         stopping_strategy=StoppingStrategy.ALL_OBJECTIVES_CONVERGED,
                                         stopping_patience=8,
                                         save_total_limit=8,
                                         do_train=True,
                                         do_eval=True,
                                         bf16=True,
                                         warmup_steps=1000,
                                         gradient_accumulation_steps=8,
                                         logging_steps=10,
                                         eval_steps=200,
                                         save_steps=1000,
                                         num_train_epochs=10,
                                         evaluation_strategy="steps")
```
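For context, a rough sketch of how these arguments could be wired into an Adaptor training run is shown below; the data paths and batch size are placeholders, and the objective setup is an assumption based on the Adaptor documentation rather than the authors' exact training script:

```python
from adaptor.adapter import Adapter
from adaptor.lang_module import LangModule
from adaptor.objectives.seq2seq import Sequence2Sequence
from adaptor.schedules import ParallelSchedule

# Placeholder files: one serialized "Question: ... Schema: ..." input per line,
# aligned with the corresponding gold SQL queries.
lang_module = LangModule("google/t5-large-lm-adapt")
objective = Sequence2Sequence(lang_module,
                              texts_or_path="train_inputs.txt",
                              labels_or_path="train_sql.txt",
                              batch_size=4)

schedule = ParallelSchedule(objectives=[objective], args=training_arguments)
Adapter(lang_module, schedule, args=training_arguments).train()
```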