Why are the prompt and schema generated as part of the model output?

#3
by Aiforfun - opened

So far I am pretty satisfied with the accuracy of the model, but one thing annoying me is that the prompt and schema are part of the model output.
What if my schema is huge, around 250 columns?
The output is:
CREATE TABLE stadium (
    stadium_id number,
    location text,
    name text,
    capacity number,
    highest number,
    lowest number,
    average number
)

CREATE TABLE singer (
    singer_id number,
    name text,
    country text,
    song_name text,
    song_release_year text,
    age number,
    is_male others
)

CREATE TABLE concert (
    concert_id number,
    concert_name text,
    theme text,
    stadium_id text,
    year text
)

CREATE TABLE singer_in_concert (
    concert_id number,
    singer_id text
)

-- Using valid SQLite, answer the following questions for the tables provided above.

-- What is the maximum, the average, and the minimum capacity of stadiums ?

SELECT MAX(capacity), AVG(capacity), MIN(capacity) FROM stadium
The expected output is:
only the generated SQL query: SELECT MAX(capacity), AVG(capacity), MIN(capacity) FROM stadium
How can I get only the SQL query to be generated?

NumbersStation org

Hi @Aiforfun ,

Thanks for your interest in our work. We've provided several examples of using our model in the real world and of how to get only the generated SQL. Please check here: https://github.com/NumbersStationAI/NSQL/tree/main/examples

Hi @senwu ,
Thanks for the reply.
My question was more related to the DB schema: if I have more than 100 DB tables, providing the schema for all of them makes the prompt very large, the model takes a very long time to generate the query, and model.generate(input_ids, max_length=600) returns a huge number of tokens that belong to the DB schema, which I then have to strip out to print only the generated SQL.
Other LLMs don't generate the given prompt back; they generate only the SQL. So is there any option to generate only the SQL and not the given prompt?
Is there a better way to provide the schema for a larger database with more than 100 big tables?

NumbersStation org

Hi @Aiforfun ,

The input sequence length affects the model's generation time: a longer input sequence takes longer to process, assuming the output sequence has a similar length. Other CausalLM-based FMs also return the original input as the prefix of the generated sequence, but you can control the maximum token length for the newly generated output, and you can print only the newly generated output by filtering out the original input sequence. Here is an example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-llama-2-7B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-llama-2-7B", torch_dtype=torch.bfloat16)

text = """CREATE TABLE stadium (
    stadium_id number,
    location text,
    name text,
    capacity number,
)

-- Using valid SQLite, answer the following questions for the tables provided above.

-- how many stadiums in total?

SELECT"""

input_ids = tokenizer(text, return_tensors="pt").input_ids

# generated_ids = model.generate(input_ids, max_length=500)
generated_ids = model.generate(input_ids, max_new_tokens=500) # <---- use max_new_tokens to control the maximum number of tokens to generate, ignoring the number of tokens in the prompt.
# print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
print(tokenizer.decode(generated_ids[0][len(input_ids[0]):], skip_special_tokens=True)) # <---- filter out the input sequence

Hi @senwu ,
Awesome, this is what I was looking for. Much appreciated; thanks for your prompt response.
I have another issue: since I have more than 100 DB tables and the model requires the table schemas in the prompt, the prompt becomes very large when it holds the schema for all 100 tables.
I am trying prompt chunking but facing some issues getting it to work. Can you please guide me on how to chunk a very large prompt with the NumbersStation/nsql-llama-2-7B model, or suggest a better approach to handle very large prompts?

NumbersStation org

The NumbersStation/nsql-llama-2-7B model supports a context length of 4K tokens, so you need to fit your prompt within that restriction. In practice, we recommend filtering out irrelevant tables and columns with some retrieval mechanism to make the prompt fit into the model. Hope it helps.

Hello @senwu ,
Thanks for the clarification.
Could you please give some guidance or a pointer on "we recommend filtering out irrelevant tables and columns with some retrieval mechanism to make the prompt fit into the model"?
How can we do it?

NumbersStation org

One option is to encode the tables/columns with sentence_transformers and filter them by similarity score.
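To make the idea concrete, here is a minimal sketch of that retrieval step. A simple word-overlap (Jaccard) score stands in for real sentence_transformers embeddings so the example is self-contained; the helper names `jaccard` and `rank_tables` are illustrative, not part of any NSQL API. In production you would replace the scoring with `model.encode(...)` and cosine similarity from the sentence-transformers package.

```python
# Toy sketch of schema retrieval: score each table's DDL against the
# question and keep only the top-k most relevant tables for the prompt.
# Word-overlap scoring stands in for real embedding similarity.

def jaccard(a: set, b: set) -> float:
    """Overlap score between two token sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_tables(question: str, schemas: dict, top_k: int = 2) -> list:
    """Return the top_k table names whose DDL best matches the question."""
    q_tokens = set(question.lower().replace("?", "").split())
    scored = []
    for name, ddl in schemas.items():
        t_tokens = set(
            ddl.lower().replace("(", " ").replace(")", " ").replace(",", " ").split()
        )
        scored.append((jaccard(q_tokens, t_tokens), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

schemas = {
    "stadium": "CREATE TABLE stadium (stadium_id number, name text, capacity number)",
    "singer": "CREATE TABLE singer (singer_id number, name text, age number)",
    "concert": "CREATE TABLE concert (concert_id number, stadium_id number, year text)",
}
picked = rank_tables("What is the maximum capacity of stadiums?", schemas, top_k=1)
print(picked)  # the stadium table should score highest
```

Only the DDL of the selected tables then goes into the prompt, keeping it within the 4K context limit.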

Hello @senwu ,
Thanks for the pointer.
What is the right prompt format for few-shot learning with the NumbersStation/nsql-llama-2-7B model?
It is not provided in the model card.

NumbersStation org

This particular model is intended for the 0-shot scenario, which is the most common real-world use case. If you're conducting experiments and wish to include demonstrations from the same database, you can add several question/SQL pairs at the end. If you have any suggestions on how to enable multi-database demonstrations, please let us know, and we'll try to incorporate them in the next release.
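The suggestion above (appending question/SQL pairs from the same database) could be sketched as follows. The helper name `build_few_shot_prompt` and the exact spacing are assumptions for illustration, following the zero-shot prompt layout used elsewhere in this thread rather than an official few-shot format.

```python
# Sketch: build a few-shot prompt by appending demonstration question/SQL
# pairs after the schema and instruction, before the final question.

SCHEMA = """CREATE TABLE stadium (
    stadium_id number,
    name text,
    capacity number
)"""

INSTRUCTION = "-- Using valid SQLite, answer the following questions for the tables provided above."

def build_few_shot_prompt(schema: str, examples: list, question: str) -> str:
    """examples: list of (question, full SQL) demonstration pairs."""
    parts = [schema, "", INSTRUCTION, ""]
    for demo_q, demo_sql in examples:
        parts += [f"-- {demo_q}", "", demo_sql, ""]
    parts += [f"-- {question}", "", "SELECT"]
    return "\n".join(parts)

demos = [("how many stadiums are there?", "SELECT COUNT(*) FROM stadium")]
prompt = build_few_shot_prompt(SCHEMA, demos, "what is the maximum capacity?")
print(prompt)
```

The prompt still ends with the bare `SELECT` prefix, so generation continues the final query rather than re-answering the demonstrations.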

Hi, will join queries between tables work? I tried but didn't succeed.

NumbersStation org

Yes, the model supports joining over multiple tables. You could adjust the prompt to give the model more information.

Thanks for your reply. Can you please share an example? When I tried, I wasn't able to make that work.

NumbersStation org

Here is an example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-llama-2-7B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-llama-2-7B", torch_dtype=torch.bfloat16)

text = """CREATE TABLE singer (
    singer_id number,
    name text,
    birth_year number,
    net_worth_millions number,
    citizenship text
)

CREATE TABLE song (
    song_id number,
    title text,
    singer_id number,
    sales number,
    highest_position number
)


-- Using valid SQLite, answer the following questions for the tables provided above.

-- For each singer name, what is the total sales for their songs?

SELECT"""

input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

The output would be something like this:

CREATE TABLE singer (
    singer_id number,
    name text,
    birth_year number,
    net_worth_millions number,
    citizenship text
)

CREATE TABLE song (
    song_id number,
    title text,
    singer_id number,
    sales number,
    highest_position number
)


-- Using valid SQLite, answer the following questions for the tables provided above.

-- For each singer name, what is the total sales for their songs?

SELECT T2.name, SUM(T1.sales) FROM song AS T1 JOIN singer AS T2 ON T1.singer_id = T2.singer_id GROUP BY T2.name

Hi @senwu ,
I see the model is facing challenges generating correct SQL queries when some sort of calculation is needed. I tried to generate a SQL query that should calculate a ratio in the WHERE clause, but the ratio is not being calculated.
CREATE TABLE work_order (
    id NUMBER,
    property_id NUMBER,
    cost FLOAT,
    invoice_amount FLOAT,
    entered_date DATE,
    due_date DATE,
    complete_date DATE,
    total_workorder NUMBER,
    completed_workorder NUMBER
)

CREATE TABLE property (
    id NUMBER,
    property_name TEXT,
    area FLOAT,
    owner_id NUMBER,
    city TEXT,
    country TEXT
)

CREATE TABLE owner (
    id NUMBER,
    name TEXT,
    salary FLOAT
)

-- Using valid SQLite, answer the following questions for the tables provided above.

-- find property id for which work order ratio is higher than 2

SELECT property_id FROM work_order GROUP BY property_id HAVING COUNT(*) > 2

The correct SQL should be: SELECT property_id FROM work_order WHERE (total_workorder - completed_workorder) / total_workorder > 2
Could you please help me understand what is wrong and what I should do?

NumbersStation org

Try this modified version?

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-llama-2-7B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-llama-2-7B", torch_dtype=torch.bfloat16).to(0)

text = """CREATE TABLE work_order (
    id NUMBER,
    property_id NUMBER,
    cost FLOAT,
    invoice_amount FLOAT,
    entered_date DATE,
    due_date DATE,
    complete_date DATE,
    total_workorder NUMBER,
    completed_workorder Number
)

CREATE TABLE property (
    id NUMBER,
    property_name TEXT,
    area FLOAT,
    owner_id NUMBER,
    city TEXT,
    country TEXT
)

CREATE TABLE owner (
    id NUMBER,
    name TEXT,
    salary FLOAT
)

-- Using valid SQLite, answer the following questions for the tables provided above.

-- find property id for which completed work order ratio is lower than 98
"""

input_ids = tokenizer(text, return_tensors="pt").input_ids.to(0)

generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

Output:

SELECT property_id FROM work_order WHERE completed_workorder / total_workorder < 98

Hi @senwu ,
Sorry for the confusion.
My question was: -- find property id for which non-completed work order ratio is higher than 2
So the generated query should be SELECT property_id FROM work_order WHERE (total_workorder - completed_workorder) / total_workorder > 2
But the model-generated query is SELECT property_id FROM work_order WHERE completed_workorder / total_workorder > 2

NumbersStation org

Hi @Aiforfun ,
After another look at your example, I think the issue is with the schema. The work_order table is at the granularity of a work order, meaning each row is one work order. The total_workorder NUMBER column measures the total work orders for an entire property, while completed_workorder NUMBER is effectively a boolean over a single work order, so these two attributes don't type-check with the schema. Instead, you can have an is_completed_workorder NUMBER flag to indicate each work order's status and compute totals by aggregating over rows. Here is the updated schema:

CREATE TABLE work_order (
    id NUMBER,
    property_id NUMBER,
    cost FLOAT,
    invoice_amount FLOAT,
    entered_date DATE,
    due_date DATE,
    complete_date DATE,
    is_completed_workorder NUMBER
)
CREATE TABLE property (
    id NUMBER,
    property_name TEXT,
    area FLOAT,
    owner_id NUMBER,
    city TEXT,
    country TEXT
)
CREATE TABLE owner (
    id NUMBER,
    name TEXT,
    salary FLOAT
)

And for the question about Give me the ratio of not completed to total work orders, the model outputs:

SELECT (SELECT COUNT(*) FROM work_order WHERE is_completed_workorder = 0) / (SELECT COUNT(*) FROM work_order)

Hi @senwu ,
I want to train NumbersStation/nsql-llama-2-7B on my own dataset, so I have to prepare my own dataset similar to the one you shared on GitHub.

So what would be the right approach: pretrain and then fine-tune, or directly fine-tune as in the fine-tune notebook you shared on GitHub?

Which hyperparameters should I consider in fine-tuning to get the best accuracy in SQL generation?

NumbersStation org

You can directly fine-tune with your own instruct data. You can do full fine-tuning or parameter-efficient fine-tuning (e.g., LoRA (notebook)). You will probably need to tune the learning rate (for example, 1e-5) and batch size based on your task.
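One hedged sketch of preparing instruct data for such fine-tuning: pair the zero-shot prompt (schema, instruction, question) with the target SQL as the completion. The field names "prompt"/"completion" and this exact layout are assumptions for illustration only, not the official dataset format from the NSQL repo; check the shared notebook for the format it actually expects.

```python
# Sketch: turn (schema, question, sql) triples into JSONL training records
# whose prompt matches the model's inference-time prompt layout.
import json

INSTRUCTION = "-- Using valid SQLite, answer the following questions for the tables provided above."

def make_record(schema: str, question: str, sql: str) -> dict:
    prompt = f"{schema}\n\n{INSTRUCTION}\n\n-- {question}\n\n"
    return {"prompt": prompt, "completion": sql}

record = make_record(
    "CREATE TABLE stadium (\n    stadium_id number,\n    capacity number\n)",
    "how many stadiums in total?",
    "SELECT COUNT(*) FROM stadium",
)
line = json.dumps(record)  # one JSONL line per training example
print(line)
```

Keeping the training prompt identical to the inference prompt avoids a train/serve mismatch, which matters more for accuracy than most hyperparameter choices.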

Hi @senwu ,
What is the right prompt format for the NumbersStation/nsql-llama-2-7B model? For example, how can I tell it not to use LEFT JOIN?

NumbersStation org

You can find the prompt format in the model README.

Hi @senwu ,

  1. The NumbersStation/nsql-llama-2-7B model is not making joins at all for me. My tables contain string dtypes and require no calculation. When I ask for multiple variables from multiple tables, it hallucinates and links all variables to a single table using only that table's primary key. How do I ensure it makes joins properly?
  2. How do I add more information, such as column metadata/descriptions, to the prompt? Could you provide an example? Right now my prompt is exactly in the prompt format from the model README that you provided.
  3. If the question asked contains a variable that is present in multiple tables, can the output contain all of those tables instead of just one? Currently, it gives me the variable from just one table.
    Looking forward to your reply. Thanks in advance.

It would be great if the model supported few-shot learning. It would allow the system to learn from previous mistakes, where users can guide the model on what SQL they expect. Please let me know how I can help. Can I provide training examples?

NumbersStation org

Yes, it would be great if you could share training examples.
