Difficult prompt you may want to add to training data

#1
by tordbb - opened

Thanks for an impressive sql coder!

Although it came closer than many other sql coder models I have seen on huggingface, it still did not quite nail the prompt below, which GPT-3.5 seems to nail.
Feel free to add the prompt to your training data if interested in improving the model further.

Question:
"Among the artists having concerts in year 2020, which artist has a song whose title equals the name of the stadium on which the concert takes place?"

Database Schema:
"""
CREATE TABLE stadium (
stadium_id number,
location text,
name text,
capacity number,
highest number,
lowest number,
average number
)

CREATE TABLE singer (
singer_id number,
name text,
country text,
song_name text,
song_release_year text,
age number,
is_male others
)

CREATE TABLE concert (
concert_id number,
concert_name text,
theme text,
stadium_id text,
year text
)

CREATE TABLE singer_in_concert (
concert_id number,
singer_id text
)
"""

Invalid response, as retreived at https://defog.ai/sqlcoder-demo/:
"""
SELECT singer.name
FROM singer_in_concert join singer on singer_in_concert.singer_id = singer.singer_id join concert on singer_in_concert.concert_id = concert.concert_id
WHERE concert.year = '2020'
and singer.song_name = stadium.name
GROUP BY singer.name;
"""

Valid(?) response, as given by GPT-3.5:
"""
SELECT s.name AS artist_name
FROM singer s
JOIN singer_in_concert sic ON s.singer_id = sic.singer_id
JOIN concert c ON sic.concert_id = c.concert_id
JOIN stadium st ON c.stadium_id = st.stadium_id
WHERE c.year = '2020' AND s.song_name = st.name;
"""

Defog.ai org

HI there, thank you for testing out the model – and for opening an issue! We will definitely incorporate queries like this (without column descriptions of join hints) in our training data in the future. Currently, our model performs best when column descriptions and join hints are defined.

This is because – unlike gpt-3.5-turbo – we have not yet done instruction fine-tuning with RLHF or Reward Modeling. We will do that soon.

With the metadata defined like below, the model does give the right answer to this question. But in the future, we hope to start returning high quality without any metadata definitions.

Query generated with additional metadata defined

SELECT singer.name,
       singer.song_name
FROM   singer_in_concert join singer on singer_in_concert.singer_id = singer.singer_id join concert on singer_in_concert.concert_id = concert.concert_id join stadium on concert.stadium_id = stadium.stadium_id
WHERE  concert.year = '2020'
   and singer.song_name = stadium.name
ORDER BY singer.name;

Additional metadata

CREATE TABLE stadium (
  stadium_id number, --unique id for each stadium
  location text, --city and state
  name text, --name of the stadium
  capacity number, --maximum number of people that can be seated
  highest number, --highest number of people that have attended a concert
  lowest number, --lowest number of people that have attended a concert
  average number --average number of people that have attended a concert
)

CREATE TABLE singer (
  singer_id number, --unique id for each singer
  name text, --name of the singer
  country text, --country of origin of the singer
  song_name text, --name of the song
  song_release_year text, --year the song was released
  age number, --age of the singer
  is_male bool --whether or not the singer is male
)

CREATE TABLE concert (
  concert_id number, --unique id for each concert
  concert_name text, --name of the concert
  theme text, --theme of the concert
  stadium_id text, --id of the stadium where the concert takes place
  year text --year the concert takes place
)

CREATE TABLE singer_in_concert (
  concert_id number, --id of the concert
  singer_id text --id of the singer
)

-- stadium.stadium_id can be joined with concert.stadium_id
-- singer.singer_id can be joined with singer_in_concert.singer_id
-- concert.concert_id can be joined with singer_in_concert.concert_id

Sign up or log in to comment